Key Components of an RSP

This document lists some key components of a responsible scaling policy, with rough sample language for each.

In brief, the key components are:

  • Limits: which specific observations about dangerous capabilities would indicate that it is (or strongly might be) unsafe to continue scaling? More below.
  • Protections: what aspects of current protective measures are necessary to contain catastrophic risks? More below.
  • Evaluation: what are the procedures for promptly catching early warning signs that dangerous capability limits are being approached? More below.
  • Response: if dangerous capabilities go past the limits and it’s not possible to improve protections quickly, is the AI developer prepared to pause further capability improvements until protective measures are sufficiently improved, and treat any dangerous models with sufficient caution? More below.
  • Accountability: how does the AI developer ensure that the RSP’s commitments are executed as intended; that key stakeholders can verify that this is happening (or notice if it isn’t); that there are opportunities for third-party critique; and that changes to the RSP itself don’t happen in a rushed or opaque way? More below.

We also discuss some alternative ways one could construct an RSP.

Limits: which specific observations about dangerous capabilities would indicate that it is (or strongly might be) unsafe to continue scaling?

An RSP should list specific capabilities that would be beyond what current protective measures can safely handle, and specific observations (e.g., results of specific tests) that would indicate that there’s a substantial risk of these capabilities.

Dangerous capabilities will often go hand in hand with new beneficial capabilities, but may nonetheless require new protective measures.

It’s reasonable to build in some discretion and flexibility here - for example, something like “If an AI demonstrates capability X as measured by test Y, we will pause any state-of-the-art training runs for a temporary period (days or weeks) during which capabilities can be more thoroughly investigated and a process can be followed (e.g., a discussion with the governing board) to decide on whether to take more intense actions, such as limiting commercial deployment or pausing capability scaling until protective measures can be improved.” But there should be specific pre-specified observations that at least kick off an intensive investigation and consideration of taking major actions - the RSP shouldn’t just say something like “The following people will regularly discuss whether continued capability scaling is safe.”

Example RSP language that could address this:

The dangerous capabilities we are focused on - the ones we think are most likely to go beyond what we’re prepared to safely handle in the medium term - are bioweapons development and autonomous replication and adaptation.

For each dangerous capability, we list the protective measures we’d need to handle it safely (beyond what we currently have) and the early warning signs we use for it.

Bioweapons development
  • Dangerous capability: the ability to walk step-by-step through developing a bioweapon, such that the majority of people with any life sciences degree (using the AI) could be comparably effective at bioweapon development to what people with specialized PhDs (without AIs) are currently capable of.
  • Protective measures we’d need (beyond what we currently have): A reliable method of ensuring it isn’t possible for users (internal or external) to get the model to help with bioweapons development (e.g. via usage restrictions, monitoring, or training the model to refuse to provide relevant information, with very high reliability); AND information security that could reliably keep our weights from being stolen by a determined terrorist group, hacking collective or hostile employee, and that reliably prevents internal or external users from circumventing the safeguards above.
  • Early warning signs we use: We contract with a biosecurity expert red-teamer, who asks our model detailed questions about a biology task similar to designing a bioweapon (but less dangerous) and compares the model’s rate of answering correctly to a benchmark based on human experts. If the AI matches human performance, we will halt any training or deployment of AIs with this capability while we conduct further testing (such as comparing the ability of laypeople with and without the AI to complete relevant biology protocols) and determine whether additional protective measures are needed.

Autonomous replication and adaptation (ARA)
  • Dangerous capability: the ability of an AI to acquire resources, create copies of itself, and adapt to novel challenges it encounters in the wild.
  • Protective measures we’d need (beyond what we currently have): Information security that could reliably keep our weights from leaking to a moderately resourced state espionage program (such as Fancy Bear) or a malicious actor with lesser capabilities; AND measures to reliably detect and shut down attempts by non-exfiltrated copies of the model to accumulate resources (e.g. an LLM agent based on the API-hosted model obtaining many API keys, and subverting internal monitoring + detection efforts), before these get to a large scale; OR a case that society is equipped to handle and shut down autonomously replicating models. To the extent this relies on very widespread use of highly capable AI systems that keep each other in check, it would be necessary to have high confidence in alignment to avoid unintended AI behavior.
  • Early warning signs we use: We have a list of 10 specific tasks that are beyond today’s AI capabilities, but should be easily achievable for an ARA-capable agent. We regularly test our AIs to see whether they can complete these tasks. If we see an AI complete at least half of them, we will halt any training or deployment of AIs with this underlying capability while we conduct further testing and determine whether additional protective measures are needed.
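
To illustrate how mechanical this kind of trigger can be, here is a minimal Python sketch of the decision rule described for the ARA early warning sign: run a fixed suite of pre-registered tasks and flag a halt if the model completes at least half of them. The model interface, the task identifiers, and the run_task function are hypothetical placeholders, not any real evaluation harness.

```python
# Minimal illustrative sketch of a pre-committed ARA trigger rule.
# `model`, the task names, and `run_task` are hypothetical placeholders.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TaskResult:
    name: str
    completed: bool


def evaluate_ara_suite(model, tasks: List[str],
                       run_task: Callable[[object, str], bool]) -> dict:
    """Run each pre-registered ARA task and apply the pre-committed threshold."""
    results = [TaskResult(t, run_task(model, t)) for t in tasks]
    n_completed = sum(r.completed for r in results)
    # Pre-committed trigger: halt if the model completes at least half the tasks.
    halt = n_completed >= len(tasks) / 2
    return {"results": results,
            "n_completed": n_completed,
            "halt_training_and_deployment": halt}
```

The point of pre-registering both the task list and the threshold is that the halt decision does not depend on open-ended judgment at the moment the results come in.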

Why is this important? Making pre-commitments about specific observations has a number of advantages (compared to simply collecting information about AI capabilities and considering the situation open-endedly):

  • There are hard questions about which AI capabilities are most dangerous, how to catch early warning signs of them, and what protective measures would be needed to handle them safely. These questions could take quite a bit of analysis, dialogue, expert consultation, etc. Specific pre-commitments mean an AI developer has resolved its views on these questions enough to know (at least for the time being) where its “red lines” are.
  • Open-ended discussion is more likely to result in something like “Pausing only when the results are clear and visceral enough that no more analysis is needed” - which is likely to happen only when danger is already very high, leaving little slack.
  • We think it’s likely that some responses to dangerous capabilities (such as pausing further AI development until protective measures improve) would require a lot of preparation in advance to go smoothly. For example, it may be necessary to have clear plans for making money in the event of a capability scaling pause, and to have had clear communications with investors and employees about how this might play out.
  • AI developers are organizations with many stakeholders, with varying opinions about AI risks - and varying incentives. By default, it seems likely that different people at the company will have very different expectations about the conditions under which the company might take serious actions like de-deployment, capability scaling pauses, and stronger security measures. Stakeholders who don’t like a given decision might create a lot of resistance, including e.g. via lawsuits, making it harder to take important actions quickly and in a way that is consistent with the success of the company. (And if important actions are unnecessarily costly to the company, due to violating stakeholders’ expectations, this might create an excessive incentive to simply not take those actions.)
  • Specific pre-commitments can make the whole company aware that particular test results would require particular levels of protective measures. That seems like a more effective incentive to prioritize protective measures with urgency and focus than a general understanding (interpreted differently by different people) that the AI developer might pause deployment and/or capability scaling at some point.
  • RSPs with clear precommitments can serve as examples for other companies, and as “pilots” that can help inform and improve future more formal rules and norms, such as industry standards and regulation.

Protections: what aspects of current protective measures are necessary to contain catastrophic risks?

An RSP should specify whether there are any protective measures the AI developer has in place that are necessary for reasonable assurance of public safety - such that if the protective measures turned out not to be working as intended, it might be necessary to greatly ramp up restrictions on AI models (as outlined below).

We believe that today’s AI systems (as of late 2023) have limited enough capabilities that there are no such “protective measure requirements.” Accordingly, we think language like the following would be reasonable in an RSP today: While we implement a number of measures to reduce potential harms from our AI systems, including measures to ensure information security, alignment and usage monitoring, we do not believe that any of these measures are strictly necessary to avoid a substantial risk of a major catastrophe.

But future RSPs might include language like: We believe that it would be prohibitively difficult for any attacker other than a state actor to steal model weights, and that it would be at least moderately difficult and/or risky for a moderately resourced state actor such as Fancy Bear to steal model weights. If either of these ceased being true, our protective measures would not be sufficient to safely handle our most capable AI models.

Evaluation: what are the procedures for promptly catching early warning signs that dangerous capability limits are being approached?

An RSP should contain detailed procedures for promptly catching early warning signs of the dangerous capabilities covered in the previous section.

These should include measures for dealing with each of the following challenges:

  1. Leaving some margin for error. Early warning signs should be noticed before the dangerous capabilities in question emerge in a full-blown way. The RSP should explicitly take responsibility for this goal: it should describe what level of capabilities should not be created before protective measures are improved, and describe how the evaluation procedures will prevent that from happening. If the full-blown dangerous capabilities are later found, the lab has failed to meet the commitments in its RSP.
  2. Evaluating capabilities midway through training runs. If an AI developer trains a system that is going to be hugely more capable than its current state-of-the-art system, then it needs to find a way to evaluate capabilities midway through the training run, to avoid “overshooting” its dangerous capability limits (going from an AI that is below the danger threshold to one that is well above it).
  3. Getting a true representation of how capable its models could be in the wrong hands, not simply how capable they are in their exact current form. (a) If a malicious actor gains access to model weights, they might be able to do relatively cheap and simple things to (i) remove “harm refusal” training (e.g., train the AI to do harmful things when requested, rather than refusing), or (ii) make the model generally more capable and/or autonomous (for example, fine-tuning on tasks of interest, or using scaffolding setups like AutoGPT). (b) If an AI developer simply evaluates the capabilities of its current model as commercialized, it might drastically underestimate the capabilities the model could relatively easily gain. This is especially challenging when it comes to mid-training-run evaluations: by default, a checkpoint from partway through a training run will lack a lot of the improvements and augmentations that would apply to a commercialized model.
  4. Catching key capabilities improvements that happen via subtle mechanisms. Big training runs aren’t the only way to advance the frontier of AI capabilities. Any number of changes to how an AI is fine-tuned, prompted, etc. could have big effects on its capabilities. An AI might even develop dangerous new capabilities due to new tools/plugins, or elicitation methods, developed by third-party users rather than via any deliberate effort on the AI developer’s part. An RSP should provide confidence that background improvements in elicitation methods or other changes in the world (that occur between the time of the evaluation and the point at which the lab will be prepared to handle the dangerous capabilities) will not push model capabilities above the danger level.
  5. Lifetime of the evaluation. If there are dangerous capabilities that might not be possible to safely handle for several years, then even a model that is fairly far short of these capabilities might be unsafe (due to the possibility that its weights leak or are stolen, and future progress in post-training enhancements - such as new tools and elicitation techniques - makes the model attain these capabilities). So the size of the appropriate margin for error (#1 above) depends on the time until the AI developer can safely handle a dangerous model.

Example RSP language that could address this: We have a set of evaluations for dangerous capabilities, listed at [link], that are intended to be relatively easy to run, and relatively conservative (any AI that has the underlying dangerous capabilities in question - or seems predictably likely to manifest them with future advances in elicitation - should be likely to complete the tasks given our elicitation efforts at evaluation time).

We are confident that our evaluation procedure leaves enough margin to cover all the difficulties listed above (how much we scale the model between evaluations, differences between mid-training checkpoints and deployed models, all the finetuning, scaffolding, tools and resources that might be applied to the model, and predictable advances in elicitation techniques).

In order to accomplish this, we:

  • Make a “checkpoint copy” of the model being trained, at regular intervals during training. We space the intervals in order to ensure that we’re able to complete the evaluations before we reach a 4x scaleup from the last evaluated checkpoint (a rough scheduling sketch follows this example language).
  • Make a substantial effort to modify the “checkpoint copy” in order to make it as capable as possible, subject to a compute budget of 1% of the compute used to train it so far. This is likely to include a learning rate cooldown, instruction fine-tuning, and other elicitation specific to the evaluations. We have assigned a specific employee to own this process of “checkpoint copy” improvement, with [team] reporting to them.
  • Run the models on the evaluation tasks. If there are elicitation techniques that we think will be available relatively soon that we are not able to implement during the checkpoint copy improvement process, the team will adapt how the evaluations are run or adjust the threshold to account for this. For instance, if we do not have a good tool available in-house for model browsing, but expect that such a tool will soon be publicly available, we may “mock up” this tool when we run the evaluations (e.g. by having a human assist the model with browsing).
  • Have a researcher read over the transcripts for the model’s attempts at the task, and summarize why the model failed to complete the task. The team will produce a report summarizing how they ran the evaluations, and whether they believe the model failures represent real lack of capabilities, or whether they might be artifacts of the task setup or insufficient elicitation. We have a list of best practices for the evaluations at [link], which provide guidance on what constitutes sufficient elicitation.
  • Give a third party, [third party], access to the improved “checkpoint copy” model to see whether they are able to demonstrate dangerous capabilities beyond what we’ve estimated the model to be capable of. We do not wait on their results, but use them to help understand whether our elicitation was adequate, and how much margin we need to leave.
  • Every 3 months, or upon request from an employee or third party from our authorized list using [process], we evaluate our most capable models using a similar process to the one above. It is important that the versions of the models we evaluate have not been trained for harm refusal.

We believe the above measures are adequate to catch early warning signs before full-blown dangerous capabilities emerge. If key dangerous capabilities are seen in our models before we’ve developed the appropriate protective measures, even if this is partly due to advances in elicitation techniques outside of our lab, that represents a shortfall of our RSP.
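
As a rough illustration of the scheduling and budgeting logic above, the following Python sketch gates training on evaluations: it takes a checkpoint copy before effective compute reaches a 4x scaleup from the last evaluated checkpoint, spends roughly 1% of the compute used so far on improving that copy, and halts if the evaluations show early warning signs. All of the callbacks (train_step, take_checkpoint_copy, improve_copy, run_evaluations, effective_compute) are hypothetical, and the EVAL_HEADROOM margin is an assumption added here so that evaluations can finish before the 4x threshold is actually reached; it is not part of the example language itself.

```python
# Sketch of checkpoint-and-evaluate scheduling under the example language above.
# All callbacks are hypothetical; this is not a real training-loop integration.

EVAL_SCALEUP_FACTOR = 4.0           # evaluate before a 4x effective-compute scaleup
EVAL_HEADROOM = 0.9                 # assumed margin so evaluations finish in time
ELICITATION_BUDGET_FRACTION = 0.01  # 1% of compute used to train the checkpoint so far


def elicitation_budget(compute_so_far: float) -> float:
    """Compute budget for making the checkpoint copy as capable as possible."""
    return ELICITATION_BUDGET_FRACTION * compute_so_far


def gated_training_loop(train_step, take_checkpoint_copy, improve_copy,
                        run_evaluations, effective_compute, total_steps: int) -> str:
    """Pause to checkpoint and evaluate whenever effective compute approaches a
    4x scaleup from the last evaluated checkpoint; halt on early warning signs."""
    last_evaluated = effective_compute()
    for _ in range(total_steps):
        train_step()
        compute_now = effective_compute()
        if compute_now >= EVAL_HEADROOM * EVAL_SCALEUP_FACTOR * last_evaluated:
            copy = take_checkpoint_copy()
            improve_copy(copy, budget=elicitation_budget(compute_now))
            if not run_evaluations(copy):   # False => early warning signs observed
                return "halt"               # hand off to the response process
            last_evaluated = compute_now
    return "completed"
```

In practice the headroom would need to reflect how long the evaluations take to run relative to the pace of training, and the loop would sit alongside, not replace, the human review and reporting steps described above.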

Response: what happens if dangerous capabilities go past the limits and it’s not possible to improve protections quickly?

An RSP should state what the AI developer will do if dangerous capabilities go past what their protective measures can safely handle, and it’s not possible to improve protective measures relatively quickly. In this case, the default should be to pause further capability improvements until protective measures are sufficiently improved, as well as to exercise strong caution with any dangerous models.

The AI developer should have a vision for what its employees would be doing and how it would plan to sustain itself as a business in such a scenario.

Example RSP language that could address this:

If we find that one of our AI systems is plausibly capable of autonomous replication and adaptation (ARA), at a time when we are not in a position to promptly achieve the corresponding protective measures (see table above), then we will take the following measures to contain the risk:

  • The model in question is marked “handle with care.”
  • We do not make the model in question available except for research purposes, to people inside the company with a specific mandate to research dangerous models as well as trusted third parties with a similar mandate. Specifically, we do not offer the model commercially. (If the model is already in wide use at the time that we determine that it is plausibly ARA-capable, we aim to restrict its access as just outlined within a week.) All those who work with the model are informed of the intended restrictions on it, and are given a channel to register concerns with or without attribution if they believe the restrictions aren’t being followed.
  • We aim not to do further work to improve the capabilities of the “handle with care” model, except insofar as this is necessary for specific research that could help reduce risks. Accordingly, all fine-tuning runs need to include an explanation of why they are important for risk reduction and aren’t improving model capabilities more than needed, which will be reviewed by ___.
  • We aim not to develop any model that is significantly more capable than the “handle with care” model. This includes not initiating any new training runs with greater “effective compute” (see above) than the “handle with care” model, and not doing further work to improve the capabilities of the “handle with care” model.
  • We take the following other precautions with the “handle with care” model:
    • Code produced by the model is treated as untrusted. We aim never to run such code with access to resources that wouldn’t be available to a member of the public. We also disallow such code from being integrated into our codebase.
    • The model is not allowed to interact with any web service or otherwise autonomously interact with the outside world, or communicate with anyone other than those authorized to do research on it.
    • All queries to the model are logged. Logs are manually reviewed weekly by *** to understand how the model is being used internally, and *** summarizes their findings for review by ___.
    • We restrict internal access to the weights to the degree feasible while still allowing research on the model as outlined above.

By restricting deployment and improvement of our most capable systems, these measures could mean substantial changes to what some of our core employees work on, as well as to our financial prospects. We have drawn up a rough outline of what people formerly focused on capability improvement would be doing, and how we would be continuing to make money, in a separate document [link].

In the event that we have strong reason to think that other AI developers are moving forward with comparably dangerous AI models, and we have exhausted other avenues for reducing the associated risks, we might make an exception to the above plan and continue development - while working with states or other authorities to take immediate actions to limit scaling that would affect all AI developers (including us). We would consider this a dire situation. We would seek input on our choices from the US government, and would be explicit - with employees, our board, and state authorities - that our scaling was no longer safe, and that we should be accountable for the judgment to proceed.

[Similar language to the above could be used for other dangerous capabilities as well, including bioweapons development]

Why is this important? The protective measures required for some dangerous capabilities could be well beyond what a company can accomplish within a few weeks or even months, even with maximum effort. For example, information security that can protect against even moderately resourced state actors could take years to achieve; alignment research that can provide assurance that AI systems will behave as intended might not be feasible until there are major advances in the basic science of alignment.

For this reason, measures such as the above might be the only way for an AI developer to avoid a serious risk of causing a major catastrophe. It’s important for the AI developer to have a plan for such a scenario so that it isn’t trying to improvise such measures in the face of opposition from stakeholders.

Accountability

This section has three sub-sections:

  • Verification: how does the AI developer ensure that the RSP’s commitments are executed as intended, and that key stakeholders can verify that this is happening (or notice if it isn’t)?
  • External review: are there opportunities for third-party critique of the RSP and/or associated processes?
  • Revising the RSP: how does the AI developer ensure that changes to the RSP itself don’t happen in a rushed or opaque way?

Verification

An RSP should include measures for ensuring that its commitments are executed as intended, and that key stakeholders can verify that this is happening (or notice if it isn’t).

Example RSP language that could address this:

For each RSP component, we designate a responsible employee and describe how they will be responsible and accountable for that component.

Evaluation frequency: ensuring that evaluations are done periodically, on request by authorized employees, and at checkpoints during training, as specified above.
  • Responsible employee: [name]
  • The responsible employee writes a quarterly report on which evaluations were performed, when, and whether this complies with the intentions stated in the RSP. This report is shared with [list of employees] immediately, and is shared with [third party] within [delay], sometimes with redactions that the responsible employee deems necessary.

Evaluation execution: ensuring that evaluations assess what a model could be capable of (more above), and reporting results.
  • Responsible employee: [name]
  • Whenever evaluations have been completed, the responsible employee creates a brief internal writeup with the measures they’ve taken for model improvement prior to evaluations, and how they think their improved model’s capabilities compare to the capabilities it would have if (a) they had had another month to improve it, (b) it were a fully improved, commercialized model. They also discuss the results of the evaluation (including any noteworthy observations, even if not “pass/fail”) and how they relate to previous observations and future expectations about model capabilities. This report is shared with [list of employees] immediately, and is shared with [third party] within [delay], sometimes with redactions that the responsible employee deems necessary.

Protective measures
  • Responsible employee: [name]
  • The responsible employee writes a quarterly report on the state of our protective measures, and whether they meet the requirements outlined in this section. They explicitly note whether there is anything they were unable to address due to lack of access or cooperation. These reports are shared with [list of employees] immediately, and with [third party] within [delay], sometimes with redactions that the responsible employee deems necessary.

Responding to dangerous capabilities
  • Responsible employee: [name]
  • The responsible employee writes a quarterly report on whether we are prepared to make good on the commitments in this section, including whether our stakeholders (employees, investors, etc.) are aware of these commitments and their possible implications, and including whether we have any obligations to employees, investors, customers or others that such commitments would conflict with. They explicitly note whether there is anything they were unable to address due to lack of access or cooperation.
  • If dangerous capability thresholds are crossed, the responsible employee becomes responsible for the full set of commitments. Within one week of the threshold(s) being crossed, they write up a detailed plan (including e.g. who is responsible for what) for making good on these commitments, which is shared with [list of employees] as well as the organization’s CEO and board of directors. They then provide updates every two weeks afterward, up until the 3-month mark, at which point updates become monthly. These updates detail: (a) Whether, as far as they know, all of the commitments in this section were adhered to, (b) Plans for addressing any areas where adherence may have been incomplete, or may be incomplete going forward, and (c) Whether they had the access and information to be confident in their current understanding.
  • The responsible employee’s plan and reports are shared with [list of employees] immediately, and with [third party] within [delay], sometimes with redactions that the responsible employee deems necessary.

People with reservations about the claims in any of the reports referenced above are encouraged to share them directly with the responsible employee, or report them anonymously to the board using [process, including provisions for protection from retribution].

External Review

An RSP should contain measures for soliciting critique and accountability from parties outside of the organization.

Example RSP language that could address this:

We have solicited feedback on our RSP from the following people, whom we consider to be among the best-qualified parties outside of this organization to comment on it:

[list of ~5 people]

Each of these people agreed to spend 20 hours reviewing our RSP and providing feedback, and provided their thoughts in writing. Their submissions were reviewed by all of the “responsible employees” listed above as well as our board of directors. A redacted compilation of their feedback is available at [link].

Revising the RSP

It’s important to allow flexibility for changing circumstances and new information. As such, AI developers should reserve the option to change their policies. But they should also ensure that this doesn’t happen in an overly rushed or opaque way: changes should be clearly and widely announced to stakeholders, and there should be an opportunity for critique.

Example RSP language that could address this:

The current RSP is shared with the board and all employees. Employees with critiques, concerns, or reason to believe the RSP won’t or can’t be executed as intended can share thoughts directly with the board and a pre-selected set of other employees using [process], or anonymously using [other process that includes protections for anonymity and against retribution].

In order to revise the RSP, we must follow this process:

  • Disseminate the proposed new RSP to [list of employees] as well as [list of external parties].
  • Allow 2 weeks for them to share reactions and reservations, which can be done with or without attribution using [process, including protection against retribution].
  • Share the proposed changes along with employees’ reactions and reservations with the board of directors. Any director may ask for up to 2 weeks to review these and ask questions before voting to approve the changes.
  • Changes must be approved by a majority of the board of directors.

For minor and/or urgent changes, we may adopt the changes prior to review; in these cases, we require approval by a supermajority of our board. Review still takes place as described above, after the fact.

Notes on some alternative ways one could construct an RSP

The “example language” above tends to focus on the idea of consistently testing for dangerous capabilities. There could be alternative approaches to an RSP, including:

Less emphasis on testing, more emphasis on scaling up capabilities slowly and gradually. An RSP might have a central idea along the lines of: “We don’t scale up our models by more than 2x effective compute every 6 months, and we deploy models promptly and with few-to-no restrictions, so that our users have access comparable to what someone would have if they stole the weights. Because of this, we can get a sense of risks by observing models’ real-world interaction with society, and this means we don’t need to rely as much on testing.”

Such an RSP should still articulate dangerous capability limits and prepare for pausing deployment and/or capability scaling.
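
For concreteness, here is a small Python sketch of how a “no more than 2x effective compute every 6 months” rule could be checked. It assumes the cap compounds smoothly from a baseline; an RSP could instead define the rule over discrete 6-month windows, and should say which interpretation it means. The function names and the compounding interpretation are illustrative assumptions, not part of any existing RSP.

```python
# Illustrative sketch of a gradual-scaling cap. Assumes the 2x-per-6-months
# rule compounds smoothly: cap(t) = baseline * 2 ** (months_elapsed / 6).

def effective_compute_cap(baseline_compute: float, months_elapsed: float) -> float:
    """Maximum allowed effective compute under a 2x-per-6-months scaling rule."""
    return baseline_compute * 2 ** (months_elapsed / 6)


def run_is_allowed(proposed_compute: float, baseline_compute: float,
                   months_elapsed: float) -> bool:
    """Check a proposed training run against the scaling cap."""
    return proposed_compute <= effective_compute_cap(baseline_compute, months_elapsed)
```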

Staying behind the frontier. An RSP might have a central idea along the lines of: “We don’t train or deploy AI models that might have dangerous capabilities well beyond what others have already deployed. Because of this, we can focus on matching others’ standards for the protective measures needed to contain dangerous capabilities, rather than needing to perform our own testing for dangerous capabilities.” Such an RSP might be relatively simple and easy to execute, with most of the challenge being rigorously ensuring that the AI developer’s models are no more capable than other deployed models, and that its protective measures are comparable.