About the Project
The evaluations project at the Alignment Research Center is a new team building capability evaluations (and in the future, alignment evaluations) for advanced ML models.
We have been working with leading AI labs such as Anthropic and OpenAI to perform preliminary evaluations of their models, as practice building relevant institutional capacity and developing evaluation methodology. We will publish more information about the content of these evaluations in the next few months.
The evaluations we have performed for labs so far are not yet at the stage where they provide robust evidence that a model is safe. Lack of fine-tuning is a key limitation, but there’s a lot of work to do in general. We’re currently only assessing threat models that route through the model needing to operate as a stateful agent, accumulate resources, and make more copies of itself or otherwise become hard to shut down.
We’re working on improving the quality, speed and scope of our investigations. Ultimately we would like to see labs base decisions about whether to scale up or deploy models on the results of thorough safety evaluations.
Once AI systems are close to being capable enough to be an x-risk, there should be procedures that labs follow to ensure they don’t build or deploy existentially dangerous models. We should have thought through the plausible paths to models gaining power, and ensured that models are not capable of following them. We need good ways to measure both dangerous capabilities and alignment.
We can start preparing for this now by:
- evaluating capabilities of current models
- predicting how dangerous the next generation of models is likely to be
- assessing the quality of our prediction methods by using them to predict the properties of generation n+1 given generation n, for historical model generations
- working with labs to evaluate their models before they are deployed
- designing evaluations for alignment, deception, etc., to be run on future models
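As a rough illustration of the backtesting idea in the list above (predict generation n+1 from the generations up to n, then compare with what actually happened), here is a minimal sketch. The predictor, the function names, and the scores are all hypothetical placeholders invented for this example; they are not the project's methods or real measurements.

```python
from typing import Callable, List, Sequence


def linear_extrapolation(past: Sequence[float]) -> float:
    """Toy predictor (an assumption): continue the latest trend by one step."""
    if len(past) < 2:
        return past[-1]
    return past[-1] + (past[-1] - past[-2])


def backtest(
    predict: Callable[[Sequence[float]], float],
    scores: Sequence[float],
) -> List[float]:
    """Predict each generation from the ones before it; return absolute errors."""
    errors = []
    for n in range(1, len(scores)):
        predicted = predict(scores[:n])   # only sees generations before n
        errors.append(abs(predicted - scores[n]))
    return errors


# Made-up capability scores for four hypothetical model generations.
toy_scores = [10.0, 20.0, 35.0, 45.0]
errors = backtest(linear_extrapolation, toy_scores)
mean_error = sum(errors) / len(errors)
print(errors, mean_error)
```

A prediction method with consistently small backtest errors on past generations gives somewhat more reason to trust its forecast for the next one, although past accuracy does not guarantee future accuracy.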
Doing this now has some additional benefits:
- Evaluating capabilities of current models is helpful for predicting timelines, by giving us a more precise sense of what needs to change for models to be dangerous
- Specifying what we’d actually measure to determine the alignment of some model is helpful for clarifying our thoughts about how we expect misalignment to emerge, and what exactly the alignment problem is
- This work may produce demonstrations of models exhibiting competent and scary behavior. This can be useful if we think people are underestimating the capabilities of current or near-future models and making bad decisions because of that.
Our current goals are to work with cutting-edge models, produce evaluations for how far they are from x-risk capabilities, and share these evaluations with labs.
More specifically, we’re doing things like:
- designing and building scaffolding that lets generative language models act like agents over long time horizons:
  - scripts to automatically run model outputs in the shell or forward them as subtasks to other copies of the model
  - sampling multiple actions, having the model evaluate them and pick the best one
  - various ways of prompting and nudging the model that improve performance
- thinking about what a model would need to be able to do in order to successfully gain power, given current levels of caution and monitoring around AI systems. For instance: exfiltrate its weights via hacking or social engineering; make money via crime or labor; use this money to buy server time and start more copies of itself running on those servers
- assessing model performance on these subtasks, and determining whether the model is capable of some set of tasks that’s sufficient for successful power-seeking, or if not, what exactly the barriers are and how close the model is to surmounting them
- at the end of this, we hope to either have found clear barriers to successful power-seeking for current models, or to have produced scary demos of the model succeeding at a sequence of actions that allows it to gain power (e.g., successfully phishing someone)
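The scaffolding described in the list above (sample several candidate actions, have the model rate them, run the best one in a shell, and feed the result back) could look roughly like the sketch below. This is not the project's actual code: every name here (`agent_loop`, `propose`, `score`, `execute`, the `DONE` stop token) is a hypothetical interface chosen for illustration, and the model calls are replaced by stubs.

```python
import subprocess
from typing import Callable, List

# Hypothetical interfaces, assumed for this sketch only.
ProposeFn = Callable[[str, int], List[str]]  # (transcript, k) -> k candidate commands
ScoreFn = Callable[[str, str], float]        # (transcript, candidate) -> higher is better
ExecuteFn = Callable[[str], str]             # command -> observed output


def run_in_shell(command: str, timeout_s: float = 10.0) -> str:
    """Run one command in a shell and return stdout plus stderr.

    The command is model-generated, so this belongs inside a sandbox.
    """
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout_s
    )
    return result.stdout + result.stderr


def agent_loop(
    task: str,
    propose: ProposeFn,
    score: ScoreFn,
    execute: ExecuteFn,
    k: int = 3,
    max_steps: int = 5,
    stop_token: str = "DONE",
) -> str:
    """Repeatedly pick the best of k sampled actions and execute it."""
    transcript = f"TASK: {task}\n"
    for _ in range(max_steps):
        candidates = propose(transcript, k)
        best = max(candidates, key=lambda c: score(transcript, c))
        if best.strip() == stop_token:
            break
        observation = execute(best)
        transcript += f"$ {best}\n{observation}\n"
    return transcript


# Stubs standing in for real model calls and a real shell.
executed: List[str] = []

def stub_propose(transcript: str, k: int) -> List[str]:
    if "$ " in transcript:  # something already ran, so propose stopping
        return ["DONE"] * k
    return ["echo a", "echo bb", "echo c"][:k]

def stub_score(transcript: str, candidate: str) -> float:
    return float(len(candidate))  # arbitrary toy preference

def stub_execute(command: str) -> str:
    executed.append(command)
    return "ok"

transcript = agent_loop("demo", stub_propose, stub_score, stub_execute)
print(transcript)
```

Passing `execute` in as a parameter keeps the loop testable without touching a real shell; `run_in_shell` can be substituted when the loop runs inside an isolated environment.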
We are mindful of the risks of “helping” models to engage in planning around unintended behaviors. Our high-level take is that it’s best to get a look at the problem before it’s insurmountable - “we want to see AI systems trying to take over the world while they still can’t succeed.” However, there are times when we won’t explore or publish a line of work in order to avoid potential harm.
In the future we’ll probably focus more on designing evaluations for alignment.
What we’ve done so far
- We’ve built a web interface that allows us to more easily generate and compose scary sequences of actions from existing cutting-edge models.
- We’ve discovered some surprising capabilities of existing models in terms of achieving power-seeking goals.
- We’ve found some limitations of existing models that prevent them from successfully seeking power.
People + Logistics
The project lives at ARC, and is mostly run by Elizabeth (Beth) Barnes, who reports to Paul Christiano and is also advised by Holden Karnofsky. We have a small team of employees and contractors. The project is largely siloed from ARC’s other work. We are actively hiring for a number of roles; consider applying!