METR

Model Evaluation and Threat Research

Formerly “ARC Evals”, METR was incubated at the Alignment Research Center and is now a standalone nonprofit.

METR is a research nonprofit that works on assessing whether cutting-edge AI systems could pose catastrophic risks to society.

We build the science of accurately assessing risks, so that humanity is informed before developing transformative AI systems.

AI’s Transformative Potential

We believe that AI could change the world quickly and drastically, with potential for both enormous good and enormous harm. We also believe it’s hard to predict exactly when and how this might happen.

At some point, AIs will probably be able to do most of what humans can do, including developing new technologies; starting businesses and making money; finding new cybersecurity exploits and fixes; and more.

We think it’s very plausible that AI systems could end up pursuing goals that are at odds with a thriving civilization. This could be due to someone’s deliberate effort to cause chaos,[1] or happen despite the intention to only develop AI systems that are safe.[2]

Given how quickly things could play out, we don’t think it’s good enough to “wait and see” whether there are dangers.

We believe in vigilantly, continually assessing risks. If an AI brings significant risk of a global catastrophe, the decision to develop and/or release it can’t lie only with the company that creates it.

Partnerships

We have previously worked with Anthropic, OpenAI, and other companies to pilot some informal pre-deployment evaluation procedures. These companies have also given us some kinds of non-public access and provided compute credits to support evaluation research.

We think it’s important for there to be third-party evaluators with formal arrangements and access commitments – both for evaluating new frontier models before they are scaled up or deployed, and for conducting research to improve evaluations.

We do not yet have such arrangements, but we are excited about taking more steps in this direction.

We are also partnering with the UK AI Safety Institute and are part of the NIST AI Safety Institute Consortium.

Our Work

Our research is focused on how to evaluate AI systems for dangerous capabilities:

Autonomy Evaluation Resources

These are resources to evaluate AI systems for risks from autonomous capabilities. They include a suite of tasks, software tooling, and an example overall evaluation protocol.

Vivaria

An open-source platform for agent evaluations and capability elicitation research.

Task Standard

A standard way to define tasks for evaluating the capabilities of AI agents.

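As an illustration, a minimal task family written against the Task Standard’s Python interface might look like the sketch below. The version string and task content are placeholders; the Task Standard repository defines the authoritative interface.

```python
# Illustrative sketch of a task family in the spirit of the METR Task Standard.
# The TaskFamily class and method names reflect the standard's Python interface;
# the version string and task data below are placeholders.


class TaskFamily:
    """A family of related tasks that an AI agent can be evaluated on."""

    # Version of the Task Standard this family targets (placeholder value).
    standard_version = "0.3.0"

    @staticmethod
    def get_tasks() -> dict[str, dict]:
        # Each key names one task; the value holds task-specific data.
        return {
            "reverse_string": {"input": "hello", "answer": "olleh"},
        }

    @staticmethod
    def get_instructions(t: dict) -> str:
        # The instructions presented to the agent being evaluated.
        return f"Reverse the string {t['input']!r} and submit the result."

    @staticmethod
    def score(t: dict, submission: str) -> float | None:
        # Full credit for a correct submission, none otherwise.
        return 1.0 if submission.strip() == t["answer"] else 0.0
```

An evaluation harness (such as Vivaria) can then import this class, present the instructions to an agent, and score the agent’s submission.
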
Responsible Scaling Policies (RSPs)

A proposal for evals-based catastrophic risk reduction adopted by several AI developers.

  1. For instance, an anonymous user set up ChatGPT with the ability to run code and access the internet, prompted it with the goal of “destroy humanity,” and set it running.

  2. There are reasons to expect goal-directed behavior to emerge in AI systems, and to expect that superficial attempts to align ML systems will result in sycophantic or deceptive behavior – “playing the training game” – rather than successful alignment.