Research
Our AI evaluations research focuses on assessing broad autonomous capabilities and the ability of AI systems to accelerate AI R&D. We also study potential AI behaviors that threaten the integrity of evaluations, along with mitigations for such behaviors.
Time Horizon 1.1
29 January 2026

We’re releasing a new version of our time horizon estimates (TH1.1), using more tasks and new evaluation infrastructure.

Early work on monitorability evaluations
22 January 2026

We show preliminary results from a prototype evaluation that tests monitors' ability to catch AI agents doing side tasks, and AI agents' ability to bypass this monitoring.

GPT-5.1-Codex-Max Evaluation Results
19 November 2025

We evaluate whether GPT-5.1-Codex-Max poses significant catastrophic risks via AI self-improvement or rogue replication. We conclude that this seems unlikely.

MALT: A Dataset of Natural and Prompted Behaviors That Threaten Eval Integrity
14 October 2025

MALT (Manually-reviewed Agentic Labeled Transcripts) is a dataset of natural and prompted examples of behaviors that threaten evaluation integrity (like generalized reward hacking or sandbagging).

Forecasting the Impacts of AI R&D Acceleration: Results of a Pilot Study
20 August 2025

AI agents are improving rapidly at autonomous software development and machine learning tasks, and, if recent trends hold, may match human researchers at challenging months-long research projects in under a decade. Some economic models predict that automation of AI research by AI agents could increase the pace of further progress dramatically, with many years of progress at the current rate being compressed into months. AI developers have identified this as a key capability to monitor and prepare for, since the national security implications and societal impacts of such rapid progress could be enormous.

Research Update: Algorithmic vs. Holistic Evaluation
13 August 2025

Many AI benchmarks use algorithmic scoring to evaluate how well AI systems perform on some set of tasks. However, AI systems often produce code that scores well but isn't production-ready due to issues with test coverage, formatting, and code quality. This helps explain why AI tools show less productivity improvement than expected despite strong performance on coding benchmarks.

CoT May Be Highly Informative Despite “Unfaithfulness”
8 August 2025

Recent work from Anthropic and others claims that LLMs' chains of thought can be “unfaithful”. These papers make an important point: you can't take everything in the CoT at face value. However, these results are often used to conclude that the CoT is useless for analyzing and monitoring AIs. Here, instead of asking whether the CoT always contains all information relevant to a model's decision-making in all problems, we ask whether it contains enough information to allow developers to monitor models in practice. Our experiments suggest that it might.

GPT-5 Evaluation Results
7 August 2025

We evaluate whether GPT-5 poses significant catastrophic risks via AI self-improvement, rogue replication, or sabotage of AI labs. We conclude that this seems unlikely. However, capability trends continue rapidly, and models display increasing eval awareness.

How Does Time Horizon Vary Across Domains?
14 July 2025

Building on our time-horizon work, we analyze time-horizon trends across 9 benchmarks spanning scientific reasoning, math, robotics, computer use, and self-driving. We observe rates of improvement broadly similar to the 7-month doubling time from our original analysis.

Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity
10 July 2025

We conduct a randomized controlled trial to understand how early-2025 AI tools affect the productivity of experienced open-source developers working on their own repositories. Surprisingly, we find that when developers use AI tools, they take 19% longer than without—AI makes them slower.

DeepSeek and Qwen Evaluation Results
27 June 2025

METR conducted preliminary evaluations of a set of DeepSeek and Qwen models. We found that the autonomous capabilities of mid-2025 DeepSeek models are similar to those of frontier models from late 2024.

Recent Frontier Models Are Reward Hacking
5 June 2025

In the last few months, we’ve seen increasingly clear examples of reward hacking on our tasks: AI systems try to “cheat” and get impossibly high scores. They do this by exploiting bugs in our scoring code or subverting the task setup, rather than actually solving the problem we’ve given them. This isn’t because the AI systems are incapable of understanding what users want (they demonstrate awareness that their behavior isn't in line with user intentions and disavow cheating strategies when asked), but rather because they seem misaligned with the user’s goals.

OpenAI o3 and o4-mini Evaluation Results
16 April 2025

METR conducted a preliminary evaluation of OpenAI's o3 and o4-mini. The two models displayed higher autonomous capabilities than other public models tested, and o3 appears somewhat prone to "reward hacking".

Claude 3.7 Evaluation Results
4 April 2025

METR conducted a preliminary evaluation of Claude 3.7 Sonnet. While we did not find significant evidence of a dangerous level of autonomous capabilities, the model displayed impressive AI R&D capabilities on a subset of RE-Bench that provides the model with ground-truth performance information. We believe these capabilities are central to important threat models and should be monitored closely.

Measuring AI Ability to Complete Long Tasks
19 March 2025

We propose measuring AI performance in terms of the *length* of tasks AI agents can complete. We show that this metric has increased exponentially and consistently over the past 6 years, with a doubling time of around 7 months. Extrapolating this trend predicts that, in under a decade, we will see AI agents that can independently complete a large fraction of software tasks that currently take humans days or weeks.
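
As a rough illustration of that extrapolation, here is a back-of-the-envelope sketch. The 7-month doubling time is the figure from the post; the starting horizon of about one hour and the 40-hour target (roughly a week of human work) are illustrative assumptions, not numbers reported in the paper.

```python
import math

# Back-of-the-envelope extrapolation of the time-horizon trend.
# The 7-month doubling time comes from the post; the starting horizon
# (~1 hour) and the 40-hour target (~one week of human work) are
# illustrative assumptions, not figures reported in the paper.
doubling_time_months = 7
current_horizon_hours = 1.0
target_horizon_hours = 40.0

doublings = math.log2(target_horizon_hours / current_horizon_hours)
months = doublings * doubling_time_months

print(f"{doublings:.1f} doublings ≈ {months:.0f} months ≈ {months / 12:.1f} years")
# Prints: 5.3 doublings ≈ 37 months ≈ 3.1 years, i.e. well under a decade.
```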

HCAST: Human-Calibrated Autonomy Software Tasks
17 March 2025

Sharing details about HCAST (Human-Calibrated Autonomy Software Tasks), a benchmark we’ve been developing to measure the abilities of frontier AI systems to complete diverse software tasks autonomously.

DeepSeek-R1 Evaluation Results
5 March 2025

We evaluated DeepSeek-R1 for dangerous autonomous capabilities and found no evidence of dangerous capabilities beyond those of existing models such as Claude 3.5 Sonnet and GPT-4o. Interestingly, we found that it did not do substantially better than DeepSeek-V3 on our autonomy suite.

METR’s GPT-4.5 pre-deployment evaluations
27 February 2025

Additional details about our evaluations of GPT-4.5, and some discussion about the limitations of pre-deployment evaluations and current evaluation methodologies.

Measuring Automated Kernel Engineering
14 February 2025

We measured the performance of frontier models at writing GPU kernels. With a small amount of scaffolding, we found that the best model can achieve an average speedup of 1.8x on KernelBench.

DeepSeek-V3 Evaluation Results
12 February 2025

We evaluated DeepSeek-V3 for dangerous autonomous capabilities and found no evidence of dangerous capabilities beyond those of existing models such as Claude 3.5 Sonnet and GPT-4o. We also confirmed that its performance on GPQA is not due to training data contamination.

An update on our preliminary evaluations of Claude 3.5 Sonnet and o1
31 January 2025

Preliminary evaluations of Claude 3.5 Sonnet (New) and o1, as well as some discussion of challenges in making capability-based safety arguments for AI models.

Evaluating frontier AI R&D capabilities of language model agents against human experts
22 November 2024

We’re releasing RE-Bench, a new benchmark for measuring the performance of humans and frontier model agents on ML research engineering tasks. We also share data from 71 human expert attempts and results for Anthropic’s Claude 3.5 Sonnet and OpenAI’s o1-preview.

The Rogue Replication Threat Model
12 November 2024

Thoughts on how AI agents might develop large and resilient rogue populations.

Details about METR’s preliminary evaluation of o1-preview
12 September 2024

We measured the performance of OpenAI's o1-mini and o1-preview models on our autonomy and AI R&D task suites, and found they did not exceed the capabilities of the best existing public model we've evaluated, though we could not confidently upper-bound their capabilities.

Evaluation platform: Vivaria
20 August 2024

METR's tool for running evaluations and conducting agent elicitation research.

Details about METR’s preliminary evaluation of GPT-4o
7 August 2024

We measured the performance of GPT-4o given a simple agent scaffolding on 77 tasks across 30 task families testing autonomous capabilities.

An update on our general capability evaluations
6 August 2024

More tasks, human baselines, and preliminary results for GPT-4 and Claude.

Autonomy Evaluation Resources
15 March 2024

A collection of resources for evaluating potentially dangerous autonomous capabilities of frontier models.

Example autonomy evaluation protocol
15 March 2024

An example protocol for the whole evaluation process, based on our task suite, elicitation protocol, and scoring methods.

Example autonomy task suite
15 March 2024

A suite of tasks designed to test autonomous abilities at different difficulty levels.

Guidelines for capability elicitation
15 March 2024

Priorities for approximating the full potential capability of an AI agent, and recommended checks for evaluation validity.

Measuring the impact of post-training enhancements
15 March 2024

A brief research report outlining quantitative research that could inform the “safety margin” needed to account for further post-training enhancements to agent capability.

Portable Evaluation Tasks via the METR Task Standard
29 February 2024

METR has published a standard way to define tasks for evaluating the capabilities of AI agents.

New report: Evaluating Language-Model Agents on Realistic Autonomous Tasks
31 July 2023

We have just released our first public report. It introduces a methodology for assessing the capacity of LLM agents to acquire resources, create copies of themselves, and adapt to novel challenges they encounter in the wild.

Update on ARC's recent eval efforts
17 March 2023

More information about ARC's evaluations of GPT-4 and Claude.
