METR

Model Evaluation & Threat Research

METR conducts research and evaluations to improve public understanding of the capabilities and risks of frontier AI systems.

Task-Completion Time Horizons of Frontier AI Models

We propose measuring AI performance in terms of the length of software tasks AI agents can complete. We show an exponential increase in this time horizon metric over the past 6 years.

Featured research

Our AI evaluations research focuses on assessing broad autonomous capabilities and the ability of AI systems to accelerate AI R&D. We also study potential AI behavior that threatens the integrity of evaluations and mitigations for such behavior.

GPT-5.1 Evaluation Results

We evaluate whether GPT-5.1 poses significant catastrophic risks via AI self-improvement, rogue replication, or sabotage of AI labs.

Measuring AI Ability to Complete Long Tasks

We propose measuring AI performance in terms of the length of tasks AI agents can complete. We show that this metric has increased consistently and exponentially.
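
As a rough illustration of what exponential growth in a time horizon implies, the Python sketch below projects a horizon forward under a constant doubling time. The starting horizon, the doubling time, and the function names are hypothetical placeholders chosen for this example, not figures or code from METR's paper.

```python
import math


def projected_horizon(h0_minutes: float, doubling_months: float, months_elapsed: float) -> float:
    """Time horizon after `months_elapsed`, assuming pure exponential growth.

    All inputs are illustrative assumptions, not measured values.
    """
    return h0_minutes * 2 ** (months_elapsed / doubling_months)


def months_to_reach(h0_minutes: float, target_minutes: float, doubling_months: float) -> float:
    """Months needed for the horizon to grow from h0 to the target."""
    return doubling_months * math.log2(target_minutes / h0_minutes)


if __name__ == "__main__":
    # Hypothetical inputs: a 60-minute horizon that doubles every 6 months.
    print(projected_horizon(60.0, 6.0, 12.0))  # two doublings
    print(months_to_reach(60.0, 480.0, 6.0))   # three doublings
```

Under these assumed inputs, each doubling period multiplies the horizon by two, so the projection depends heavily on the doubling time chosen.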

Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity

We found that when developers used AI tools in early 2025, they took 19% longer than they did without them; AI made them slower.

MALT

A dataset of natural and prompted examples of behaviors that threaten evaluation integrity, like generalized reward hacking or sandbagging

Measuring autonomous AI capabilities — resource collection

An index of our research and guidance on how to measure AI systems' ability to autonomously complete a wide range of multi-hour tasks

Early work on monitorability evaluations

We show preliminary results on a prototype evaluation that tests monitors' ability to catch AI agents doing side tasks, and AI agents' ability to bypass this monitoring

Common Elements of Frontier AI Safety Policies

An analysis of the shared components across twelve published frontier AI safety policies, including capability thresholds, model weight security, and deployment mitigations

Frontier AI Safety Policies

A list of AI companies' frontier safety policies intended to evaluate and manage severe AI risks

What should companies share about risks from frontier AI models?

We describe areas for risk transparency and specific technical questions that a frontier AI developer could answer.

Evaluation reports

We conduct evaluations of the autonomous capabilities of frontier AI models, with some in partnership with AI developers such as Anthropic and OpenAI. We do this both to understand the models' capabilities and to pilot third-party evaluator arrangements.

GPT-5.1-Codex-Max

19 November 2025 • Partnership

GPT-5

7 August 2025 • Partnership

DeepSeek and Qwen

27 June 2025 • No company involvement

OpenAI o3 and o4-mini

16 April 2025 • Partnership

Claude 3.7

4 April 2025 • Partnership

DeepSeek-R1

5 March 2025 • No company involvement

GPT-4.5

27 February 2025 • Partnership

DeepSeek-V3

12 February 2025 • No company involvement

Claude 3.5 Sonnet and o1

31 January 2025 • Partnership

Claude 3.5 Sonnet (original)

30 October 2024 • Partnership

o1-preview

12 September 2024 • Partnership

GPT-4o

7 August 2024 • Partnership

GPT-4 and Claude

17 March 2023 • Partnership

METR does not accept compensation for this work.

Companies such as OpenAI and Anthropic have provided access and compute credits to support evaluation research. We also occasionally evaluate models independently after they are released, without involvement from the model's developers. Recent public reports resulting from this work are listed above, with additional discussion in the respective system cards.

Frontier AI Safety Policies

We advise AI developers and governments on implementing risk assessment methodologies for AI. For example, we have advised developers on Frontier AI Safety Policies.

Resources on FSPs

Google

OpenAI

Anthropic

Recent

Measuring the Self-Reported Impact of Early-2026 AI on Technical Worker Productivity

11 May 2026

A survey of 349 technical workers finds a median 1.4–2x self-reported change in value of work due to AI tools, expected to grow over time, though there are reasons to be skeptical of the magnitude.

Review of the "Risks from automated R&D" section in the Anthropic Risk Report (February 2026)

8 May 2026

External review from METR of the "Risks from automated R&D" section in Anthropic's February 2026 Risk Report

Task Substitution and Uplift

8 May 2026

We distinguish three measures of AI uplift (on old tasks, on new tasks, and in value) and show that task substitution can cause these to diverge substantially.

Evidence on AI R&D Progress from NanoGPT

21 April 2026

Classifying human and agent contributions to the NanoGPT speedrun, and what publicly tracked challenges can tell us about AI R&D acceleration.

MirrorCode: Evidence that AI can already do some weeks-long coding tasks

10 April 2026

Fine-tuning experiments on CoT controllability

1 April 2026

We find that a small amount of fine-tuning on instruction following in the CoT generalizes to meaningful increases in CoT controllability on an out-of-distribution (OOD) set of tasks. We fine-tune four reasoning models on small datasets of instruction-following reasoning data, and average OOD controllability across the four models rises from 2.9% to 8.8%.
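
To make the size of that change concrete, the short Python sketch below computes the absolute and relative change between the two quoted averages. The helper name is hypothetical; only the 2.9% and 8.8% figures come from the text.

```python
def controllability_change(before_pct: float, after_pct: float) -> tuple[float, float]:
    """Return (gain in percentage points, ratio of after to before)."""
    gain_points = after_pct - before_pct
    ratio = after_pct / before_pct
    return gain_points, ratio


if __name__ == "__main__":
    # Averages quoted in the summary above.
    gain, ratio = controllability_change(2.9, 8.8)
    print(f"gain: {gain:.1f} percentage points, ratio: {ratio:.2f}x")
```

On these figures the gain is about 5.9 percentage points, which is roughly a threefold relative increase from a low base.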