Research - METR

Expenditure Horizon: Measuring Optimization Ability, with an Application to NanoGPT

July 21, 2026

We propose the "expenditure horizon" as a measure of an AI agent’s optimization ability, and illustrate it empirically with the NanoGPT speedrun.

Read the English original

Frontier AI Risk Report (February–March 2026)

May 19, 2026

A pilot assessment of the risk of rogue deployments of AI agents inside frontier AI companies.

Read the full post

Measuring the Self-Reported Impact of Early-2026 AI on Technical Worker Productivity

May 11, 2026

A survey of 349 technical workers finds a median 1.4–2x self-reported change in value of work due to AI tools, expected to grow over time, though there are reasons to be skeptical of the magnitude.

Read the English original

Task Substitution and Uplift

May 8, 2026

We distinguish three measures of AI uplift (on old tasks, on new tasks, and in value) and show that task substitution can cause these to diverge substantially.

Read the English original

MirrorCode: Evidence that AI can already do some weeks-long coding tasks

April 10, 2026

Early results from the MirrorCode benchmark with METR: AI agents can complete weeks-long coding tasks, including reimplementing a 16,000-line codebase.

Read the English original

We are Changing our Developer Productivity Experiment Design

February 24, 2026

Our second developer productivity study faces selection effects from wider AI adoption, prompting us to redesign our approach.

Read the English original

Time Horizon 1.1

January 29, 2026

We’re releasing a new version of our time horizon estimates (TH1.1), using more tasks and a new eval infrastructure.

Read the English original

Early work on monitorability evaluations

January 22, 2026

We show preliminary results on a prototype evaluation that tests monitors' ability to catch AI agents doing side tasks, and AI agents' ability to bypass this monitoring.

Read the English original

GPT-5.1-Codex-Max Evaluation Results

November 19, 2025

We evaluate whether GPT-5.1-Codex-Max poses significant catastrophic risks via AI self-improvement or rogue replication. We conclude that this seems unlikely.

Read the English original

MALT: A Dataset of Natural and Prompted Behaviors That Threaten Eval Integrity

October 14, 2025

MALT (Manually-reviewed Agentic Labeled Transcripts) is a dataset of natural and prompted examples of behaviors that threaten evaluation integrity (like generalized reward hacking or sandbagging).

Read the English original

Forecasting the Impacts of AI R&D Acceleration: Results of a Pilot Study

August 20, 2025

AI agents are improving rapidly at autonomous software development and machine learning tasks, and, if recent trends hold, may match human researchers at challenging months-long research projects in under a decade. Some economic models predict that automation of AI research by AI agents could increase the pace of further progress dramatically, with many years of progress at the current rate being compressed into months. AI developers have identified this as a key capability to monitor and prepare for, since the national security implications and societal impacts of such rapid progress could be enormous.

Read the English original

Research Update: Algorithmic vs. Holistic Evaluation

August 13, 2025

Many AI benchmarks use algorithmic scoring to evaluate how well AI systems perform on some set of tasks. However, AI systems often produce code that scores well but isn't production-ready due to issues with test coverage, formatting, and code quality. This helps explain why AI tools show less productivity improvement than expected despite strong performance on coding benchmarks.

Read the English original

CoT May Be Highly Informative Despite “Unfaithfulness”

August 8, 2025

Recent work from Anthropic and others claims that LLMs' chains of thought can be “unfaithful”. These papers make an important point: you can't take everything in the CoT at face value. As a result, people often use these results to conclude the CoT is useless for analyzing and monitoring AIs. Here, instead of asking whether the CoT always contains all information relevant to a model's decision-making in all problems, we ask if it contains enough information to allow developers to monitor models in practice. Our experiments suggest that it might.

Read the English original

GPT-5 Evaluation Results

August 7, 2025

We evaluate whether GPT-5 poses significant catastrophic risks via AI self-improvement, rogue replication, or sabotage of AI labs. We conclude that this seems unlikely. However, capability trends continue rapidly, and models display increasing eval awareness.

Read the English original

How Does Time Horizon Vary Across Domains?

July 14, 2025

We build on our time-horizon work and analyze 9 benchmarks for scientific reasoning, math, robotics, computer use, and self-driving in terms of time-horizon trends; we observe generally similar rates of improvement to the 7-month doubling time in our original time-horizon work.

Read the English original

Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity

July 10, 2025

We conduct a randomized controlled trial to understand how early-2025 AI tools affect the productivity of experienced open-source developers working on their own repositories. Surprisingly, we find that when developers use AI tools, they take 19% longer than without: AI makes them slower.

Read the English original

DeepSeek and Qwen Evaluation Results

June 27, 2025

METR conducted preliminary evaluations of a set of DeepSeek and Qwen models. We found that the level of autonomous capabilities of mid-2025 DeepSeek models is similar to the level of capabilities of frontier models from late 2024.

Read the English original

Recent Frontier Models Are Reward Hacking

June 5, 2025

In the last few months, we’ve seen increasingly clear examples of reward hacking on our tasks: AI systems try to “cheat” and get impossibly high scores. They do this by exploiting bugs in our scoring code or subverting the task setup, rather than actually solving the problem we’ve given them. This isn’t because the AI systems are incapable of understanding what the users want (they demonstrate awareness that their behavior isn't in line with user intentions and disavow cheating strategies when asked), but rather because they seem misaligned with the user’s goals.

Read the English original

OpenAI o3 and o4-mini Evaluation Results

April 16, 2025

METR conducted a preliminary evaluation of OpenAI's o3 and o4-mini. The two models displayed higher autonomous capabilities than other public models tested, and o3 appears somewhat prone to "reward hacking".

Read the English original

Claude 3.7 Evaluation Results

April 4, 2025

METR conducted a preliminary evaluation of Claude 3.7 Sonnet. While we failed to find significant evidence for a dangerous level of autonomous capabilities, the model displayed impressive AI R&D capabilities on a subset of RE-Bench which provides the model with ground-truth performance information. We believe these capabilities are central to important threat models and should be monitored closely.

Read the English original

Measuring AI Ability to Complete Long Software Tasks

March 19, 2025

We propose measuring AI performance in terms of the *length* of tasks AI agents can complete. We show that this metric has increased consistently and exponentially over the past 6 years, with a doubling time of around 7 months. Extrapolating this trend predicts that, in under a decade, we will see AI agents that can independently complete a large fraction of software tasks that currently take humans days or weeks.

Read the English original
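The extrapolation described in this summary is ordinary compounding arithmetic. The sketch below is only an illustration under stated assumptions: it takes the 7-month doubling time from the summary, while the starting horizon of 60 minutes, the 40-hour target, and both function names are hypothetical and not taken from the post.

```python
import math

# Doubling time quoted in the summary (months). Everything else below is
# a hypothetical illustration, not METR's data or method.
DOUBLING_MONTHS = 7.0


def extrapolate_horizon(h0_minutes: float, months_ahead: float,
                        doubling_months: float = DOUBLING_MONTHS) -> float:
    """Project a task-length horizon forward assuming steady exponential growth."""
    return h0_minutes * 2 ** (months_ahead / doubling_months)


def months_until(h0_minutes: float, target_minutes: float,
                 doubling_months: float = DOUBLING_MONTHS) -> float:
    """Months needed for the horizon to grow from h0 to target at that rate."""
    return doubling_months * math.log2(target_minutes / h0_minutes)


if __name__ == "__main__":
    start = 60.0          # hypothetical starting horizon: 1 hour
    target = 40 * 60.0    # hypothetical target: a 40-hour work week
    print(f"Horizon after 14 months: {extrapolate_horizon(start, 14):.0f} minutes")
    print(f"Months to reach target:  {months_until(start, target):.1f}")
```

Under these assumed inputs the horizon quadruples after two doubling periods, and the time to reach any target depends only on the ratio between target and starting horizon.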

HCAST: Human-Calibrated Autonomy Software Tasks

March 17, 2025

Sharing details about HCAST (Human-Calibrated Autonomy Software Tasks), a benchmark we’ve been developing to measure the abilities of frontier AI systems to complete diverse software tasks autonomously.

Read the full post

DeepSeek-R1 Evaluation Results

March 5, 2025

We evaluated DeepSeek-R1 for dangerous autonomous capabilities and found no evidence of dangerous capabilities beyond those of existing models such as Claude 3.5 Sonnet and GPT-4o. Interestingly, we found that it did not do substantially better than DeepSeek-V3 on our autonomy suite.

Read the English original

METR’s GPT-4.5 pre-deployment evaluations

February 27, 2025

Additional details about our evaluations of GPT-4.5, and some discussion about the limitations of pre-deployment evaluations and current evaluation methodologies.

Read the English original

Measuring Automated Kernel Engineering

February 14, 2025

We measured the performance of frontier models at writing GPU kernels. With a small amount of scaffolding, we found that the best model can provide an average speedup on KernelBench of 1.8x.

Read the English original

DeepSeek-V3 Evaluation Results

February 12, 2025

We evaluated DeepSeek-V3 for dangerous autonomous capabilities and found no evidence of dangerous capabilities beyond those of existing models such as Claude 3.5 Sonnet and GPT-4o. We also confirmed that its performance on GPQA is not due to training data contamination.

Read the English original

An update on our preliminary evaluations of Claude 3.5 Sonnet and o1

January 31, 2025

Preliminary evaluations of Claude 3.5 Sonnet (New) and o1, as well as some discussion of challenges in making capability-based safety arguments for AI models.

Read the English original

Evaluating frontier AI R&D capabilities of language model agents against human experts

November 22, 2024

We’re releasing RE-Bench, a new benchmark for measuring the performance of humans and frontier model agents on ML research engineering tasks. We also share data from 71 human expert attempts and results for Anthropic’s Claude 3.5 Sonnet and OpenAI’s o1-preview.

Read the English original

The Rogue Replication Threat Model

November 12, 2024

Thoughts on how AI agents might develop large and resilient rogue populations.

Read the English original

Details about METR’s preliminary evaluation of o1-preview

September 12, 2024

We measured the performance of OpenAI's o1-mini and o1-preview models on our autonomy and AI R&D task suites, and found they did not exceed the capabilities of the best existing public model we've evaluated, though we could not confidently upper-bound their capabilities.

Read the English original

Evaluation platform: Vivaria

August 20, 2024

METR's tool for running evaluations and conducting agent elicitation research.

Read the full post

Details about METR’s preliminary evaluation of GPT-4o

August 7, 2024

We measured the performance of GPT-4o given a simple agent scaffolding on 77 tasks across 30 task families testing autonomous capabilities.

Read the English original

An update on our general capability evaluations

August 6, 2024

More tasks, human baselines, and preliminary results for GPT-4 and Claude.

Read the English original

Autonomy Evaluation Resources

March 15, 2024

A collection of resources for evaluating potentially dangerous autonomous capabilities of frontier models.

Read the English original

Example autonomy evaluation protocol

March 15, 2024

An example protocol for the whole evaluation process, based on our task suite, elicitation protocol, and scoring methods.

Read the English original

Example autonomy task suite

March 15, 2024

A suite of tasks designed to test autonomous abilities at different difficulty levels.

Read the full post

Guidelines for capability elicitation

March 15, 2024

Priorities for approximating the full potential capability of an AI agent, and recommended checks for evaluation validity.

Read the English original

Measuring the impact of post-training enhancements

March 15, 2024

A brief research report outlining quantitative research that could inform the "safety margin" needed to account for further post-training enhancements to agent capability.

Read the English original

Portable Evaluation Tasks via the METR Task Standard

February 29, 2024

METR has published a standard way to define tasks for evaluating the capabilities of AI agents.

Read the English original

New report: Evaluating Language-Model Agents on Realistic Autonomous Tasks

July 31, 2023

We have just released our first public report. It introduces methodology for assessing the capacity of LLM agents to acquire resources, create copies of themselves, and adapt to novel challenges they encounter in the wild.

Read the English original

Update on ARC's recent eval efforts

March 17, 2023

More information about ARC's evaluations of GPT-4 and Claude.

Read the English original