Shared components of AI lab commitments to evaluate and mitigate severe risks.
We measured the performance of GPT-4o with simple agent scaffolding on 77 tasks across 30 task families testing autonomous capabilities.
More tasks, human baselines, and preliminary results for GPT-4 and Claude.
METR is hiring ML engineers and researchers.
Emma moves from President to Executive Director; Beth moves to Head of Research.
A collection of resources for evaluating potentially dangerous autonomous capabilities of frontier models.