When writing our recent paper, “Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity”, we thought hard about how to clearly communicate our results, given the surprising nature of the finding. We feel this is a good opportunity to share how we think about scientific integrity and communication.
Communication Strategy Considerations
We’re often faced with communicating surprising results about AI that seem likely to be misinterpreted. We try to be thoughtful about how to minimize predictable misinterpretations, without distorting results to fit our own particular worldview. This is a tricky balance.
At one end of the spectrum, we could focus purely on communicating what we think are the correct takeaways. For example, with our recent developer productivity RCT, we felt that “people are bad at estimating speedup from AI” was a much more robust simplified takeaway than “AI slows down developers”, which we suspected might not hold for different workflows or for future models, and which was also less statistically robust. So we considered de-emphasizing the slowdown result and foregrounding the inaccuracy of estimates.
However, this relies on us being right about what the correct takeaways are. If we’re systematically wrong in our worldviews (as we surely often are), then this approach is counterproductive. For example, if we’re overestimating current and future AI capabilities, then de-emphasizing the slowdown result would be misleading.
At the other end of the spectrum, we could simply report the exact data we observe and rely on readers to decide what to take away. This reduces the risk that we distort results according to our worldview, but it puts more burden on readers to think through how our observations generalize, and it is more likely to lead to predictable misunderstandings by rushed or low-context readers.
These tradeoffs are more acute when summarizing our results (e.g. in titles or abstracts), because we necessarily cannot include as much detail, context, or caveats. We could abstain from providing summaries or shorter descriptions of our findings, given that these are necessarily inaccurate. However, a) people find summaries and short descriptions of findings very useful, and b) if we don’t provide shorter versions of our results, other people with less context and information will likely produce them, and they may not put as much time and care into crafting balanced short takeaways as we do. In general, we try to provide the most accurate description at each description length, from one sentence up to our full papers/reports.
There isn’t an obvious single strategy that works perfectly in all cases, so we put significant effort into balancing these tradeoffs when communicating surprising or confusing results.
Developer Productivity RCT Case Study
In our recent developer productivity study, we found that developers were slower when allowed to use AI tools than when working without them. One obvious concern was that this result could be overgeneralized to suggest that AI tools are not generally capable or useful. This felt particularly salient given that a) we recently published research showing AI systems are capable of completing tasks that take humans 1-2 hours and are improving rapidly, and b) many METR researchers believe current AI systems can accomplish impressive tasks and are advancing quickly, and are worried that humanity is unprepared to safely handle highly capable AI systems.
We thought carefully about how to communicate these results. The most substantive decision was whether to describe the result as “AI slowed down these developers”, or to simply emphasize the gap between developer expectations and AI’s actual impact, without focusing on whether developers were sped up or slowed down.

The title for the paper’s primary figure was particularly difficult with respect to this framing decision. We initially considered simply using the paper title, “Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity”, for this central figure. However, this doesn’t describe the takeaway/result from the graph. We thought there was a reasonable chance that the graph would be shared widely, such that many or most viewers would not engage beyond the graph itself. If we didn’t include a clear takeaway with the graph, people would likely come up with their own; since we have a significant amount of context on the results, we expected we could craft takeaways that reduce misunderstandings. So, the title of this figure hinged on the framing question of “slowdown” versus the gap in expectations.
One alternative we considered emphasized the expectation gap instead of the slowdown result, centering the phrase “Developers Overestimate Speedup” rather than “AI Slows Down Developers”. This option was attractive because the evidence for the expectation gap is stronger (note that the top of the confidence interval for the observed result is close to 0%, but remains far from the forecasts and post hoc estimates of speedup), and because this gap seems likely to generalize to more settings than the specific slowdown percentage we estimate.
However, because we felt confident in the internal validity of our measurement, communicating only that developers “overestimate” speedup did not feel like it did justice to the observed slowdown, which is a real, measured effect in our study that deserves clear communication. This is particularly important since the result is, to some extent, in tension with prior work we’ve published measuring the length of tasks that agents are able to complete. This is a key benefit of being an independent non-profit: we are free to share important results publicly, without intellectual property or profit-motivated constraints.
Once we had resolved this central framing question, our goal was to pack as much crucial context as possible into the title and subtitle, to reduce the likelihood of the results being overgeneralized or misinterpreted. We started from the base of “AI Slows Down Developers” and iteratively added and refined elements to communicate more context. The graphic below lays out what we intended to communicate with the various components of the title and subtitle.

Retrospective on Previous Work
One exercise that proved useful when considering how to communicate these results was reviewing reactions to our previous research communications, e.g. our paper that found a trend in the length of tasks that AIs can complete.
In that case, many people understood us to be making claims about all tasks, rather than the distribution we actually felt comfortable making claims about (algorithmically scorable software tasks). Although we discussed these limitations at length in the paper, in retrospect we wish we had been clearer about these caveats in certain places, like the blog post.
This retrospective review helped motivate many of the key caveats/pieces of context in the main graph, such as developer experience, the large size/complexity of the projects, and the number of developers and tasks involved in the study.
Closing Thoughts
The aspects of research communication discussed above are tricky tradeoffs where we expect to adjust the balance over time as we reflect and learn.
However, there are some principles we treat as more clear-cut. For example, we try to avoid optimizing for reach or “clicks” in ways that degrade accurate understanding. We also focus on making our research convincing to ourselves and to an intelligent, informed and skeptical audience, rather than creating evaluations that we don’t think are meaningful but expect will trigger strong emotional reactions.
We think there can be a place for more “advocacy-shaped” communication strategies, but METR generally prioritizes accuracy, integrity and rigor, even when that means giving up opportunities to have more influence.
Thank you to Lawrence Chan, Chris Painter, and Nate Rush for providing useful feedback.