Software engineering tasks serve as early warning signs
“We think of it as trying to build advanced science that can say, when are we getting to the point that AI systems could improve themselves or speed up the pace of AI development? When will AI research feed on itself? The core capability for that might be software engineering and machine learning research ability.”
METR remains bottlenecked by technical talent over compute
“I think clearly the central reason is that we are bottlenecked on technical talent, on incredibly capable people to come work on these questions. I was on a METR work retreat recently where we were brainstorming 20, 30 of these, what seemed like world-important problems, problems that we think no one else is going to get to if we do not get to them.”
Claude 4.6 handles 12-hour human engineering tasks
“In this case, we're talking about, for Claude 4.6, something like tasks that take humans 12 hours to do; we predict that it will succeed at those tasks around 50 percent of the time. It turns out that when you plot, using this particular difficulty measure, how performant AIs are relative to how long it takes humans to complete these tasks, we see an exponential increase in capabilities for AIs.”
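The "50% time horizon" described above can be sketched numerically: fit a logistic curve of success probability against the log of human task duration, then solve for the duration at which predicted success crosses 50 percent. The data below is made up for illustration; it is not METR's actual dataset or fitting code.

```python
import math

# Hypothetical (human task time in hours, model succeeded?) pairs.
# Illustrative only -- not real METR measurements.
tasks = [(0.25, 1), (0.5, 1), (1, 1), (2, 1), (4, 1), (8, 1),
         (12, 1), (12, 0), (16, 0), (24, 0), (48, 0), (96, 0)]

# Logistic model: P(success) = sigmoid(a - b * log2(hours)).
# Fit a, b by plain gradient ascent on the log-likelihood.
a, b = 0.0, 1.0
lr = 0.1
for _ in range(5000):
    grad_a = grad_b = 0.0
    for hours, y in tasks:
        x = math.log2(hours)
        p = 1 / (1 + math.exp(-(a - b * x)))
        grad_a += y - p          # d(log-lik)/da
        grad_b += -(y - p) * x   # d(log-lik)/db
    a += lr * grad_a / len(tasks)
    b += lr * grad_b / len(tasks)

# The 50% time horizon is where a - b * log2(h) = 0, i.e. h = 2^(a/b).
horizon_hours = 2 ** (a / b)
print(f"estimated 50% time horizon: ~{horizon_hours:.1f} hours")
```

With successes clustered below 12 hours and failures above, the fitted crossover lands near the 12-hour figure quoted above; the exponential trend METR describes comes from tracking how this single number grows across successive model releases.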
AI capabilities double every four months on average
“And what that ends up meaning is that you keep on having these doublings of capabilities every, let's say, four months, it seems, on recent trends, where the next model is not merely going to have an hour longer time horizon, but perhaps have some multiple of the time horizon of the previous model that's come out.”
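The arithmetic behind "a multiple, not an hour" is just compound doubling. A minimal sketch, assuming a fixed four-month doubling time and the 12-hour starting horizon mentioned earlier (both numbers are illustrative, not a forecast):

```python
# Steady-doubling projection of the 50% time horizon.
DOUBLING_MONTHS = 4
start_horizon_hours = 12  # illustrative starting point from the discussion

def projected_horizon(months_ahead: float) -> float:
    """Time horizon after `months_ahead` months of steady doubling."""
    return start_horizon_hours * 2 ** (months_ahead / DOUBLING_MONTHS)

for months in (0, 4, 8, 12, 24):
    print(f"+{months:2d} months: ~{projected_horizon(months):,.0f} hours")
```

Under these assumptions the horizon goes from 12 to 24 hours in four months, and to 768 hours in two years; the growth is multiplicative, which is why an extrapolated trend line looks so steep.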
METR measures autonomy to predict catastrophic AI risk
“METR is a research nonprofit based in the Bay Area... dedicated to advancing the science of measuring whether and when AI systems might pose catastrophic risks to humanity as a whole, focused specifically on threats that come from AI autonomy or AI systems themselves. We think it sets the stakes for conversations about AI misalignment.”
Compute spending has risen at the same rate as time horizons
“One extraordinary fact from my perspective... is something like the R&D spend on compute of these companies has risen exponentially, of course, and in fact, it's risen exponentially at essentially the same rate as time horizon progress. You know, I think there's nothing necessary about that. You know, it doesn't mean by itself that if compute progress slows, then capabilities progress will also slow.”
AI models struggle with messy real-world engineering friction
“The tasks that come up in the wild are more likely to be messy in some sense. They involve working with other people. They involve working in much larger code bases or more open-ended problems, maybe with something even adversarial going on. We do tend to see that the AIs are less capable of working on these more messy problems.”