AI Coding Autonomy Tracker
A METR-style horizon chart showing the length of coding tasks that frontier models can complete autonomously. Y-axis values are shown in hours (converted from METR's source estimates in minutes); toggle between 50% and 80% reliability views and compare linear and log scales.
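The two display options above amount to a unit conversion and an axis transform. A minimal sketch, using hypothetical per-model horizon estimates (the model names and minute values are illustrative, not real METR data):

```python
import math

def minutes_to_hours(minutes):
    """Convert a METR-style horizon estimate from source minutes to hours."""
    return minutes / 60.0

def axis_value(hours, scale="linear"):
    """Return the plotted value for one point on the chosen Y scale."""
    return math.log10(hours) if scale == "log" else hours

# Hypothetical horizon estimates in source minutes (illustrative only).
horizons_min = {"model-a": 30, "model-b": 120}
horizons_hr = {name: minutes_to_hours(m) for name, m in horizons_min.items()}
# On a log scale, equal vertical steps correspond to equal multiplicative
# gains in horizon length, which is why doubling trends appear as lines.
```

On the log view, a model whose horizon doubles moves up by a constant increment (`log10(2) ≈ 0.30`), so steady exponential progress plots as a straight line.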
Autonomous coding capability matters because it changes how quickly software can be built, how teams are staffed, and how costs move over time. As these systems improve, they also reshape which technical skills become most valuable and how engineering work is organized. At the national level, capability trajectories increasingly map to competitiveness in AI development, and at the personal level they influence the speed, quality, and safety of the digital tools people rely on every day.
What does “autonomous coding for an hour+” mean in practice?
METR-style tasks are real multi-step software assignments. The model is expected to plan, write code, run tests, debug failures, and deliver a working result with limited human intervention.
- Bug fixing in an unfamiliar codebase: locate root cause, patch code, and make tests pass.
- Feature implementation: add a new endpoint/UI behavior and wire it through existing modules safely.
- Refactor + reliability work: improve structure without breaking behavior, then validate with checks/tests.
- Tool-assisted engineering loops: run commands, interpret logs, and iterate until acceptance criteria are met.
In plain terms: this chart estimates how long a model can keep doing realistic software work correctly before it fails.
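The tool-assisted loop described above (run checks, interpret the log, patch, retry) can be sketched abstractly. This is a hedged illustration, not any METR harness: `run_tests`, `propose_patch`, and `apply_patch` are hypothetical stand-ins supplied by the caller.

```python
def autonomous_loop(run_tests, propose_patch, apply_patch, max_attempts=5):
    """Iterate test -> patch -> retest until acceptance criteria pass.

    Returns the attempt number on success, or None if the attempt
    budget is exhausted without the tests passing.
    """
    for attempt in range(1, max_attempts + 1):
        ok, log = run_tests()
        if ok:
            return attempt  # acceptance criteria met
        # The model reads the failure log and proposes the next patch.
        apply_patch(propose_patch(log))
    return None

# Toy harness: the "tests" pass only after two patches have been applied.
state = {"patches": 0}
result = autonomous_loop(
    run_tests=lambda: (state["patches"] >= 2, "failing test log"),
    propose_patch=lambda log: "fix",
    apply_patch=lambda patch: state.update(patches=state["patches"] + 1),
)
# result == 3: two failing attempts that each apply a patch, then a pass.
```

The longer a model can sustain this loop without losing the thread (mis-reading logs, regressing earlier fixes), the further right it lands on the horizon chart.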
Source & update status
- Live data: Updated daily. Last refresh: Apr 06, 2026
- METR benchmark YAML artifact
  - Last updated: —
  - Caveats: none