METR

Formation	2022
Founder	Beth Barnes
Type	Nonprofit research institute
Legal status	501(c)(3) tax exempt charity
Purpose	AI safety research and model evaluation
Location	Berkeley, California
Website	metr.org

METR (an acronym for Model Evaluation and Threat Research, pronounced "meter") is a nonprofit research institute that evaluates the ability of frontier AI models to carry out long-horizon, agentic tasks, a capability that some researchers argue could pose catastrophic risks to society.^[1]^[2] It has worked with leading AI companies to conduct pre-deployment model evaluations and contribute to system cards, including those for OpenAI's o3, o4-mini, and GPT-4.5, and Anthropic's Claude models.^[2]^[3]^[4]^[5]

METR's CEO and founder is Beth Barnes, a former alignment researcher at OpenAI who left in 2022 to form ARC Evals, the evaluation division of Paul Christiano's Alignment Research Center. In December 2023, ARC Evals was spun off into an independent 501(c)(3) nonprofit and renamed METR.^[6]^[7]^[8]

Research

Much of METR's research focuses on the ability of AI systems to conduct AI research and development themselves. This work includes RE-Bench, a benchmark designed to test whether AIs can "solve research engineering tasks and accelerate AI R&D".^[9]^[10]

In March 2025, METR published a paper reporting that the maximum length of software engineering tasks that leading AI models could complete had doubled roughly every 7 months between 2019 and 2024.^[12]
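As an illustration of what a doubling time of about 7 months implies, the following minimal Python sketch assumes steady exponential growth at a constant rate. The function names and the constant-rate assumption are hypothetical simplifications introduced here for illustration; they are not taken from METR's paper.

```python
import math


def growth_factor(elapsed_months: float, doubling_months: float = 7.0) -> float:
    """Return the multiplicative growth after `elapsed_months`.

    Assumes a quantity that doubles every `doubling_months` months
    (illustrative assumption: the rate stays constant over time).
    """
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    return 2.0 ** (elapsed_months / doubling_months)


def months_to_reach(multiple: float, doubling_months: float = 7.0) -> float:
    """Return the months needed to grow by a factor of `multiple`."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    return doubling_months * math.log2(multiple)
```

Under this assumption, `growth_factor(14)` gives a fourfold increase, and a 60-month span corresponds to an increase of roughly 380-fold.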

References

[1] "About METR". metr.org. Retrieved 2025-06-15.
[2] "OpenAI o3 and o4-mini System Card". openai.com. Retrieved 2025-06-15.
[3] "GPT-4.5 system card". openai.com. Retrieved 2025-06-15.
[4] "Introducing Claude 3.5 Sonnet". www.anthropic.com. Retrieved 2025-06-15.
[5] METR (2025-04-04). "Details about METR's preliminary evaluation of Claude 3.7". METR's Autonomy Evaluation Resources. Retrieved 2025-06-15.
[6] "ARC Evals is now METR". METR Blog. 2023-12-04.
[7] Booth, Harry (2024-09-05). "TIME100 AI 2024: Beth Barnes". TIME. Retrieved 2025-06-15.
[8] Henshall, Will (2024-03-21). "Nobody Knows How to Safety-Test AI". TIME. Retrieved 2025-06-15.
[9] "Claude 3.7 Sonnet System Card". Anthropic. 2025-02-24. Retrieved 2025-06-15.
[10] "Gemini 2.5 Pro Preview Model Card". Google. 2025-06-06. Retrieved 2025-06-15.
[11] "Measuring AI Ability to Complete Long Tasks". METR Blog. 2025-03-19.
[12] Lovely, Garrison (2025-03-19). "AI could soon tackle projects that take humans weeks". Nature. doi:10.1038/d41586-025-00831-8. ISSN 1476-4687.

External links

Official website
