Jump to content

METR

From Wikipedia, the free encyclopedia
METR
Formation2022; 3 years ago (2022)
FounderBeth Barnes
TypeNonprofit research institute
Legal status501(c)(3) tax exempt charity
PurposeAI safety research and model evaluation
Location
Websitemetr.org

METR (an acronym for Model Evaluation and Threat Research, pronounced "meter"), is a nonprofit research institute that evaluates frontier AI models' capabilities to carry out long-horizon, agentic tasks that some researchers argue could pose catastrophic risks to society.[1][2] They have worked with leading AI companies to conduct pre-deployment model evaluations and contribute to system cards, including OpenAI's o3, o4-mini, and GPT-4.5, and Anthropic's Claude models.[2][3][4][5]

METR's CEO and founder is Beth Barnes, a former alignment researcher at OpenAI who left in 2022 to form ARC Evals, the evaluation division of Paul Christiano's Alignment Research Center. In December 2023, ARC Evals was then spun off into an independent 501(c)(3) nonprofit and renamed METR.[6][7][8]

Research

[edit]

A substantial amount of METR's research is focused on the capabilities of AI systems to conduct research and development of AI systems themselves, including RE-Bench, a benchmark designed to test whether AIs can "solve research engineering tasks and accelerate AI R&D".[9][10]

A graph showing that the length of tasks frontier models are capable of executing at a 50% success rate doubled every 7 months from 2019-2024. The shaded region represents a 95% confidence interval.[11]

In March 2025, METR published a paper noting that the maximum length of software engineering task that the leading AI model could complete had a doubling time of around 7 months between 2019–2024.[12]

References

[edit]
  1. ^ "About METR". metr.org. Retrieved 2025-06-15.
  2. ^ a b "OpenAI o3 and o4-mini System Card". openai.com. Retrieved 2025-06-15.
  3. ^ "GPT-4.5 system card". openai.com. Retrieved 2025-06-15.
  4. ^ "Introducing Claude 3.5 Sonnet". www.anthropic.com. Retrieved 2025-06-15.
  5. ^ METR (2025-04-04). "Details about METR's preliminary evaluation of Claude 3.7". METR's Autonomy Evaluation Resources. Retrieved 2025-06-15.
  6. ^ "ARC Evals is now METR". METR Blog. 2023-12-04.
  7. ^ Booth, Harry (2024-09-05). "TIME100 AI 2024: Beth Barnes". TIME. Retrieved 2025-06-15.
  8. ^ Henshall, Will (2024-03-21). "Nobody Knows How to Safety-Test AI". TIME. Retrieved 2025-06-15.
  9. ^ "Claude 3.7 Sonnet System Card". Anthropic. 2025-02-24. Retrieved 2025-06-15.{{cite web}}: CS1 maint: url-status (link)
  10. ^ "Gemini 2.5 Pro Preview Model Card". Google. 2025-06-06. Retrieved 2025-06-15.{{cite web}}: CS1 maint: url-status (link)
  11. ^ "Measuring AI Ability to Complete Long Tasks". METR Blog. 2025-03-19.
  12. ^ Lovely, Garrison (2025-03-19). "AI could soon tackle projects that take humans weeks". Nature. doi:10.1038/d41586-025-00831-8. ISSN 1476-4687.
[edit]