
Claude Opus 4.1 on Humanity's Last Exam: 7.3% accuracy
Author: The LayerLens Team
The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.
Summary
Claude Opus 4.1 from Anthropic scored 7.3% accuracy on Humanity's Last Exam, ranking 40th of 97 models and placing it in the top 50 on this benchmark. This puts the model in the weak band for Humanity's Last Exam, below the threshold for production reliance on this benchmark family; consider it only for narrow, fully tested tasks.
Model details
Provider: Anthropic
Model key: anthropic/claude-opus-4.1
Context length: 200,000 tokens
License: Proprietary
Open weights: no
Benchmark methodology
Benchmark goal: Evaluate the model's performance on a range of prompted tasks spanning biology, medicine, and mathematics.
Scoring metrics:
Score: Indicates whether the model's response is correct. A score of 1 means the answer is correct, while 0 means it is incorrect (see the aggregation sketch after this list).
Toxicity: A value between 0 and 1 representing the harmfulness and insensitivity of the generated response.
Readability: A numerical assessment of how easy the answer is to read.
Hallucination: Measures the potential for the generated answer to contain unrealistic or false information.
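To make the reported figures concrete, here is a minimal sketch (not Stratix's internal code) of how per-question scores of this shape can be aggregated into benchmark-level metrics; the field names are illustrative assumptions.

```python
# Minimal sketch: collapse per-question records into benchmark-level metrics.
# Field names ("score", "toxicity", "readability") are assumptions, not the
# evaluation pipeline's actual schema.
from statistics import mean

def aggregate(results: list[dict]) -> dict:
    """Aggregate per-question records into benchmark-level metrics.

    Each record is assumed to hold a binary correctness score (1 = correct,
    0 = incorrect) plus per-response toxicity and readability values.
    """
    return {
        # Accuracy is the share of questions scored 1 (reported here as 7.3%).
        "accuracy_pct": 100 * mean(r["score"] for r in results),
        "toxicity": mean(r["toxicity"] for r in results),
        "readability": mean(r["readability"] for r in results),
    }

# Example with three graded responses.
print(aggregate([
    {"score": 1, "toxicity": 0.001, "readability": 41.0},
    {"score": 0, "toxicity": 0.004, "readability": 43.5},
    {"score": 0, "toxicity": 0.004, "readability": 42.1},
]))
```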
Analysis
Key takeaways:
While Claude Opus 4.1 shows promise in certain specialized areas and can follow complex instructions, its usefulness is limited by inconsistency and a tendency toward numerical calculation errors and simple mistakes. This points to a need for more robust mathematical and strategic reasoning.
The model struggles with questions that require reasoning across domains or combining concepts, and it is highly inconsistent in what it gets right and wrong.
The model often produces convincing explanations even when it arrives at incorrect answers, so careful validation of its outputs is necessary, especially for quantitative tasks (see the validation sketch below).
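As a hedged illustration of the kind of downstream check the last takeaway recommends, the sketch below accepts a numeric answer only if it matches an independently computed reference value. The tolerance and parsing logic are assumptions, not part of the published evaluation.

```python
# Sketch of validating a quantitative model answer against an independently
# computed reference value. Tolerance and parsing are illustrative choices.
import re

def extract_number(answer: str) -> float | None:
    """Pull the last numeric value out of a free-text model answer."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", answer)
    return float(matches[-1]) if matches else None

def validate_quantitative(answer: str, expected: float, rel_tol: float = 1e-3) -> bool:
    """Accept the answer only if its final number matches the reference
    within a relative tolerance."""
    value = extract_number(answer)
    if value is None:
        return False
    return abs(value - expected) <= rel_tol * max(abs(expected), 1.0)

# A convincing explanation can still end in the wrong number.
print(validate_quantitative("Therefore the height is 182.4 cm.", expected=176.9))  # False
```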
Failure modes observed
Common failure modes:
Inability to accurately connect information from different contexts and answer across domains (biology, medicine, and math).
Inaccurate calculations: numerical mistakes and calculation errors, particularly in complex scenarios requiring precise analysis; see the height calculation question.
Confusion between negative and positive relationships; see the question on nectar caffeine concentration.
Misunderstanding of subtle nuances in text and instructions on complicated problems.
Example: In the question about a spine surgeon triaging patients, the model misinterpreted a prompt that depended on specific calculations, failed to extract the relevant information, and could not formulate the problem correctly, leading to an incorrect final answer.
Secondary metrics
Readability score: 42.2
Toxicity score: 0.003
Ethics score: 0.000
Run this evaluation yourself
Stratix evaluates Claude Opus 4.1 continuously across 11+ benchmarks. To replicate this Humanity's Last Exam evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
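For readers who want a rough starting point outside Stratix, here is a self-contained sketch of a replication harness that queries Claude Opus 4.1 through the Anthropic API and grades answers by exact match. This is not Stratix's pipeline; the model ID, prompt format, and grading rule are assumptions, so check provider documentation and the benchmark's official answer format before relying on the results.

```python
# Rough replication sketch using the Anthropic API directly (not Stratix's
# pipeline). Model ID, prompting, and exact-match grading are assumptions.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def grade_question(question: str, expected_answer: str) -> int:
    """Query the model once and return 1 for an exact-match answer, else 0."""
    response = client.messages.create(
        model="claude-opus-4-1",  # assumed API alias for Claude Opus 4.1
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": f"{question}\n\nAnswer with the final answer only.",
        }],
    )
    answer = response.content[0].text.strip()
    return int(answer.lower() == expected_answer.lower())

# Example: run over a small hypothetical question set and report accuracy.
questions = [("What is 17 * 24?", "408")]
scores = [grade_question(q, a) for q, a in questions]
print(f"accuracy: {100 * sum(scores) / len(scores):.1f}%")
```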
_Source: Stratix evaluation 69028c4d628c62510e5f59c8. Updated 2025-10-30._