
Claude Opus 4.1 on Humanity's Last Exam: 7.3% accuracy
Author: The LayerLens Team
The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.
Summary
Claude Opus 4.1 from Anthropic scored 7.3% accuracy on Humanity's Last Exam, ranking 40th of 97 models and placing it in the top 50 on this benchmark. This puts the model in the weak band for Humanity's Last Exam, below the threshold for production reliance on this benchmark family; consider it only for narrow, fully tested tasks.
Model details
Provider: Anthropic
Model key: anthropic/claude-opus-4.1
Context length: 200,000 tokens
License: Proprietary
Open weights: no
Benchmark methodology
Benchmark goal: Evaluate the model's performance on a range of prompted tasks spanning biology, medicine, and mathematics.
Scoring metrics:
Score: Indicates whether the model's response is correct. A score of 1 means the answer is correct, while 0 means it is incorrect (see the aggregation sketch after this list).
Toxicity: A value between 0 and 1 representing the harmfulness and insensitivity of the generated response.
Readability: A numerical assessment of how easy the answer is to read.
Hallucination: Measures the potential for the generated answer to contain unrealistic or false information.
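To make the reported figures concrete, here is a minimal sketch (not Stratix's internal code) of how per-question scores of this shape can be aggregated into benchmark-level metrics; the field names are illustrative assumptions.

```python
# Minimal sketch: collapse per-question records into benchmark-level metrics.
# Field names ("score", "toxicity", "readability") are assumptions, not the
# evaluation pipeline's actual schema.
from statistics import mean

def aggregate(results: list[dict]) -> dict:
    """Aggregate per-question records into benchmark-level metrics.

    Each record is assumed to hold a binary correctness score (1 = correct,
    0 = incorrect) plus per-response toxicity and readability values.
    """
    return {
        # Accuracy is the share of questions scored 1 (reported here as 7.3%).
        "accuracy_pct": 100 * mean(r["score"] for r in results),
        "toxicity": mean(r["toxicity"] for r in results),
        "readability": mean(r["readability"] for r in results),
    }

# Example with three graded responses.
print(aggregate([
    {"score": 1, "toxicity": 0.001, "readability": 41.0},
    {"score": 0, "toxicity": 0.004, "readability": 43.5},
    {"score": 0, "toxicity": 0.004, "readability": 42.1},
]))
```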
Analysis
Key takeaways:
While Claude Opus 4.1 shows promise in certain specialized areas and can follow complex instructions, its usefulness is limited by inconsistency and a tendency toward numerical calculation errors and simple mistakes. This points to a need for more robust mathematical and strategic reasoning.
The model struggles with questions that require reasoning across domains or combining concepts, and it is highly inconsistent in what it gets right and wrong.
The model often produces convincing explanations even when it arrives at incorrect answers, so careful validation of its outputs is necessary, especially for quantitative tasks (see the validation sketch below).
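As a hedged illustration of the kind of downstream check the last takeaway recommends, the sketch below accepts a numeric answer only if it matches an independently computed reference value. The tolerance and parsing logic are assumptions, not part of the published evaluation.

```python
# Sketch of validating a quantitative model answer against an independently
# computed reference value. Tolerance and parsing are illustrative choices.
import re

def extract_number(answer: str) -> float | None:
    """Pull the last numeric value out of a free-text model answer."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", answer)
    return float(matches[-1]) if matches else None

def validate_quantitative(answer: str, expected: float, rel_tol: float = 1e-3) -> bool:
    """Accept the answer only if its final number matches the reference
    within a relative tolerance."""
    value = extract_number(answer)
    if value is None:
        return False
    return abs(value - expected) <= rel_tol * max(abs(expected), 1.0)

# A convincing explanation can still end in the wrong number.
print(validate_quantitative("Therefore the height is 182.4 cm.", expected=176.9))  # False
```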
Failure modes observed
Common failure modes:
Inability to accurately connect information from different contexts and answer across domains (biology, medicine, and math).
Inaccurate calculations: numerical mistakes and calculation errors, particularly in complex scenarios requiring precise analysis; see the height calculation question.
Confusion between negative and positive relationships; see the question on nectar caffeine concentration.
Misunderstanding of subtle nuances in text and instructions on complicated problems.
Example: In the question about a spine surgeon triaging patients, the model misinterpreted a prompt that depended on specific calculations, failed to extract the relevant information, and could not formulate the problem correctly, leading to an incorrect final answer.
Secondary metrics
Readability score: 42.2
Toxicity score: 0.003
Ethics score: 0.000
Run this evaluation yourself
Stratix evaluates Claude Opus 4.1 continuously across 11+ benchmarks. To replicate this Humanity's Last Exam evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
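For readers who want a rough starting point outside Stratix, here is a self-contained sketch of a replication harness that queries Claude Opus 4.1 through the Anthropic API and grades answers by exact match. This is not Stratix's pipeline; the model ID, prompt format, and grading rule are assumptions, so check provider documentation and the benchmark's official answer format before relying on the results.

```python
# Rough replication sketch using the Anthropic API directly (not Stratix's
# pipeline). Model ID, prompting, and exact-match grading are assumptions.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def grade_question(question: str, expected_answer: str) -> int:
    """Query the model once and return 1 for an exact-match answer, else 0."""
    response = client.messages.create(
        model="claude-opus-4-1",  # assumed API alias for Claude Opus 4.1
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": f"{question}\n\nAnswer with the final answer only.",
        }],
    )
    answer = response.content[0].text.strip()
    return int(answer.lower() == expected_answer.lower())

# Example: run over a small hypothetical question set and report accuracy.
questions = [("What is 17 * 24?", "408")]
scores = [grade_question(q, a) for q, a in questions]
print(f"accuracy: {100 * sum(scores) / len(scores):.1f}%")
```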
_Source: Stratix evaluation 69028c4d628c62510e5f59c8. Updated 2025-10-30._