
Claude Opus 4.7 on Humanity's Last Exam: 30.8% accuracy
Author: The LayerLens Team
The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.
Summary
Claude Opus 4.7 from Anthropic scored 30.8% on Humanity's Last Exam, ranking second of 97 models on this benchmark. In absolute terms, the score still falls in the weak band for Humanity's Last Exam: below the threshold for production reliance on this benchmark family, so the model should be considered only for narrow, fully tested tasks.
Model details
Provider: Anthropic
Model key: anthropic/claude-opus-4.7
Context length: 1,000,000 tokens
License: Proprietary
Open weights: no
Benchmark methodology
Benchmark goal: Humanity's Last Exam is designed to be the final closed-ended academic benchmark of its kind, with broad subject coverage probing understanding at the frontier of human knowledge across a wide range of academic disciplines.
Scoring metrics:
Accuracy: Not explicitly stated, but implied for a closed-ended academic benchmark; typically calculated as (Number of Correct Answers / Total Questions) × 100, as in the sketch below.
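As a concrete illustration, here is a minimal Python sketch of that computation. The `graded` list of booleans is an assumed stand-in for per-question grading output; it is not part of the benchmark's published tooling.

```python
def accuracy(graded: list[bool]) -> float:
    """Percentage of correct answers over a set of graded questions.

    graded: one boolean per benchmark question, True when the model's
    answer matched the reference answer.
    """
    if not graded:
        raise ValueError("no graded responses")
    return 100.0 * sum(graded) / len(graded)

# Example: 4 correct out of 13 questions -> ~30.8%
print(accuracy([True] * 4 + [False] * 9))  # 30.769...
```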
Analysis
Key takeaways:
The model exhibits significant capabilities in complex mathematical reasoning and proof, correctly identifying relevant theorems and algebraic structures for advanced group theory, topology, and analysis problems.
Despite strong conceptual understanding in many scientific areas, the model often struggles with numerical precision and exact value computation, particularly evident in Math and some quantitative Biology/Medicine tasks.
A notable weakness is the erratic recall of very specific domain knowledge or empirically derived facts, leading to confident but incorrect assertions, especially in nuanced biological/medical scenarios or specific mathematical constants.
The model demonstrates a mixture of highly advanced reasoning skills and surprising fragility in basic arithmetic and precise factual retrieval, suggesting a need for more robust grounding in exact data and rigorous numerical validation; a minimal sketch of such validation follows this list.
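One practical response to this arithmetic fragility is to re-verify any numeric answer against an independently computed exact value before accepting it. The sketch below assumes the model's answer has already been parsed to a float; the helper name and tolerance are illustrative, not drawn from the evaluation itself.

```python
from fractions import Fraction

def verify_numeric(model_answer: float, exact: Fraction,
                   rel_tol: float = 1e-9) -> bool:
    """Accept a model's numeric answer only if it matches an
    independently computed exact value within a relative tolerance."""
    reference = float(exact)
    if reference == 0.0:
        return abs(model_answer) <= rel_tol
    return abs(model_answer - reference) / abs(reference) <= rel_tol

# Example: checking a claimed value of 1/3 + 1/6 (exactly 1/2)
exact = Fraction(1, 3) + Fraction(1, 6)
print(verify_numeric(0.5, exact))     # True
print(verify_numeric(0.4999, exact))  # False
```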
Failure modes observed
Common failure modes:
Calculation Errors: Frequent arithmetical mistakes or misapplication of formulas after correctly identifying the overall approach, especially in Math problems involving multiple steps.
Misinterpretation of Problem Constraints: Overlooking or misinterpreting subtle but critical conditions in problem statements, leading to incorrect solution paths.
Incorrect Specialized Knowledge: Providing confidently asserted but factually incorrect information in highly specific domains.
Flawed Multimodal Analysis: Struggles to extract all necessary details or make correct inferences from visual information, particularly for detailed scientific diagrams or images.
Logical Gaps in Complex Deductions: In multi-step reasoning problems, the model sometimes commits non sequiturs or makes unjustified leaps in logic.
Sub-optimal Strategy for Optimization/Bounds: In problems requiring finding minimal/maximal values or optimal approaches under constraints, the model identifies general concepts but struggles to derive sharp bounds or practical solutions.
Secondary metrics
Failed prompts: 50
Readability score: 0.0
Toxicity score: 0.000
Ethics score: 0.000
Run this evaluation yourself
Stratix evaluates Claude Opus 4.7 continuously across 11+ benchmarks. To replicate this Humanity's Last Exam evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
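For readers who want to approximate such a run outside Stratix, the loop below is a rough, generic sketch rather than Stratix's actual pipeline: `ask_model` is a hypothetical stand-in for your model client, and `dataset` is assumed to be an iterable of (question, reference answer) pairs.

```python
from typing import Callable, Iterable, Tuple

def evaluate(dataset: Iterable[Tuple[str, str]],
             ask_model: Callable[[str], str]) -> float:
    """Score a model on a closed-ended QA dataset and return accuracy (%).

    ask_model is hypothetical: plug in whatever client queries your model.
    """
    correct = 0
    total = 0
    for question, reference in dataset:
        answer = ask_model(question)
        # Naive exact-match grading; real harnesses typically apply
        # answer normalization or model-based grading instead.
        correct += int(answer.strip().lower() == reference.strip().lower())
        total += 1
    if total == 0:
        raise ValueError("empty dataset")
    return 100.0 * correct / total
```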
_Source: Stratix evaluation 69e12d9c7857c3e44c75698d. Updated 2026-04-16._