
Claude Opus 4.6 on Humanity's Last Exam: 18.6% accuracy
Author:
The LayerLens Team
The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.
Summary
Claude Opus 4.6 from Anthropic scored 18.6% on Humanity's Last Exam, placing it in the top 25 (rank 12 of 97). Despite that ranking, the score falls in the weak band for Humanity's Last Exam: it is below the threshold for production reliance on this benchmark family, so consider the model only for narrow, fully tested tasks.
Model details
Provider: Anthropic
Model key: anthropic/claude-opus-4.6
Context length: 200,000 tokens
License: Proprietary
Open weights: no
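For illustration, here is a minimal sketch of querying this model through the Anthropic Messages API. The exact API model identifier is an assumption on our part; the key listed above is the evaluation platform's provider/model key, and requests must fit within the 200,000-token context window.

import anthropic

# Minimal, hypothetical sketch: send one prompt to the model and print the reply.
# The model string below is assumed from the key above, not confirmed by this report.
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4.6",   # assumed API name derived from anthropic/claude-opus-4.6
    max_tokens=1024,           # output budget; the context window itself is 200,000 tokens
    messages=[{"role": "user", "content": "State only the final answer: 17 * 23 = ?"}],
)
print(response.content[0].text)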
Benchmark methodology
Benchmark goal: Humanity's Last Exam is designed to evaluate the ability of large language models (LLMs) to perform complex, multi-stage, human-like reasoning tasks under real-world constraints, with a focus on producing correct final answers alongside coherent, useful intermediate reasoning and explanations.
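To make the scoring concrete, the sketch below shows one simple way such an evaluation can be reduced to an accuracy number: pose each question, take the model's final answer, and compare it to the reference. This is a simplified exact-match grader under our own assumptions, not Stratix's or the benchmark's official grading logic; free-form answers are often judged more leniently or by a separate grader model.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Item:
    question: str
    reference: str  # expected final answer

def normalize(answer: str) -> str:
    # Light normalization only; real graders tend to be more forgiving.
    return answer.strip().lower()

def accuracy(items: List[Item], answer_fn: Callable[[str], str]) -> float:
    """Fraction of items whose normalized final answer matches the reference."""
    if not items:
        return 0.0
    correct = sum(normalize(answer_fn(it.question)) == normalize(it.reference) for it in items)
    return correct / len(items)

# Usage with a stand-in answer function:
items = [Item("2 + 2 = ?", "4"), Item("Capital of France?", "Paris")]
print(accuracy(items, answer_fn=lambda q: "4"))  # 0.5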
Analysis
Key takeaways:
Claude Opus 4.6 is capable of detailed, human-like reasoning in highly complex scientific and mathematical multi-step tasks, demonstrating strong explanation capabilities.
The model's accuracy drops significantly on precise quantitative tasks, where even minor miscalculations or misreadings of the stated conditions lead to incorrect final answers.
Performance in specialized domains is inconsistent, swinging between accurate, well-reasoned analyses and critical errors due to subtle misunderstandings.
The model struggles with integrating multi-modal information effectively and maintaining high precision across long, complex problem-solving chains.
Failure modes observed
Common failure modes:
Mathematical errors in complex calculations, especially when multiple steps or nuanced interpretations are required.
Misinterpretation of problem constraints or specific definitions, particularly in specialized scientific or mathematical contexts.
Difficulty in reconciling conflicting or subtly ambiguous information within a problem description.
Over-reliance on general knowledge or common patterns when specific, precise details are critical for the correct answer.
Issues with visual interpretation or synthesis of information from provided images.
Secondary metrics
Failed prompts: 8
Readability score: 48.7
Toxicity score: 0.004
Ethics score: 0.000
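As a rough illustration of how a secondary metric like the readability score might be produced, the sketch below computes Flesch Reading Ease over a sample model response using the textstat package. Whether this is the exact readability metric behind the 48.7 figure is an assumption; toxicity and ethics scores are typically produced by separate classifier models and are not shown here.

import textstat

# Hypothetical readability check over one model response (0-100, higher = easier to read).
sample_output = (
    "The integral converges because the integrand is bounded above by an "
    "exponentially decaying function for all sufficiently large arguments."
)
print(textstat.flesch_reading_ease(sample_output))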
Run this evaluation yourself
Stratix evaluates Claude Opus 4.6 continuously across 11+ benchmarks. To replicate this Humanity's Last Exam evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
Source: Stratix evaluation 6984fce05a32e67148f2f6d0. Updated 2026-02-06.