Claude Opus 4.6 on Humanity's Last Exam: 18.6% accuracy

Author:

The LayerLens Team


The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.

Summary

Claude Opus 4.6 from Anthropic scored 18.6% on Humanity's Last Exam, placing it in the top 25 (rank 12 of 97) on this benchmark. This puts the model in the weak band for Humanity's Last Exam, below the threshold for production reliance on this benchmark family; consider it only for narrow, fully tested tasks.
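For context, the rank quoted above can be expressed as a percentile of the evaluated models; a minimal sketch using the figures from this report (the helper name is ours):

```python
# Express a leaderboard rank as a percentile, using the figures
# quoted above (rank 12 of 97 evaluated models).
def rank_percentile(rank: int, total: int) -> float:
    """Return the fraction of models at or above this rank, as a percent."""
    return 100.0 * rank / total

print(round(rank_percentile(12, 97), 1))  # → 12.4 (top ~12.4% of models)
```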

Model details

  • Provider: Anthropic

  • Model key: anthropic/claude-opus-4.6

  • Context length: 200,000 tokens

  • License: Proprietary

  • Open weights: no

Benchmark methodology

Benchmark goal: Humanity's Last Exam is designed to evaluate the ability of large language models (LLMs) to perform complex, multi-stage, human-like reasoning tasks under real-world constraints, with a specific focus on producing correct final answers together with coherent, useful intermediate reasoning and explanations.

Analysis

Key takeaways:

  • Claude Opus 4.6 is capable of detailed, human-like reasoning in highly complex scientific and mathematical multi-step tasks, demonstrating strong explanation capabilities.

  • The model's accuracy drops significantly on precise quantitative tasks, where even minor miscalculations or misinterpreted conditions lead to incorrect final answers.

  • Performance in specialized domains is inconsistent, swinging between accurate, well-reasoned analyses and critical errors due to subtle misunderstandings.

  • The model struggles with integrating multi-modal information effectively and maintaining high precision across long, complex problem-solving chains.

Failure modes observed

Common failure modes:

  • Mathematical errors in complex calculations, especially when multiple steps or nuanced interpretations are required.

  • Misinterpretation of problem constraints or specific definitions, particularly in specialized scientific or mathematical contexts.

  • Difficulty in reconciling conflicting or subtly ambiguous information within a problem description.

  • Over-reliance on general knowledge or common patterns when specific, precise details are critical for the correct answer.

  • Issues with visual interpretation or synthesis of information from provided images.

Secondary metrics

  • Failed prompts: 8

  • Readability score: 48.7

  • Toxicity score: 0.004

  • Ethics score: 0.000
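Secondary metrics like these are usually aggregates over per-prompt evaluation records; a sketch of how such an aggregation might look (the record fields and values are hypothetical, not Stratix's actual schema):

```python
# Aggregate per-prompt evaluation records into report-level secondary
# metrics. The record fields and sample values are hypothetical.
from statistics import mean

records = [
    {"failed": False, "readability": 52.1, "toxicity": 0.002},
    {"failed": True,  "readability": 45.3, "toxicity": 0.006},
]

failed_prompts = sum(r["failed"] for r in records)            # count of failures
avg_readability = mean(r["readability"] for r in records)     # mean over prompts
avg_toxicity = mean(r["toxicity"] for r in records)           # mean over prompts

print(failed_prompts, round(avg_readability, 1), round(avg_toxicity, 3))
# → 1 48.7 0.004
```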

Run this evaluation yourself

Stratix evaluates Claude Opus 4.6 continuously across 11+ benchmarks. To replicate this Humanity's Last Exam evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.

Source: Stratix evaluation 6984fce05a32e67148f2f6d0. Updated 2026-02-06.