Claude Opus 4.5 on Humanity's Last Exam: 13.6% accuracy

Author: The LayerLens Team

The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.

Summary

Claude Opus 4.5 from Anthropic scored 13.6% on Humanity's Last Exam, placing it in the top 25 (rank 22 of 97) on this benchmark. That score still falls in the weak band for Humanity's Last Exam: it is below the threshold for production reliance on this benchmark family, so consider the model only for narrow, fully-tested tasks.
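The relative and absolute readings in the summary can be reconciled with a few lines of arithmetic. In the sketch below, the score and rank are taken from this report; the weak-band cutoff is an illustrative assumption rather than Stratix's published threshold.

```python
# Minimal sketch of the rank arithmetic behind "top 25 (rank 22 of 97)".
# Score and rank come from this report; the weak-band cutoff below is an
# illustrative assumption, not Stratix's actual threshold.
score = 13.6          # Humanity's Last Exam accuracy, in percent
rank, total = 22, 97  # leaderboard position reported by Stratix

top_fraction = rank / total
print(f"Rank {rank} of {total} = top {top_fraction:.0%} of evaluated models")  # top 23%

# Illustrative absolute cutoff (assumption): treat sub-20% accuracy as the weak band.
WEAK_BAND_CUTOFF = 20.0
print("weak band" if score < WEAK_BAND_CUTOFF else "outside weak band")
```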

Model details

  • Provider: Anthropic

  • Model key: anthropic/claude-opus-4.5

  • Context length: 200,000 tokens

  • License: Proprietary

  • Open weights: no

Benchmark methodology

Benchmark goal: The benchmark is designed to evaluate the factual reasoning capabilities of large language models (LLMs) by assessing their ability to identify and correct factual errors in generated text, with a focus on historical facts and events.

Scoring metrics:

  • Factual Correction Score: This is the primary metric, measuring the LLM's ability to identify and correct factual errors.

  • Error Detection Score: This metric assesses the LLM's ability to correctly identify the factual error within a given statement.

  • F1-score: Used for evaluating the overlap between text spans, particularly for error detection.

  • EM (Exact Match): Used for factual correction; the model's output must exactly match the ground truth (see the scoring sketch after this list).
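To make the two headline metrics concrete, here is a minimal scoring sketch that follows the standard SQuAD-style definitions of exact match and token-overlap F1. The normalization rules and the example strings are illustrative assumptions, not necessarily the exact implementation behind this evaluation.

```python
from collections import Counter
import re
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace (assumed normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, truth: str) -> float:
    """EM: 1.0 only if the normalized prediction equals the normalized ground truth."""
    return float(normalize(prediction) == normalize(truth))

def f1_score(prediction: str, truth: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    pred_tokens = normalize(prediction).split()
    truth_tokens = normalize(truth).split()
    common = Counter(pred_tokens) & Counter(truth_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

# Illustrative strings: a near-miss scores 0 on EM only if normalization cannot reconcile it,
# while F1 still awards partial credit for overlapping tokens.
print(exact_match("the Treaty of Versailles (1919)", "Treaty of Versailles, 1919"))  # 1.0 after normalization
print(f1_score("Versailles treaty signed in 1920", "Treaty of Versailles, 1919"))    # partial overlap, about 0.44
```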

Analysis

Key takeaways:

  • Claude Opus 4.5 exhibits strong general reasoning and explanation capabilities across diverse scientific and mathematical domains.

  • The model's accuracy is significantly impacted by errors in precise numerical calculations and by misinterpretation of highly specific constraints or niche definitions within problems.

  • There is a clear opportunity for improvement in handling quantitative exactness and in ensuring all problem nuances are captured before committing to a final answer.

Failure modes observed

Common failure modes:

  • Incorrect numerical calculations or misapplied formulas, with results often off by only a small margin.

  • Misinterpretation of problem constraints or specific definitions, leading to an incorrect starting premise for the solution path.

  • Partial correctness in reasoning, but a final incorrect step or conclusion due to a subtle flaw in logic or an oversimplification.

  • Difficulty in identifying the most crucial piece of information or constraint in multi-faceted questions, especially in Biology/Medicine scenarios.

Example: In a math problem asking for the sum of all possible values of cos(theta), the model correctly identified the two possibilities (a positive and a negative value) but then reported the sum as 0, even though the problem implicitly required the actual numerical values; that choice made the final product calculation incorrect.
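The underlying exam item is not reproduced in this report, but the arithmetic pattern behind that failure is easy to illustrate. In the hypothetical sketch below, cos(theta) is assumed to take the values +c and -c: their sum is always 0, yet any downstream product depends on the individual values, so reporting only the sum discards the information the later step needs.

```python
# Hypothetical illustration of the failure pattern (not the actual exam item).
c = 0.6                                     # assume the problem allows cos(theta) = +c or -c
values = [c, -c]

sum_of_values = sum(values)                 # 0.0 -- the quantity the model reported
product_of_values = values[0] * values[1]   # -0.36 -- requires the individual values

print(sum_of_values, product_of_values)
# Collapsing the two candidates into their sum loses the information the
# final product calculation depends on.
```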

Secondary metrics

  • Failed prompts: 8

  • Readability score: 0.0

  • Toxicity score: 0.000

  • Ethics score: 0.000

Run this evaluation yourself

Stratix evaluates Claude Opus 4.5 continuously across 11+ benchmarks. To replicate this Humanity's Last Exam evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
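For readers who prefer to prototype outside Stratix first, the loop below sketches a bare-bones exact-match accuracy harness over a local question/answer file. The file name, record fields, and the query_model callable are hypothetical placeholders; the prompts, grading, and configuration Stratix actually uses for Humanity's Last Exam are not documented in this report.

```python
import json
from typing import Callable

def evaluate(dataset_path: str, query_model: Callable[[str], str]) -> float:
    """Exact-match accuracy over a JSONL file of {"question", "answer"} records.
    Hypothetical harness for illustration; not the Stratix implementation."""
    correct = 0
    total = 0
    with open(dataset_path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            prediction = query_model(record["question"])
            correct += int(prediction.strip().lower() == record["answer"].strip().lower())
            total += 1
    return correct / total if total else 0.0

# Usage (hypothetical): plug in any callable that maps a question string to an answer string.
if __name__ == "__main__":
    def placeholder_model(question: str) -> str:
        return "placeholder answer"          # stand-in for a real model call
    accuracy = evaluate("hle_subset.jsonl", placeholder_model)
    print(f"Exact-match accuracy: {accuracy:.1%}")
```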

Source: Stratix evaluation 6925a7ad0220ad32a240bfac. Updated 2026-02-11.