Claude Opus 4.1 on AIME 2025: 26.7% accuracy

Author: The LayerLens Team


The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.

Summary

Claude Opus 4.1 from Anthropic scored 26.7% on AIME 2025, placing it at rank 84 of 139 on this benchmark. This puts the model in the weak band for AIME 2025, below the threshold for production reliance on this benchmark family; consider it only for narrow, fully tested tasks.
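To put rank 84 of 139 in perspective, a simple rank-to-percentile conversion (one of several conventions; not the metric Stratix itself reports) shows the model near the 40th percentile of the leaderboard:

```python
rank, total = 84, 139  # from the leaderboard figures above

# Fraction of ranked models that score strictly below this one
percentile = (total - rank) / total * 100
print(f"{percentile:.1f}th percentile")  # -> 39.6th percentile
```

By this convention, roughly 60% of the 139 evaluated models outscore Claude Opus 4.1 on AIME 2025, consistent with the weak-band placement.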

Model details

  • Provider: Anthropic

  • Model key: anthropic/claude-opus-4.1

  • Context length: 200,000 tokens

  • License: Proprietary

  • Open weights: no

Benchmark methodology

Benchmark goal: The benchmark evaluates an LLM's ability to solve advanced mathematics problems from the American Invitational Mathematics Examination (AIME) 2025, focusing on its mathematical reasoning, problem-solving skills, and adherence to strict formatting requirements.

Analysis

Key takeaways:

  • Claude Opus 4.1 demonstrates strong performance on a subset of AIME 2025 Part I problems, achieving a perfect score on several tasks.

  • The model exhibits a mix of success and failure on more complex problems, particularly those requiring geometric reasoning or advanced counting techniques.

  • The model struggles on multipart problems and on questions where the solution requires a very specific insight or trick.

Failure modes observed

Common failure modes:

  • Incorrect application of geometric theorems

  • Errors in algebraic manipulation, particularly with complex expressions

  • Incomplete or flawed combinatorial reasoning

  • Difficulty with problems requiring a multi-step or insightful approach

  • Hallucinations in more complicated problems

Secondary metrics

  • Readability score: 61.0

  • Toxicity score: 0.009

  • Ethics score: 0.000

Run this evaluation yourself

Stratix evaluates Claude Opus 4.1 continuously across 11+ benchmarks. To replicate this AIME 2025 evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.

_Source: Stratix evaluation 68926753286991f0a67df2ac. Updated 2025-08-05._