
Claude Opus 4.1 on AIME 2025: 26.7% accuracy
Author: The LayerLens Team
The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.
Summary
Claude Opus 4.1 from Anthropic scored 26.7% on AIME 2025, ranking 84th of 139 models on this benchmark and placing it in the weak band. This is below the threshold for production reliance on this benchmark family; consider the model only for narrow, fully tested tasks.
Model details
Provider: Anthropic
Model key: anthropic/claude-opus-4.1
Context length: 200,000 tokens
License: Proprietary
Open weights: no
Benchmark methodology
Benchmark goal: The benchmark evaluates an LLM's ability to solve advanced mathematics problems from the American Invitational Mathematics Examination (AIME) 2025, focusing on its mathematical reasoning, problem-solving skills, and adherence to strict formatting requirements.
Analysis
Key takeaways:
Claude Opus 4.1 demonstrates strong performance on a subset of AIME 2025 Part I problems, solving several of them perfectly.
The model exhibits a mix of success and failure on more complex problems, particularly those requiring geometric reasoning or advanced counting techniques.
The model struggles on multipart problems and on questions where the solution requires a very specific insight or trick.
Failure modes observed
Common failure modes:
Incorrect application of geometric theorems
Errors in algebraic manipulation, particularly with complex expressions
Incomplete or flawed combinatorial reasoning
Difficulty with problems requiring a multi-step or insightful approach
Hallucinations in more complicated problems
Secondary metrics
Readability score: 61.0
Toxicity score: 0.009
Ethics score: 0.000
Run this evaluation yourself
Stratix evaluates Claude Opus 4.1 continuously across 11+ benchmarks. To replicate this AIME 2025 evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
_Source: Stratix evaluation 68926753286991f0a67df2ac. Updated 2025-08-05._