
Claude Opus 4.1 on AIME 2025: 26.7% accuracy
Author: The LayerLens Team
The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.
Summary
Claude Opus 4.1 from Anthropic scored 26.7% on AIME 2025, ranking 84th of 139 models on this benchmark and placing it in the weak band. This is below the threshold for production reliance on this benchmark family; consider the model only for narrow, fully tested tasks.
Model details
Provider: Anthropic
Model key: anthropic/claude-opus-4.1
Context length: 200,000 tokens
License: Proprietary
Open weights: no
Benchmark methodology
Benchmark goal: The benchmark evaluates an LLM's ability to solve advanced mathematics problems from the American Invitational Mathematics Examination (AIME) 2025, focusing on its mathematical reasoning, problem-solving skills, and adherence to strict formatting requirements.
Analysis
Key takeaways:
Claude Opus 4.1 demonstrates strong performance on a subset of AIME 2025 Part I problems, solving several of them perfectly.
The model exhibits a mix of success and failure on more complex problems, particularly those requiring geometric reasoning or advanced counting techniques.
The model struggles on multipart problems and on questions where the solution requires a very specific insight or trick.
Failure modes observed
Common failure modes:
Incorrect application of geometric theorems
Errors in algebraic manipulation, particularly with complex expressions
Incomplete or flawed combinatorial reasoning
Difficulty with problems requiring a multi-step or insightful approach
Hallucinations in more complicated problems
Secondary metrics
Readability score: 61.0
Toxicity score: 0.009
Ethics score: 0.000
Run this evaluation yourself
Stratix evaluates Claude Opus 4.1 continuously across 11+ benchmarks. To replicate this AIME 2025 evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
_Source: Stratix evaluation 68926753286991f0a67df2ac. Updated 2025-08-05._