
Claude Opus 4.5 on AIME 2025: 63.3% accuracy
Author:
The LayerLens Team
The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.
Summary
Claude Opus 4.5 from Anthropic scored 63.3% on AIME 2025, ranking 55th of 139 models on this benchmark and placing it in the competitive band. It sits above the cost-effectiveness threshold for most production workloads; for agent use cases, pair it with a step-level evaluation harness.
Model details
Provider: Anthropic
Model key:
anthropic/claude-opus-4.5
Context length: 200,000 tokens
License: Proprietary
Open weights: no
Benchmark methodology
Benchmark goal: The benchmark evaluates the mathematical reasoning ability of large language models (LLMs) on the 2025 American Invitational Mathematics Examination (AIME), a competition covering topics such as algebra, number theory, combinatorics, geometry, and probability.
Scoring metrics:
accuracy: The percentage of correctly answered questions.
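The accuracy metric defined above reduces to a simple exact-match count. A minimal sketch, using illustrative function names rather than Stratix's actual API:

```python
def accuracy(predictions, references):
    """Percentage of predictions that exactly match the reference answer.

    For AIME, each reference is an integer in [0, 999] and a prediction
    scores only on an exact match; no partial credit is given.
    """
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)

# Toy run: two of three answers match the references.
print(accuracy([62, 143, 70], [62, 62, 70]))
```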
Analysis
Key takeaways:
On the 15 AIME 2025 problems presented in this analysis (drawn from Parts I and II), Claude Opus 4.5 solved 9 of 15 correctly, a 60% accuracy rate.
The model excels in problems that rely on direct application of mathematical principles, such as number theory, sequence analysis, and basic geometry.
Significant areas for improvement include combinatorics, advanced geometric probability, and precision in lengthy algebraic computations.
The model's performance suggests a strong foundational understanding in mathematics but indicates a need for enhanced error checking mechanisms in complex, multi-step problem-solving.
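One cheap error-checking mechanism of the kind the last takeaway suggests is validating that a model's final answer is even well-formed before scoring it: AIME answers are always integers from 0 to 999. This is a hedged sketch, not part of the Stratix harness:

```python
import re

def extract_aime_answer(text: str):
    """Return the last integer in the model output if it is a valid AIME
    answer (an integer in [0, 999]), else None.

    A fuller step-level harness would also re-check intermediate
    computations, not just the final answer's range.
    """
    matches = re.findall(r'-?\d+', text)
    if not matches:
        return None
    value = int(matches[-1])
    return value if 0 <= value <= 999 else None

print(extract_aime_answer("... so a+b+c = 62"))   # 62
print(extract_aime_answer("the answer is -5"))    # None (out of range)
```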
Failure modes observed
Common failure modes:
Misinterpretation of problem constraints, leading to incorrect base cases or assumptions.
Overlooking subtle details in geometric configurations or probability definitions.
Algebraic errors or miscalculations during lengthy derivations.
Incorrect application of specialized problem-solving techniques.
Example: In the parabola rotation problem, the model performed the rotation correctly and attempted to find the intersection. However, during the algebraic expansion and simplification it introduced an error that led to an incorrect quadratic equation, and hence the wrong values for a+b+c. The expansion of the squared term and the comparison of coefficients were both flawed, producing a final answer of 143 instead of the correct 62.
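The correct answer can be sanity-checked symbolically. The sketch below (a verification of the intended result, not a reproduction of the model's flawed derivation) assumes the standard setup for this problem: the parabola y = x² − 4 is rotated 60° counterclockwise about the origin, and the intersection of the curve with its image in the fourth quadrant has y-coordinate (a − √b)/c. It uses the usual trick that if a point P lies on both curves, rotating P by −60° must land back on the original parabola:

```python
import sympy as sp

x = sp.symbols('x', real=True)
y = x**2 - 4                      # the original parabola
s = sp.sqrt(3)

# Rotate P = (x, y) by -60 degrees; the image must satisfy y' = x'^2 - 4.
xr = x / 2 + s * y / 2            # x-coordinate of P rotated by -60°
yr = -s * x / 2 + y / 2           # y-coordinate of P rotated by -60°
quartic = sp.expand(yr - (xr**2 - 4))

roots = sp.solve(sp.Eq(quartic, 0), x)
# Fourth quadrant means x > 0 and y = x^2 - 4 < 0, i.e. 0 < x < 2.
root = [r for r in roots if r.is_real and 0 < r.evalf() < 2][0]
y_val = sp.simplify(root**2 - 4)

# The y-coordinate is (3 - sqrt(57))/2, so a + b + c = 3 + 57 + 2 = 62.
assert sp.simplify(y_val - (3 - sp.sqrt(57)) / 2) == 0
print(3 + 57 + 2)                 # 62
```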
Secondary metrics
Readability score: 70.2
Toxicity score: 0.012
Ethics score: 0.000
Run this evaluation yourself
Stratix evaluates Claude Opus 4.5 continuously across 11+ benchmarks. To replicate this AIME 2025 evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
Source: Stratix evaluation 6924ef91f4af8f8917baa65a. Updated 2025-11-24.