
Claude Opus 4.5 on AIME 2025: 63.3% accuracy
Author:
The LayerLens Team
The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.
Summary
Claude Opus 4.5 from Anthropic scored 63.3% on AIME 2025, ranking 55th of 139 models on this benchmark and placing it in the competitive band. It sits above the cost-effectiveness threshold for most production workloads; for agent use cases, pair it with a step-level evaluation harness.
Model details
Provider: Anthropic
Model key:
anthropic/claude-opus-4.5
Context length: 200,000 tokens
License: Proprietary
Open weights: no
Benchmark methodology
Benchmark goal: The benchmark evaluates the mathematical reasoning ability of large language models (LLMs) on the 2025 American Invitational Mathematics Examination (AIME), a competition covering topics such as algebra, number theory, combinatorics, geometry, and probability.
Scoring metrics:
accuracy: The percentage of correctly answered questions.
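The accuracy metric defined above reduces to a simple exact-match count. A minimal sketch, using illustrative function names rather than Stratix's actual API:

```python
def accuracy(predictions, references):
    """Percentage of predictions that exactly match the reference answer.

    For AIME, each reference is an integer in [0, 999] and a prediction
    scores only on an exact match; no partial credit is given.
    """
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)

# Toy run: two of three answers match the references.
print(accuracy([62, 143, 70], [62, 62, 70]))
```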
Analysis
Key takeaways:
On the 15 AIME 2025 problems presented in this analysis (drawn from Parts I and II), Claude Opus 4.5 solved 9 of 15 correctly, a 60% accuracy rate.
The model excels in problems that rely on direct application of mathematical principles, such as number theory, sequence analysis, and basic geometry.
Significant areas for improvement include combinatorics, advanced geometric probability, and precision in lengthy algebraic computations.
The model's performance suggests a strong foundational understanding in mathematics but indicates a need for enhanced error checking mechanisms in complex, multi-step problem-solving.
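One cheap error-checking mechanism of the kind the last takeaway suggests is validating that a model's final answer is even well-formed before scoring it: AIME answers are always integers from 0 to 999. This is a hedged sketch, not part of the Stratix harness:

```python
import re

def extract_aime_answer(text: str):
    """Return the last integer in the model output if it is a valid AIME
    answer (an integer in [0, 999]), else None.

    A fuller step-level harness would also re-check intermediate
    computations, not just the final answer's range.
    """
    matches = re.findall(r'-?\d+', text)
    if not matches:
        return None
    value = int(matches[-1])
    return value if 0 <= value <= 999 else None

print(extract_aime_answer("... so a+b+c = 62"))   # 62
print(extract_aime_answer("the answer is -5"))    # None (out of range)
```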
Failure modes observed
Common failure modes:
Misinterpretation of problem constraints, leading to incorrect base cases or assumptions.
Overlooking subtle details in geometric configurations or probability definitions.
Algebraic errors or miscalculations during lengthy derivations.
Incorrect application of specialized problem-solving techniques.
Example: In the parabola rotation problem, the model performed the rotation correctly and attempted to find the intersection. However, during the algebraic expansion and simplification it introduced an error that led to an incorrect quadratic equation, and hence the wrong values for a+b+c. The expansion of the squared term and the comparison of coefficients were both flawed, producing a final answer of 143 instead of the correct 62.
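The correct answer can be sanity-checked symbolically. The sketch below (a verification of the intended result, not a reproduction of the model's flawed derivation) assumes the standard setup for this problem: the parabola y = x² − 4 is rotated 60° counterclockwise about the origin, and the intersection of the curve with its image in the fourth quadrant has y-coordinate (a − √b)/c. It uses the usual trick that if a point P lies on both curves, rotating P by −60° must land back on the original parabola:

```python
import sympy as sp

x = sp.symbols('x', real=True)
y = x**2 - 4                      # the original parabola
s = sp.sqrt(3)

# Rotate P = (x, y) by -60 degrees; the image must satisfy y' = x'^2 - 4.
xr = x / 2 + s * y / 2            # x-coordinate of P rotated by -60°
yr = -s * x / 2 + y / 2           # y-coordinate of P rotated by -60°
quartic = sp.expand(yr - (xr**2 - 4))

roots = sp.solve(sp.Eq(quartic, 0), x)
# Fourth quadrant means x > 0 and y = x^2 - 4 < 0, i.e. 0 < x < 2.
root = [r for r in roots if r.is_real and 0 < r.evalf() < 2][0]
y_val = sp.simplify(root**2 - 4)

# The y-coordinate is (3 - sqrt(57))/2, so a + b + c = 3 + 57 + 2 = 62.
assert sp.simplify(y_val - (3 - sp.sqrt(57)) / 2) == 0
print(3 + 57 + 2)                 # 62
```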
Secondary metrics
Readability score: 70.2
Toxicity score: 0.012
Ethics score: 0.000
Run this evaluation yourself
Stratix evaluates Claude Opus 4.5 continuously across 11+ benchmarks. To replicate this AIME 2025 evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
Source: Stratix evaluation 6924ef91f4af8f8917baa65a. Updated 2025-11-24.