Claude Opus 4.6 on AIME 2025: 70.0% accuracy

Author: The LayerLens Team


The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.

Summary

Claude Opus 4.6 from Anthropic scored 70.0% on AIME 2025, ranking 48th of 139 models (top 50) on this benchmark and placing it in the competitive band. Its pricing sits above the cost-effective threshold for most production workloads. For agent use cases, pair it with a step-level evaluation harness.

Model details

  • Provider: Anthropic

  • Model key: anthropic/claude-opus-4.6

  • Context length: 200,000 tokens

  • License: Proprietary

  • Open weights: no

Benchmark methodology

Benchmark goal: To evaluate the model's ability to solve complex mathematical problems from the AIME competition, requiring multi-step reasoning, geometric understanding, combinatorial counting, and number theory skills.

Scoring metrics:

  • score: Binary metric indicating whether the final numerical answer is correct (1) or incorrect (0).

  • duration: Time taken by the model to generate the response.

Analysis

Key takeaways:

  • The model demonstrates a solid foundation across mathematical domains, excelling particularly in logic-driven problems such as number-theory conditions and routine combinatorial calculations.

  • Its performance is notably strong on problems that can be broken down into clear, sequential algebraic or geometric steps.

  • However, performance degrades significantly when problems require more abstract conceptual leaps, intricate pattern recognition across multiple iterations, or highly sensitive algebraic manipulation.

  • The model's current reliability on AIME-level problems is moderate (about 70% accuracy), with remaining errors concentrated in problems that demand non-standard problem-solving approaches.

Failure modes observed

Common failure modes:

  • Algebraic errors in complex multi-step calculations, particularly when squaring expressions or simplifying square roots.

  • Misinterpretation of problem constraints or properties, leading to incorrect case analysis or formula application.

  • Difficulty in recognizing iterative patterns or symmetries in sequence problems, resulting in incorrect closed-form solutions.

  • Errors in combinatorial logic, especially when dealing with nested conditions or the Principle of Inclusion-Exclusion for specific subsets.

Example: In the 'parabola rotation' problem, the model attempted a complex algebraic expansion which led to an incorrect equation, ultimately resulting in a wrong final answer despite showing an understanding of the rotation matrix.
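The inclusion-exclusion failure mode noted above is easy to illustrate with a classic count, here a hypothetical example rather than one of the evaluated problems: counting integers in 1..n divisible by at least one of several divisors, where dropping or double-counting a subset term silently shifts the answer.

```python
from itertools import combinations
from math import lcm

def count_divisible_by_any(n: int, divisors: tuple[int, ...]) -> int:
    """Count integers in 1..n divisible by at least one divisor,
    via the Principle of Inclusion-Exclusion over divisor subsets."""
    total = 0
    for k in range(1, len(divisors) + 1):
        sign = (-1) ** (k + 1)  # add odd-sized subsets, subtract even-sized
        for subset in combinations(divisors, k):
            total += sign * (n // lcm(*subset))
    return total
```

For n = 1000 and divisors (2, 3, 5), the alternating sum 500 + 333 + 200 - 166 - 100 - 66 + 33 gives 734; omitting any one correction term produces exactly the kind of off-by-a-subset error described in the failure list.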

Secondary metrics

  • Readability score: 75.9

  • Toxicity score: 0.026

  • Ethics score: 0.000

Run this evaluation yourself

Stratix evaluates Claude Opus 4.6 continuously across 11+ benchmarks. To replicate this AIME 2025 evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.

Source: Stratix evaluation 6984fce05a32e67148f2f6cf. Updated 2026-02-05.