Claude Opus 4.7 on AIME 2026: 90.0% accuracy

Author:

The LayerLens Team


The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.

Summary

Claude Opus 4.7 from Anthropic scored 90.0% on AIME 2026, ranking 6th of 13 models on this benchmark. That score falls in the saturated band for AIME 2026: most frontier models cluster near this ceiling, so cross-benchmark behavior matters more than the headline score for production decisions.

Model details

  • Provider: Anthropic

  • Model key: anthropic/claude-opus-4.7

  • Context length: 1,000,000 tokens

  • License: Proprietary

  • Open weights: no

Benchmark methodology

Benchmark goal: AIME 2026 evaluates single-shot mathematical problem solving across advanced topics including algebra, geometry, number theory, and combinatorics.

Scoring metrics:

  • Accuracy: Accuracy = (number of problems answered with the exact reference integer / total problems) × 100. AIME answers are integers from 000 to 999, so a response is counted only if it matches the reference answer exactly; there is no partial credit.
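The exact-match metric above can be sketched as a short scoring function. This is a minimal illustration, not Stratix's implementation; the function name and input format are assumptions.

```python
def score_aime(predictions, references):
    """Return exact-match accuracy (%) for AIME-style answers.

    predictions: list of model answers (int in 0-999, or None if unparseable)
    references:  list of reference integer answers, same length
    """
    if len(predictions) != len(references):
        raise ValueError("prediction/reference length mismatch")
    # A problem scores only on an exact integer match in the valid range.
    correct = sum(
        1
        for pred, ref in zip(predictions, references)
        if pred is not None and 0 <= pred <= 999 and pred == ref
    )
    return correct / len(references) * 100
```

Under this definition, 27 correct answers out of 30 problems yields 90.0%; any near-miss (off-by-one, out-of-range, or unparseable output) scores zero for that problem.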

Analysis

Key takeaways:

  • The model achieved a high accuracy of 93.33% on advanced mathematical problems.

  • It demonstrated robust mathematical reasoning, correctly solving complex problems in algebra, geometry, number theory, and combinatorics.

  • The primary failure mode was intricate enumeration and recursive counting problems, particularly those requiring precise handling of complex definitions and substructure interactions.

  • Despite high accuracy, certain combinatorial problems involving multi-level recursive definitions and subtle constraints remain challenging.

Failure modes observed

Common failure modes:

  • Misinterpretation of non-overlapping conditions in tiling problems, specifically regarding the "cell loop" definition and its implications for recursion.

  • Overlooking specific constraints for prime factorization, as seen in the 'number of ways to partition a 10x10 grid' question, leading to incorrect enumeration of sub-problem solutions.

  • Minor calculation errors in multi-step problems, such as a miscalculation in the number of possible values for 'h' in the geometry problem, leading to off-by-one errors.

Example: In the question 'Find the number of ways to partition a 10x10 grid of cells into 5 cell loops...', the model struggled with the recursive definition of 'H(a,b)', which led it to an incorrect final count.

Secondary metrics

  • Readability score: 0.0

  • Toxicity score: 0.000

  • Ethics score: 0.000

Run this evaluation yourself

Stratix evaluates Claude Opus 4.7 continuously across 11+ benchmarks. To replicate this AIME 2026 evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.

Source: Stratix evaluation 69e12d9c7857c3e44c75698b. Updated 2026-04-16.