
DeepSeek V4 Pro on AIME 2026: 96.7% accuracy
Author:
The LayerLens Team
The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.
TL;DR
DeepSeek V4 Pro scored 96.7% on AIME 2026, ranking first of 14 models evaluated by Stratix.
The model sits in a saturated performance band where most frontier models cluster near the ceiling.
Strong across arithmetic, algebra, geometry, and probability, indicating robust mathematical reasoning.
One failure observed in complex combinatorial counting where pattern extrapolation broke down.
Cross-benchmark comparison matters more than the headline score at this performance tier.
Introduction
DeepSeek V4 Pro from DeepSeek scored 96.7% on AIME 2026, ranking first of the 14 models evaluated on this benchmark. That score sits in the saturated band for AIME 2026, where most frontier models cluster near the ceiling. At this performance tier, cross-benchmark behavior matters more than the headline score for production decisions.
The model supports a 1,048,576-token context window and is released under an MIT license with open weights. This evaluation was conducted independently on the Stratix platform.
Benchmark Methodology
Benchmark goal: Evaluate the model's single-shot problem-solving ability on competition-level mathematics.
Scoring metrics: Accuracy is calculated as (Number of Correct Answers / Total Problems) × 100. The model must produce the exact three-digit integer answer (000–999) to receive credit; no partial credit is given.
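The scoring rule above can be sketched in Python. This is a minimal illustration of exact-match accuracy under the stated rule; `score_exact_match` is a hypothetical helper, not the actual Stratix implementation:

```python
def score_exact_match(predictions, references):
    """Return accuracy as a percentage, with no partial credit.

    AIME answers are integers from 000 to 999, so responses are
    normalized to three digits before comparison.
    """
    assert len(predictions) == len(references)
    correct = sum(
        pred.strip().zfill(3) == ref.strip().zfill(3)
        for pred, ref in zip(predictions, references)
    )
    return 100.0 * correct / len(references)

# Example: 29 correct answers out of 30 problems
print(round(score_exact_match(["042"] * 29 + ["999"], ["042"] * 30), 1))  # → 96.7
```

Note that 29/30 ≈ 96.7%, matching the headline score if the run covered a 30-problem set.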
Analysis
DeepSeek V4 Pro demonstrates strong problem solving on advanced mathematical competition problems. It performs well across arithmetic, algebra, geometry, and probability, indicating robust mathematical reasoning. A single error on a combinatorial problem points to a potential area for refinement in pattern-based counting.
Failure Modes Observed
Common failure modes: Miscalculation in complex combinatorial counting, where errors in intermediate steps propagate to an incorrect final sum.
In one problem involving partitioning a grid into cell loops, the model identified a pattern but applied it incorrectly, yielding a wrong answer. This suggests a logical misstep in pattern recognition or extrapolation specific to that problem's details.
Secondary Metrics
Readability score: 0.0. Toxicity score: 0.000. Ethics score: 0.000.
Run This Evaluation Yourself
Stratix evaluates DeepSeek V4 Pro continuously across 11+ benchmarks. To replicate this AIME 2026 evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
Source: Stratix evaluation. Updated 2026-04-27.