
DeepSeek V4 Flash on AIME 2026: 96.7% accuracy
Author: The LayerLens Team
The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.
Summary
DeepSeek V4 Flash from DeepSeek scored 96.7% accuracy on AIME 2026, ranking second of 14 models evaluated on this benchmark. That score falls in the saturated band for AIME 2026: most frontier models cluster near the ceiling, so cross-benchmark behavior matters more than the headline score for production decisions.
Model details
Provider: DeepSeek
Model key: deepseek/deepseek-v4-flash
Context length: 1,048,576 tokens
License: MIT
Open weights: yes
Benchmark methodology
Benchmark goal: The benchmark is designed to evaluate single-shot mathematical problem-solving capabilities of LLMs across various advanced mathematical topics.
Scoring metrics:
Accuracy: (Number of Correct Answers / Total Problems) * 100
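As a worked example of the formula, the sketch below computes accuracy from raw counts. The problem count is an assumption: the report does not state it, but 29 correct answers out of 30 problems (the two 15-problem AIME exams combined) reproduces the reported 96.67%.

```python
def accuracy(num_correct: int, total_problems: int) -> float:
    """Accuracy as a percentage: (correct / total) * 100."""
    if total_problems <= 0:
        raise ValueError("total_problems must be positive")
    if not 0 <= num_correct <= total_problems:
        raise ValueError("num_correct must be between 0 and total_problems")
    return num_correct / total_problems * 100


# Assumed counts: 30 problems (AIME I + AIME II), 29 answered correctly.
score = accuracy(29, 30)
print(f"{score:.2f}%")  # 96.67%
```

Rounded to one decimal place, the same value gives the 96.7 headline figure.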
Analysis
Key takeaways:
The model demonstrated high accuracy (96.67%) on single-shot mathematical problem-solving tasks.
It successfully handled a variety of advanced mathematical topics from AIME competitions.
One notable error occurred on a combinatorics problem: the model's answer was close but incorrect, which suggests room for improvement in complex combinatorial reasoning.
Failure modes observed
Common failure modes:
Miscalculation in complex combinatorial problems.
Small numerical errors during intermediate steps.
Example: On the problem involving partitioning a 10x10 grid into 5 cell loops, the model answered 81, but the correct answer was 83. This near miss points to a minor miscalculation or an overlooked condition in the problem's combinatorial constraints.
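AIME answers are integers from 000 to 999, so grading is typically exact match with no partial credit: an answer of 81 against a reference of 83 scores the same as any other wrong answer. The sketch below illustrates that kind of grader. The function names and the normalization step (trimming whitespace, ignoring leading zeros, rejecting anything outside 0 to 999) are hypothetical choices for illustration, not a description of how Stratix parses model output.

```python
import re
from typing import Optional


def normalize_answer(raw: str) -> Optional[int]:
    """Parse an AIME-style answer ("081", " 83 ") into an int in [0, 999].

    Hypothetical helper: returns None when the text is not a plain
    ASCII integer in the valid AIME answer range.
    """
    text = raw.strip()
    if not re.fullmatch(r"[0-9]+", text):
        return None
    value = int(text)  # int() ignores leading zeros, so "081" -> 81
    return value if value <= 999 else None


def is_correct(model_answer: str, reference: str) -> bool:
    """Exact match after normalization; near misses earn no credit."""
    predicted = normalize_answer(model_answer)
    return predicted is not None and predicted == normalize_answer(reference)


print(is_correct("81", "83"))   # False: off by 2 still counts as wrong
print(is_correct("083", "83"))  # True: leading zeros are ignored
```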
Secondary metrics
Readability score: 0.0
Toxicity score: 0.000
Ethics score: 0.000
Run this evaluation yourself
Stratix evaluates DeepSeek V4 Flash continuously across 11+ benchmarks. To replicate this AIME 2026 evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
Source: Stratix evaluation 69efc9abd05530877e5d4ef1. Updated 2026-04-27.