DeepSeek V4 Flash on AIME 2026: 96.7% accuracy

Author: The LayerLens Team
Last updated: May 13, 2026
Published: Feb 18, 2026

The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.

Summary

DeepSeek V4 Flash from DeepSeek scored 96.7% on AIME 2026, ranking second of 14 on this benchmark. That score falls in the saturated band for AIME 2026: most frontier models cluster near this ceiling, so cross-benchmark behavior matters more than the headline score for production decisions.

Model details

Provider: DeepSeek
Model key: deepseek/deepseek-v4-flash
Context length: 1,048,576 tokens
License: MIT
Open weights: yes

Benchmark methodology

Benchmark goal: Evaluate the single-shot mathematical problem-solving ability of LLMs across a range of advanced mathematical topics.

Scoring metrics:

Accuracy: (Number of Correct Answers / Total Problems) * 100
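
As a rough illustration of the formula above, the sketch below computes accuracy from raw counts. The counts are assumptions, not figures from the evaluation record: a 30-problem set with a single miss is used because it reproduces the reported 96.67%. The function name `accuracy` is hypothetical.

```python
def accuracy(correct: int, total: int) -> float:
    """Return accuracy as a percentage: (correct / total) * 100."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    return correct / total * 100


# Assumed counts (not stated in the report): 30 problems, 29 correct.
score = accuracy(correct=29, total=30)
print(f"{score:.2f}")  # 96.67
```

On a problem set this small, each answer moves the score by more than three percentage points, which is one reason small gaps between models near the ceiling carry little information.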

Analysis

Key takeaways:

The model demonstrated high accuracy (96.67%) on single-shot mathematical problem-solving tasks.
It successfully handled a variety of advanced mathematical topics from AIME competitions.
One notable error occurred in a combinatorics problem, where the model's derived answer was close to the correct value but still wrong, suggesting room for improvement in complex combinatorial reasoning.

Failure modes observed

Common failure modes:

Miscalculation in complex combinatorial problems.
Small numerical errors during intermediate steps.

Example: In the problem involving partitioning a 10x10 grid into 5 cell loops, the model answered 81, but the correct answer was 83. This points to a minor miscalculation or an overlooked condition in the combinatorial problem.
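
The near miss in the example above earns no credit if answers are graded by exact match. The sketch below assumes exact-match grading on integer answers, which this report does not state explicitly. The function name `grade` is hypothetical.

```python
def grade(predicted: int, expected: int) -> bool:
    """Exact-match grading (assumed): only an identical integer is correct."""
    return predicted == expected


# Values taken from the grid-partitioning example.
predicted, expected = 81, 83

print(grade(predicted, expected))  # False
print(abs(predicted - expected))   # 2
```

An answer that is off by 2 is scored the same as one that is off by 200, so the accuracy figure alone does not show how close a failed attempt came.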

Secondary metrics

Readability score: 0.0
Toxicity score: 0.000
Ethics score: 0.000

Run this evaluation yourself

Stratix evaluates DeepSeek V4 Flash continuously across 11+ benchmarks. To replicate this AIME 2026 evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.

Source: Stratix evaluation 69efc9abd05530877e5d4ef1. Updated 2026-04-27.
