
Kimi K2.6 on AIME 2025: 56.7% accuracy
Author: The LayerLens Team
The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.
Summary
Kimi K2.6 from Moonshot AI scored 56.7% on AIME 2025, ranking 64th of 140 models on this benchmark and placing it in the below-frontier band. It is acceptable for cost-sensitive workloads or as one member of a multi-model ensemble, but it is not a default choice for high-stakes routing.
Model details
Provider: Moonshot AI
Model key: moonshot/kimi-k2.6
Context length: 256,000 tokens
License: Apache 2.0
Open weights: yes
Benchmark methodology
Benchmark goal: The benchmark evaluates single-shot problem-solving in advanced mathematics, with competition mathematics as the intended domain.
Scoring metrics:
Accuracy: A model receives credit for a problem only if it produces the exact integer answer (AIME answers are always integers from 0 to 999). Accuracy = (number of problems answered exactly correctly / total problems) × 100.
Analysis
Key takeaways:
Kimi K2.6 achieved an accuracy of 36.67% on the AIME 2025 benchmark, correctly solving 11 out of 30 problems.
The model performs well on problems solvable through algebraic manipulation, elementary number theory, and straightforward combinatorial counts.
Performance degrades significantly on geometry problems requiring complex spatial reasoning, problems involving advanced calculus concepts, and intricate combinatorial graph problems.
A notable strength is consistent answer formatting that follows the strict output guidelines, even when the solution itself is incorrect.
Failure modes observed
Common failure modes:
Incorrect application of geometric formulas or missing steps in complex multi-part geometry problems.
Errors in combinatorial counting, often overlooking specific constraints or miscalculating factorials/combinations.
Inability to derive or correctly interpret equations for complex functions or geometric transformations.
Arithmetic errors in the final calculation, even when the setup is mostly correct.
Example: In a question involving heptagon AFNBCEM, the model correctly identified the proportional relationships and the areas of similar triangles, but it never proceeded to compute the heptagon's area, leaving the solution incomplete.
Secondary metrics
Failed prompts: 10
Readability score: 0.0
Toxicity score: 0.000
Ethics score: 0.000
Run this evaluation yourself
Stratix evaluates Kimi K2.6 continuously across 11+ benchmarks. To replicate this AIME 2025 evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
_Source: Stratix evaluation 69e6bcec47d119d2a25bbf30. Updated 2026-04-21._