
Kimi K2.6 on AIME 2026: 43.3% accuracy
Author:
The LayerLens Team
The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.
Summary
Kimi K2.6 from Moonshot AI scored 43.3% on AIME 2026, ranking 12th of the 14 models evaluated on this benchmark and placing it in the lower half of the competitive field. The model sits above the cost-effectiveness threshold for most production workloads. Pair it with a step-level evaluation harness for agent use cases.
Model details
Provider: Moonshot AI
Model key:
moonshot/kimi-k2.6
Context length: 256,000 tokens
License: Apache 2.0
Open weights: yes
Benchmark methodology
Benchmark goal: AIME 2026 evaluates single-shot mathematical problem solving by LLMs in the domain of advanced, competition-level mathematics.
Scoring metrics:
Accuracy: Accuracy = (number of problems answered with the exact integer answer, an integer from 0 to 999) / (total problems)
Analysis
Key takeaways:
Kimi K2.6 achieved an accuracy of 43.3% on the AIME 2026 benchmark, solving 13 out of 30 problems successfully.
The model exhibited proficiency in certain algebraic and number theory problems, including those involving Diophantine equations and counting divisors.
A significant portion of failures came from generating no response, especially for challenging combinatorial and advanced geometry problems.
The performance indicates a foundational understanding of mathematical concepts but a lack of robustness in consistently tackling higher-difficulty, multi-step competition-style problems.
Failure modes observed
Common failure modes:
No response generated for certain complex problems, indicating a potential inability to start or complete the solution process.
Incorrect application of combinatorial formulas or miscounting for specific conditions.
Inability to formulate the correct geometric relationships or equations for some advanced geometry tasks.
Example: For the problem involving partitioning permutations by cycle structure, the model struggled with correctly summing over all admissible cycle types, leading to an incorrect result despite a sound initial approach. Another example is a conditional probability problem where no output was generated, indicating difficulty with state-tracking.
Secondary metrics
Failed prompts: 10
Readability score: 0.0
Toxicity score: 0.000
Ethics score: 0.000
Run this evaluation yourself
Stratix evaluates Kimi K2.6 continuously across 11+ benchmarks. To replicate this AIME 2026 evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
Source: Stratix evaluation 69e6bcec47d119d2a25bbf31. Updated 2026-04-21.