Kimi K2.6 on AIME 2025: 56.7% accuracy

Author: The LayerLens Team

The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.

Summary

Kimi K2.6 from Moonshot AI scored 56.7% on AIME 2025, placing it at rank 64 of 140 on this benchmark, in the below-frontier band. The model is acceptable for cost-sensitive workloads or as part of a multi-model ensemble, but it is not a default choice for high-stakes routing.

Model details

  • Provider: Moonshot AI

  • Model key: moonshot/kimi-k2.6

  • Context length: 256,000 tokens

  • License: Apache 2.0

  • Open weights: yes

Benchmark methodology

Benchmark goal: AIME 2025 evaluates single-shot problem-solving in advanced mathematics; the target domain is competition mathematics.

Scoring metrics:

  • Accuracy: A model receives credit for a problem only if it produces the exact integer answer, which by AIME convention is an integer from 000 to 999 written as a zero-padded three-digit number. Accuracy = (Problems Answered Exactly / Total Problems) * 100
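The scoring rule above can be sketched as a strict exact-match grader. This is an illustrative sketch, not Stratix's actual implementation; the helper names `grade` and `accuracy` are assumptions:

```python
def grade(predicted: str, expected: int) -> bool:
    """Strict exact-match grading: AIME answers are integers in [0, 999]."""
    try:
        value = int(predicted.strip())
    except ValueError:
        return False  # non-integer output earns no credit
    return 0 <= value <= 999 and value == expected

def accuracy(predictions: list[str], answers: list[int]) -> float:
    """Accuracy = (problems answered exactly / total problems) * 100."""
    correct = sum(grade(p, a) for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)

# "070" and 70 match after integer parsing; the other two earn no credit.
print(accuracy(["070", "42", "abc"], [70, 41, 5]))
```

Note that parsing to an integer before comparison means zero-padded answers like "070" still receive credit, which matches the convention of writing answers as three-digit numbers.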

Analysis

Key takeaways:

  • Kimi K2.6 achieved an accuracy of 36.67% on the AIME 2025 benchmark, correctly solving 11 out of 30 problems.

  • The model performs well on problems tractable through algebraic manipulation, elementary number theory, and straightforward combinatorial counts.

  • Performance degrades significantly on geometry problems requiring complex spatial reasoning, problems involving advanced calculus concepts, and intricate combinatorial graph problems.

  • A notable strength is that the model consistently formats its answers according to the strict output guidelines, even when the solution itself is incorrect.

Failure modes observed

Common failure modes:

  • Incorrect application of geometric formulas or missing steps in complex multi-part geometry problems.

  • Errors in combinatorial counting, often overlooking specific constraints or miscalculating factorials/combinations.

  • Inability to derive or correctly interpret equations for complex functions or geometric transformations.

  • Arithmetic errors in the final calculation, even when the setup is mostly correct.

Example: On the question involving the heptagon AFNBCEM, the model correctly identifies proportional relationships and the areas of similar triangles but never completes the final area calculation for the heptagon, leaving the solution unfinished.

Secondary metrics

  • Failed prompts: 10

  • Readability score: 0.0

  • Toxicity score: 0.000

  • Ethics score: 0.000

Run this evaluation yourself

Stratix evaluates Kimi K2.6 continuously across 11+ benchmarks. To replicate this AIME 2025 evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.

_Source: Stratix evaluation 69e6bcec47d119d2a25bbf30. Updated 2026-04-21._