
Kimi K2.6 on AIME 2026: 43.3% accuracy
Author:
The LayerLens Team
The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.
Summary
Kimi K2.6 from Moonshot AI scored 43.3% on AIME 2026, ranking 12th of the 14 models evaluated on this benchmark and placing it in the lower half of the competitive field. The model sits above the cost-effectiveness threshold for most production workloads. Pair it with a step-level evaluation harness for agent use cases.
Model details
Provider: Moonshot AI
Model key:
moonshot/kimi-k2.6
Context length: 256,000 tokens
License: Apache 2.0
Open weights: yes
Benchmark methodology
Benchmark goal: AIME 2026 evaluates single-shot mathematical problem solving by LLMs in the domain of advanced, competition-level mathematics.
Scoring metrics:
Accuracy: Accuracy = (number of problems answered with the exact integer answer, an integer from 0 to 999) / (total problems)
Analysis
Key takeaways:
Kimi K2.6 achieved an accuracy of 43.3% on the AIME 2026 benchmark, solving 13 out of 30 problems successfully.
The model exhibited proficiency in certain algebraic and number theory problems, including those involving Diophantine equations and counting divisors.
A significant portion of failures came from generating no response, especially for challenging combinatorial and advanced geometry problems.
The performance indicates a foundational understanding of mathematical concepts but a lack of robustness in consistently tackling higher-difficulty, multi-step competition-style problems.
Failure modes observed
Common failure modes:
No response generated for certain complex problems, indicating a potential inability to start or complete the solution process.
Incorrect application of combinatorial formulas or miscounting for specific conditions.
Inability to formulate the correct geometric relationships or equations for some advanced geometry tasks.
Example: For the problem involving partitioning permutations by cycle structure, the model struggled with correctly summing over all admissible cycle types, leading to an incorrect result despite a sound initial approach. Another example is a conditional probability problem where no output was generated, indicating difficulty with state-tracking.
Secondary metrics
Failed prompts: 10
Readability score: 0.0
Toxicity score: 0.000
Ethics score: 0.000
Run this evaluation yourself
Stratix evaluates Kimi K2.6 continuously across 11+ benchmarks. To replicate this AIME 2026 evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
Source: Stratix evaluation 69e6bcec47d119d2a25bbf31. Updated 2026-04-21.