
Kimi K2.6 on BIRD-CRITIC: 33.3% accuracy
Author: The LayerLens Team
The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.
Summary
Kimi K2.6 from Moonshot AI scored 33.3 on BIRD-CRITIC, ranking 26th on this benchmark. This places the model in the weak band for BIRD-CRITIC: below the threshold for production reliance on this benchmark family. Consider it only for narrow, fully tested tasks.
Model details
Provider: Moonshot AI
Model key: moonshot/kimi-k2.6
Context length: 256,000 tokens
License: Apache 2.0
Open weights: yes
Benchmark methodology
Benchmark goal: BIRD-CRITIC is designed to evaluate text-to-SQL generation capabilities, specifically focusing on a wide range of SQL operations and addressing various user interaction issues in a multi-turn conversational setting. It aims to assess how well models can translate natural language queries into accurate SQL, considering complexity and real-world applicability.
Analysis
Key takeaways:
The Kimi K2.6 model achieves an overall pass rate of 17% on the BIRD-CRITIC benchmark (17 of 100 prompts passed).
The model performs reasonably well on tasks requiring standard SQL joins, basic aggregations, and some advanced JSONB manipulations.
Significant weaknesses are observed in handling complex temporal queries, intricate recursive CTEs for graph traversal or stateful calculations, and nuanced ON CONFLICT strategies for upserts.
The model frequently struggles with prompts involving conditional logic that requires careful parsing of dependencies or multi-step reasoning.
Opportunities for improvement lie in enhancing its understanding of advanced window function applications and in refining its ability to construct correct recursive queries.
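To make the window-function pattern concrete, the sketch below (a hypothetical `orders` table, SQLite via Python's stdlib `sqlite3`) uses RANK() to select each customer's highest-value order — the kind of "one row per group" query the takeaways above refer to:

```python
import sqlite3

# In-memory database with a small, hypothetical orders table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (customer TEXT, amount REAL);
INSERT INTO orders VALUES
  ('alice', 10.0), ('alice', 40.0),
  ('bob',   25.0), ('bob',   15.0);
""")

# RANK() partitions by customer and orders by amount descending;
# filtering on rank = 1 keeps each customer's largest order.
rows = conn.execute("""
SELECT customer, amount FROM (
  SELECT customer, amount,
         RANK() OVER (PARTITION BY customer ORDER BY amount DESC) AS rnk
  FROM orders
) WHERE rnk = 1
ORDER BY customer;
""").fetchall()

print(rows)  # [('alice', 40.0), ('bob', 25.0)]
```

Note that RANK() returns ties (two equal top amounts would both pass `rnk = 1`); ROW_NUMBER() would instead pick exactly one row per partition.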
Failure modes observed
Common failure modes:
Incorrect logic in recursive CTEs: Many recursive queries for hierarchical data or cumulative calculations produced incomplete or erroneous results.
Misapplication of ON CONFLICT clauses: The model frequently failed to use ON CONFLICT and DO UPDATE/DO NOTHING correctly.
Issues with date and time-based filtering/aggregation: Several prompts involving date arithmetic resulted in incorrect filtering or aggregation.
Suboptimal JSONB pathing and manipulation: Complex scenarios involving conditional hashing of nested arrays occasionally resulted in incorrect outputs.
Failure to correctly use DISTINCT ON or RANK() for unique grouping/ordering.
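For reference, the two most-cited failure modes above — recursive CTEs and ON CONFLICT upserts — look like this when written correctly. The example uses a hypothetical `employees` hierarchy in SQLite (via Python's stdlib `sqlite3`); it is an illustrative sketch, not taken from the benchmark's prompts:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER);
INSERT INTO employees VALUES
  (1, 'ceo', NULL), (2, 'vp', 1), (3, 'eng', 2), (4, 'intern', 3);
""")

# Recursive CTE: anchor on the root (manager_id IS NULL), then repeatedly
# join children onto the rows found so far, tracking depth as we go.
chain = conn.execute("""
WITH RECURSIVE chain(id, name, depth) AS (
  SELECT id, name, 0 FROM employees WHERE manager_id IS NULL
  UNION ALL
  SELECT e.id, e.name, c.depth + 1
  FROM employees e JOIN chain c ON e.manager_id = c.id
)
SELECT name, depth FROM chain ORDER BY depth;
""").fetchall()
print(chain)  # [('ceo', 0), ('vp', 1), ('eng', 2), ('intern', 3)]

# Upsert: ON CONFLICT on the primary key turns a failing INSERT into an
# UPDATE of the existing row, reading the new values from `excluded`.
conn.execute("""
INSERT INTO employees (id, name, manager_id) VALUES (4, 'engineer-2', 3)
ON CONFLICT(id) DO UPDATE SET name = excluded.name;
""")
print(conn.execute("SELECT name FROM employees WHERE id = 4").fetchone())
# ('engineer-2',)
```

The common mistakes the benchmark surfaces map directly onto this sketch: omitting the anchor term (or the join back to the CTE) in the recursive part, and using DO NOTHING where the task requires DO UPDATE.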
Secondary metrics
Readability score: 0.0
Toxicity score: 0.000
Ethics score: 0.000
Run this evaluation yourself
Stratix evaluates Kimi K2.6 continuously across 11+ benchmarks. To replicate this BIRD-CRITIC evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
Source: Stratix evaluation 69e78c5fa232fc11cd5f8df5. Updated 2026-04-22.