DeepSeek V4 Flash on BIRD-CRITIC: 32.7% accuracy

Author: The LayerLens Team
Last updated: May 13, 2026
Published: Apr 6, 2026

The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.

Summary

DeepSeek V4 Flash from DeepSeek scored 32.7% on BIRD-CRITIC, placing it 4th of 25 on this benchmark. Even at that rank, the score falls in the weak band for BIRD-CRITIC and is below the threshold for production reliance on this benchmark family. Consider the model only for narrow, fully tested tasks.

Model details

Provider: DeepSeek
Model key: deepseek/deepseek-v4-flash
Context length: 1,048,576 tokens
License: MIT
Open weights: yes

Benchmark methodology

Benchmark goal: BIRD-CRITIC evaluates models on text-to-SQL tasks, covering a comprehensive range of SQL operations and the resolution of user issues in multi-turn interactions.
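To make an accuracy figure on a text-to-SQL benchmark more concrete, here is a minimal sketch of one common scoring idea: run the model's SQL and a reference SQL against the same database and compare the returned rows. It is an illustration only, not the BIRD-CRITIC or Stratix harness. The helper names `rows_match` and `accuracy`, the `orders` table, and the use of SQLite (chosen because it ships with Python) are all assumptions.

```python
import sqlite3
from collections import Counter


def rows_match(conn, predicted_sql, reference_sql):
    """Return True when both queries yield the same multiset of rows.

    Hypothetical helper: row order is ignored, duplicates are respected.
    """
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        # A predicted query that fails to run is counted as incorrect.
        return False
    reference = conn.execute(reference_sql).fetchall()
    return Counter(predicted) == Counter(reference)


def accuracy(results):
    """Percentage of tasks marked correct, rounded to one decimal place."""
    return round(100.0 * sum(results) / len(results), 1)


def demo():
    # Toy schema and data, invented for this example.
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL);
        INSERT INTO orders VALUES (1, 'ada', 10.0), (2, 'ada', 5.0), (3, 'bob', 7.5);
        """
    )
    reference = "SELECT customer, SUM(total) FROM orders GROUP BY customer"
    candidates = [
        # Same rows in a different order: counted as correct.
        "SELECT customer, SUM(total) FROM orders GROUP BY customer ORDER BY customer DESC",
        # Wrong aggregate: runs, but returns different rows.
        "SELECT customer, MAX(total) FROM orders GROUP BY customer",
        # Misspelled column: fails to execute.
        "SELECT customer, SUM(totl) FROM orders GROUP BY customer",
    ]
    return [rows_match(conn, sql, reference) for sql in candidates]


if __name__ == "__main__":
    results = demo()
    print(results, accuracy(results))
```

Under this scheme a single headline number such as 32.7% is simply the share of tasks whose output passed the check, which is why the failure categories later in this report matter more than the aggregate for deciding where the model can be trusted.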

Analysis

Key takeaways:

DeepSeek V4 Flash demonstrates a solid foundation in generating standard SQL queries for common data manipulation and retrieval tasks.
The model struggles significantly with tasks involving complex temporal logic, advanced window functions, and intricate set-based operations.
Performance is notably weaker in scenarios requiring dynamic schema interaction or precise control over unique constraints and triggers.
While capable of basic array and JSONB operations, advanced transformations within these structures often result in errors.
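As an illustration of the kind of temporal logic that is easy to get wrong, the sketch below contrasts a closed `BETWEEN` range with a half-open range when filtering timestamps for one month. It is not taken from the evaluation traces. The `events` table and its rows are invented, and SQLite is used only because it ships with Python; the benchmark's own SQL dialects may behave differently in detail.

```python
import sqlite3


def march_counts():
    """Count March 2026 events with a closed range and a half-open range."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE events (id INTEGER PRIMARY KEY, ts TEXT);
        INSERT INTO events (ts) VALUES
            ('2026-03-01 00:00:00'),
            ('2026-03-15 09:30:00'),
            ('2026-03-31 14:00:00'),
            ('2026-04-01 00:00:00');
        """
    )
    # Closed range: the upper bound '2026-03-31' sorts before any
    # timestamp later on that day, so the 14:00 event is dropped.
    closed = conn.execute(
        "SELECT COUNT(*) FROM events WHERE ts BETWEEN '2026-03-01' AND '2026-03-31'"
    ).fetchone()[0]
    # Half-open range: include the start, exclude the first instant of April.
    half_open = conn.execute(
        "SELECT COUNT(*) FROM events WHERE ts >= '2026-03-01' AND ts < '2026-04-01'"
    ).fetchone()[0]
    return closed, half_open


if __name__ == "__main__":
    print(march_counts())  # (2, 3)
```

Both queries run without error, so a mistake of this kind only surfaces when results are compared against expected rows.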

Failure modes observed

Common failure modes:

Incorrect window function usage: Many errors stem from misapplying ROW_NUMBER(), LAG(), LEAD(), or custom aggregates within complex partitioning schemes.
Temporal logic errors: Miscalculations or incorrect filtering when dealing with dates and times.
JSON/JSONB manipulation issues: Errors arise in handling complex nested JSONB structures.
Constraint and trigger logic: Functions and triggers often fail due to unique constraint violations.
Subquery and CTE misuse: Subqueries or CTEs are structured in ways that produce incorrect results.
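The window function category above can be illustrated with a small, self-contained sketch. It shows how `LAG()` gives a wrong value at group boundaries when `PARTITION BY` is omitted. The `sales` table, its rows, and the `day_over_day` helper are hypothetical and do not come from the evaluation. SQLite is used for convenience and needs version 3.25 or later for window functions.

```python
import sqlite3


def day_over_day(partitioned):
    """Return (store, day, delta) rows, with or without PARTITION BY."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE sales (store TEXT, day TEXT, amount INTEGER);
        INSERT INTO sales VALUES
            ('north', '2026-01-01', 100),
            ('north', '2026-01-02', 120),
            ('south', '2026-01-01', 50),
            ('south', '2026-01-02', 40);
        """
    )
    # Without PARTITION BY, LAG() reads the previous row of the whole
    # result set, so the first 'south' row is compared to a 'north' row.
    window = "PARTITION BY store ORDER BY day" if partitioned else "ORDER BY store, day"
    sql = f"""
        SELECT store, day, amount - LAG(amount) OVER ({window}) AS delta
        FROM sales
        ORDER BY store, day
    """
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    for row in day_over_day(partitioned=False):
        print("unpartitioned", row)
    for row in day_over_day(partitioned=True):
        print("partitioned  ", row)
```

In the unpartitioned variant the first `south` row reports a delta of -70 (50 minus the last `north` amount of 120), whereas the partitioned variant correctly reports no previous value for that row.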

Secondary metrics

Readability score: 0.0
Toxicity score: 0.000
Ethics score: 0.000

Run this evaluation yourself

Stratix evaluates DeepSeek V4 Flash continuously across 11+ benchmarks. To replicate this BIRD-CRITIC evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.

Source: Stratix evaluation 69efc9abd05530877e5d4eeb. Updated 2026-04-27.
