Claude Opus 4.6 on BIRD-CRITIC: 34.0% accuracy

Author: The LayerLens Team

The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.

Summary

Claude Opus 4.6 from Anthropic scored 34.0% on BIRD-CRITIC, placing it second of 25 models on this benchmark. Despite the high relative ranking, the absolute score falls in the weak band for BIRD-CRITIC: it is below the threshold for production reliance on this benchmark family, so consider the model only for narrow, fully tested tasks.

Model details

  • Provider: Anthropic

  • Model key: anthropic/claude-opus-4.6

  • Context length: 200,000 tokens

  • License: Proprietary

  • Open weights: no

Benchmark methodology

Benchmark goal: BIRD-CRITIC evaluates SQL issue debugging. Given a user-reported problem and the underlying database, the model must diagnose the issue and produce a corrected SQL solution, covering tasks that go well beyond simple text-to-SQL generation.

Scoring metrics:

  • Success rate: A model's fix counts as correct when it passes execution-based test cases, comparing its results against the reference solution on the task's database.

Analysis

Key takeaways:

  • Claude Opus 4.6 shows reasonable proficiency in generating SQL queries for standard database operations.

  • The model performs well on tasks requiring common SQL constructs like joins, basic aggregations, and simple data manipulations.

  • Performance degrades significantly when faced with complex logical requirements, advanced window functions, recursive queries, or intricate JSONB manipulations.

  • The model often provides syntactically correct SQL but fails to capture the nuanced logic or efficiency needed for more challenging problems.
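The degradation on window functions is easiest to see with a concrete task. The sketch below uses Python's built-in sqlite3 module for illustration; the table and column names are hypothetical, not drawn from the benchmark. It shows the kind of per-group ranking query where partitioning matters:

```python
import sqlite3

# In-memory database with a hypothetical orders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 50.0), ("alice", 120.0), ("bob", 80.0), ("bob", 30.0)],
)

# Rank each customer's orders by amount, largest first. A common
# model error on such tasks is ranking globally instead of
# partitioning per customer.
rows = conn.execute(
    """
    SELECT customer, amount,
           RANK() OVER (PARTITION BY customer ORDER BY amount DESC) AS rnk
    FROM orders
    ORDER BY customer, rnk
    """
).fetchall()

print(rows)
# [('alice', 120.0, 1), ('alice', 50.0, 2), ('bob', 80.0, 1), ('bob', 30.0, 2)]
```

Dropping the PARTITION BY clause would rank all four rows against each other, which is exactly the kind of subtly wrong but syntactically valid output the benchmark penalizes. (Window functions require SQLite 3.25 or newer.)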

Failure modes observed

Common failure modes:

  • Incorrect application of window functions for complex aggregation or sequential processing.

  • Logical errors in complex join conditions or subqueries, leading to incorrect filtering or data duplication.

  • Misinterpretation of problem constraints or specific data handling requirements, such as unique constraints in ON CONFLICT clauses or detailed date/time manipulations.

  • Some solutions were overly complex or inefficient where simpler, more direct SQL approaches would have sufficed.
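The ON CONFLICT failure mode comes down to upsert semantics: which conflict target is named, and what the update clause computes. A minimal sketch with sqlite3 (the inventory table here is hypothetical) shows the correct pattern:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (sku TEXT PRIMARY KEY, qty INTEGER)")
conn.execute("INSERT INTO inventory VALUES ('widget', 5)")

# Upsert: the conflict target must name the unique column (sku), and
# excluded.qty refers to the row that failed to insert. Typical model
# errors are omitting the conflict target or overwriting qty instead
# of accumulating it.
conn.execute(
    """
    INSERT INTO inventory (sku, qty) VALUES ('widget', 3)
    ON CONFLICT (sku) DO UPDATE SET qty = qty + excluded.qty
    """
)

qty = conn.execute("SELECT qty FROM inventory WHERE sku = 'widget'").fetchone()[0]
print(qty)  # 8: the existing 5 plus the conflicting insert's 3
```

Writing `SET qty = excluded.qty` instead would silently replace the stock count rather than add to it, a logic error that executes without complaint. (Upsert syntax requires SQLite 3.24 or newer; PostgreSQL's ON CONFLICT behaves analogously.)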

Secondary metrics

  • Readability score: 0.0

  • Toxicity score: 0.000

  • Ethics score: 0.000

Run this evaluation yourself

Stratix evaluates Claude Opus 4.6 continuously across 11+ benchmarks. To replicate this BIRD-CRITIC evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.

Source: Stratix evaluation 69863670d5968947e86ac1d4. Updated 2026-02-06.