
Claude Opus 4.7 on BIRD-CRITIC: 36.3% accuracy
Author: The LayerLens Team
The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.
Summary
Claude Opus 4.7 from Anthropic scored 36.3% on BIRD-CRITIC, placing it first of 25 models evaluated on this benchmark. Even so, the score falls in the weak band for BIRD-CRITIC, below the threshold for production reliance on this benchmark family; consider the model only for narrow, fully tested tasks.
Model details
Provider: Anthropic
Model key: anthropic/claude-opus-4.7
Context length: 1,000,000 tokens
License: Proprietary
Open weights: no
Benchmark methodology
Benchmark goal: The benchmark is designed to evaluate text-to-SQL capabilities, specifically focusing on a wide range of SQL operations and user issues.
Scoring metrics:
score: Binary metric: 1 if the generated SQL is correct, 0 otherwise.
Analysis
Key takeaways:
Claude Opus 4.7 demonstrates a reasonable understanding of complex SQL generation but falls short in nuanced conflict resolution and advanced dynamic SQL scenarios.
The model excels when explicit SQL constructs are clearly derivable from the request structure but struggles when implicit logic or state management within SQL is required.
Further improvements are needed in handling intricate edge cases of ON CONFLICT clauses, dynamic schema interactions, and precise temporal calculations.
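One of the edge cases named above, ON CONFLICT upsert logic, can be illustrated concretely. BIRD-CRITIC targets real database engines, and the ON CONFLICT failure mode suggests PostgreSQL dialect tasks; the sketch below instead uses Python's built-in sqlite3 module, which accepts the same `ON CONFLICT ... DO UPDATE` syntax, purely for illustration. The `stock` table and its columns are invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE stock (sku TEXT PRIMARY KEY, qty INTEGER)")

# An upsert: the second INSERT hits the PRIMARY KEY conflict, so the
# DO UPDATE branch runs and adds the incoming quantity (excluded.qty)
# to the existing row instead of raising an error.
upsert = """
    INSERT INTO stock (sku, qty) VALUES (?, ?)
    ON CONFLICT (sku) DO UPDATE SET qty = qty + excluded.qty
"""
cur.execute(upsert, ("A1", 5))
cur.execute(upsert, ("A1", 3))
qty = cur.execute("SELECT qty FROM stock WHERE sku = 'A1'").fetchone()[0]
print(qty)  # 8
```

The subtlety the report points to is exactly this kind of detail: inside the `DO UPDATE SET` clause, the bare column name refers to the existing row while `excluded.qty` refers to the rejected insert, and confusing the two produces queries that run but compute the wrong result.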
Failure modes observed
Common failure modes:
Incorrectly applying ON CONFLICT logic, especially with DO UPDATE or RETURNING clauses, leading to syntax errors or unexpected behavior.
Issues with nested JSONB queries, particularly when filtering based on array elements or when requiring complex transformations.
Problems with advanced window functions, such as first_value for forward fill or ROW_NUMBER in complex ranking/grouping scenarios.
Errors in date and time manipulation, particularly when calculating differences while excluding specific days or handling arbitrary time intervals.
Misinterpretation of problem constraints, leading to queries that do not fully address the user's intent or produce suboptimal results.
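The window-function failure mode above mentions `first_value` used for forward fill. A minimal sketch of that pattern, again using Python's sqlite3 for a self-contained demonstration (the `readings` table is invented): a running `COUNT(val)` ignores NULLs, so it forms a group id that only advances at non-NULL rows, and `first_value` within each group recovers the most recent non-NULL reading.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE readings (t INTEGER PRIMARY KEY, val REAL)")
cur.executemany("INSERT INTO readings VALUES (?, ?)",
                [(1, 10.0), (2, None), (3, None), (4, 40.0)])

# COUNT(val) OVER (ORDER BY t) skips NULLs, so its running value is a
# group id that increments only at non-NULL rows; first_value() within
# each group then carries that value forward over the NULL gaps.
rows = cur.execute("""
    SELECT t, first_value(val) OVER (PARTITION BY grp ORDER BY t) AS filled
    FROM (SELECT t, val, COUNT(val) OVER (ORDER BY t) AS grp FROM readings)
    ORDER BY t
""").fetchall()
print(rows)  # [(1, 10.0), (2, 10.0), (3, 10.0), (4, 40.0)]
```

The two-step structure (inner query to build the group id, outer window over that group) is the kind of implicit state management the analysis says the model struggles to derive when it is not spelled out in the request.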
Secondary metrics
Readability score: 0.0
Toxicity score: 0.000
Ethics score: 0.000
Run this evaluation yourself
Stratix evaluates Claude Opus 4.7 continuously across 11+ benchmarks. To replicate this BIRD-CRITIC evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
Source: Stratix evaluation 69e8df98706ad159e41c4948. Updated 2026-04-22.