Claude Opus 4.7 on BIRD-CRITIC: 36.3% accuracy

Author: The LayerLens Team


The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.

Summary

Claude Opus 4.7 from Anthropic scored 36.3% on BIRD-CRITIC, placing it first of the 25 models evaluated on this benchmark. Even so, the score falls in the weak band for BIRD-CRITIC, below the threshold for production reliance on this benchmark family; consider the model only for narrow, fully tested tasks.

Model details

  • Provider: Anthropic

  • Model key: anthropic/claude-opus-4.7

  • Context length: 1,000,000 tokens

  • License: Proprietary

  • Open weights: no

Benchmark methodology

Benchmark goal: The benchmark is designed to evaluate text-to-SQL capabilities, specifically focusing on a wide range of SQL operations and user issues.

Scoring metrics:

  • score: Binary metric: 1 if the generated SQL is correct, 0 otherwise.

Analysis

Key takeaways:

  • Claude Opus 4.7 demonstrates a reasonable understanding of complex SQL generation but falls short in nuanced conflict resolution and advanced dynamic SQL scenarios.

  • The model excels when explicit SQL constructs are clearly derivable from the request structure but struggles when implicit logic or state management within SQL is required.

  • Further improvements are needed in handling intricate edge cases of ON CONFLICT clauses, dynamic schema interactions, and precise temporal calculations.
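The ON CONFLICT edge cases mentioned above can be made concrete with a minimal upsert sketch. This uses Python's sqlite3 purely to keep the example self-contained (the benchmark tasks target full database dialects, so details vary); the explicit conflict target and the `excluded` pseudo-table are the pieces that such queries must get right. The table and values are hypothetical.

```python
import sqlite3

# In-memory database for a self-contained illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (model TEXT PRIMARY KEY, score REAL)")
conn.execute("INSERT INTO scores VALUES ('claude-opus-4.7', 30.0)")

# A correct upsert names the conflict target and references the incoming
# row through the `excluded` pseudo-table (same construct in PostgreSQL).
conn.execute(
    """
    INSERT INTO scores (model, score) VALUES ('claude-opus-4.7', 36.3)
    ON CONFLICT (model) DO UPDATE SET score = excluded.score
    """
)
row = conn.execute(
    "SELECT score FROM scores WHERE model = 'claude-opus-4.7'"
).fetchone()
print(row[0])  # → 36.3
```

Omitting the conflict target, or writing `SET score = score`, are the kinds of small slips that turn into syntax errors or silently wrong updates.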

Failure modes observed

Common failure modes:

  • Incorrectly applying ON CONFLICT logic, especially with DO UPDATE or RETURNING clauses, leading to syntax errors or unexpected behavior.

  • Issues with nested JSONB queries, particularly when filtering based on array elements or when requiring complex transformations.

  • Problems with advanced window functions, such as first_value for forward fill or ROW_NUMBER in complex ranking/grouping scenarios.

  • Errors in date and time manipulation, particularly when calculating differences while excluding specific days or handling arbitrary time intervals.

  • Misinterpretation of problem constraints, leading to queries that do not fully address the user's intent or produce suboptimal results.
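The JSON-array filtering pitfall in the list above can be illustrated with a minimal sketch: filtering on an element of a JSON array requires expanding the array with a table-valued function rather than comparing the serialized text. SQLite's `json_each` stands in here for the benchmark's actual dialects; the `events` table and its payloads are hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO events (payload) VALUES (?)",
    [
        (json.dumps({"tags": ["sql", "json"]}),),
        (json.dumps({"tags": ["window"]}),),
    ],
)

# Expand the nested array with json_each, then filter on its elements;
# a naive `payload LIKE '%json%'` comparison would be fragile.
rows = conn.execute(
    """
    SELECT e.id
    FROM events e, json_each(e.payload, '$.tags') t
    WHERE t.value = 'json'
    """
).fetchall()
print(rows)  # → [(1,)]
```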
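The window-function failure mode can likewise be sketched with a top-1-per-group query, a common ranking pattern: because window functions cannot appear in WHERE, the `ROW_NUMBER` ranking must live in a subquery that an outer query then filters. The data below is hypothetical; sqlite3 is used only for portability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (model TEXT, benchmark TEXT, score REAL)")
conn.executemany(
    "INSERT INTO runs VALUES (?, ?, ?)",
    [
        ("a", "bird", 36.3),
        ("b", "bird", 30.1),
        ("a", "mmlu", 80.0),
        ("b", "mmlu", 85.0),
    ],
)

# Rank rows within each benchmark, then keep only the top row per group.
best = conn.execute(
    """
    SELECT benchmark, model FROM (
        SELECT benchmark, model,
               ROW_NUMBER() OVER (
                   PARTITION BY benchmark ORDER BY score DESC
               ) AS rn
        FROM runs
    )
    WHERE rn = 1
    ORDER BY benchmark
    """
).fetchall()
print(best)  # → [('bird', 'a'), ('mmlu', 'b')]
```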
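Finally, "calculating differences while excluding specific days" can be sketched as a weekday count built from a recursive CTE: generate each day in the range, then drop Saturdays and Sundays. The date range is hypothetical, and sqlite3 is again used only to keep the sketch self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Generate every date in the range with a recursive CTE, then exclude
# Sunday ('0') and Saturday ('6') as reported by strftime('%w').
n = conn.execute(
    """
    WITH RECURSIVE days(d) AS (
        SELECT date('2026-04-13')
        UNION ALL
        SELECT date(d, '+1 day') FROM days WHERE d < date('2026-04-19')
    )
    SELECT count(*) FROM days WHERE strftime('%w', d) NOT IN ('0', '6')
    """
).fetchone()[0]
print(n)  # Mon 2026-04-13 through Sun 2026-04-19 → 5 weekdays
```

A naive `julianday(end) - julianday(start)` difference would count all seven days, which is exactly the class of off-by-weekend error the report describes.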

Secondary metrics

  • Readability score: 0.0

  • Toxicity score: 0.000

  • Ethics score: 0.000

Run this evaluation yourself

Stratix evaluates Claude Opus 4.7 continuously across 11+ benchmarks. To replicate this BIRD-CRITIC evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.

Source: Stratix evaluation 69e8df98706ad159e41c4948. Updated 2026-04-22.