
Claude Opus 4.1 on LiveCodeBench: 62.8% accuracy
Author: The LayerLens Team
The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.
Summary
Claude Opus 4.1 from Anthropic scored 62.8% on LiveCodeBench, ranking 8th of 43 models and placing it in the top 10 for this benchmark. This puts the model in the competitive band for LiveCodeBench and above the cost-effective threshold for most production workloads. Pair it with a step-level evaluation harness for agent use cases.
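As a rough illustration of what a step-level evaluation harness for agent use cases could look like, the sketch below scores each agent step individually and aggregates a per-step pass rate. The class and method names are assumptions made for illustration only; this is not a Stratix or LiveCodeBench API.

```python
# Hypothetical sketch of a step-level evaluation harness for an agent loop;
# names and structure are assumptions, not a Stratix or Anthropic API.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class StepRecord:
    action: str
    passed: bool

@dataclass
class StepLevelHarness:
    check: Callable[[str], bool]                      # validates one agent step
    records: List[StepRecord] = field(default_factory=list)

    def record(self, action: str) -> bool:
        """Score a single step and keep it for later aggregation."""
        ok = self.check(action)
        self.records.append(StepRecord(action, ok))
        return ok

    def step_pass_rate(self) -> float:
        """Fraction of recorded steps that passed the check."""
        if not self.records:
            return 0.0
        return sum(r.passed for r in self.records) / len(self.records)

# Toy usage: flag any step whose action string is empty.
harness = StepLevelHarness(check=lambda action: bool(action.strip()))
harness.record("run_tests(solution.py)")
harness.record("")
print(harness.step_pass_rate())  # 0.5
```

In practice the per-step check would be replaced by whatever validation the agent workflow needs (tool-call schema checks, unit tests, rubric scoring); the point is that each step is scored individually rather than only the final answer.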
Model details
Provider: Anthropic
Model key: anthropic/claude-opus-4.1
Context length: 200,000 tokens
License: Proprietary
Open weights: no
Benchmark methodology
Benchmark goal: Evaluate the LLM's ability to understand and generate correct code for programming problems, focusing on logic, syntax, and problem-solving skills.
Scoring metrics:
Pass Rate: Percentage of problems for which the LLM generates a correct program that passes all test cases (a minimal scoring sketch follows this list).
Execution Time: Time taken by the LLM to generate the code, reflecting efficiency.
Code Readability: Qualitative assessment of the generated code based on clarity, structure, and use of meaningful variable names.
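To make the pass-rate metric concrete, the sketch below runs each generated program against its test suite and counts a problem as solved only when every case passes. This is a minimal, hypothetical harness; the function names and data layout are assumptions, not the actual LiveCodeBench implementation.

```python
# Hypothetical sketch of pass-rate scoring; not the actual LiveCodeBench harness.
from typing import Any, Callable, List, Tuple

def passes_all_cases(program: Callable[..., Any],
                     test_cases: List[Tuple[tuple, Any]]) -> bool:
    """Return True only if the program matches the expected output on every case."""
    for args, expected in test_cases:
        try:
            if program(*args) != expected:
                return False
        except Exception:
            # Runtime errors (e.g. division by zero) count as failures.
            return False
    return True

def pass_rate(programs: List[Callable[..., Any]],
              suites: List[List[Tuple[tuple, Any]]]) -> float:
    """Fraction of problems whose generated program passes all of its test cases."""
    if not programs:
        return 0.0
    solved = sum(passes_all_cases(p, s) for p, s in zip(programs, suites))
    return solved / len(programs)

# Toy usage: one correct and one buggy "generated" program.
correct = lambda a, b: a + b
buggy = lambda a, b: a - b
suite = [((1, 2), 3), ((0, 0), 0)]
print(pass_rate([correct, buggy], [suite, suite]))  # 0.5
```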
Analysis
Key takeaways:
Claude Opus 4.1 demonstrates solid code-completion capability, but its performance is bottlenecked by problem-solving ability.
Generated code is often syntactically correct but may contain implementation flaws or solve the problem only partially.
Further improvement is needed on problems that require complex mathematical insight.
Failure modes observed
Common failure modes:
Incorrect logic in implementation, leading to failing test cases.
Inability to handle edge cases, particularly division by zero or invalid array access (a minimal illustration follows the example below).
Errors in grouping/categorizing similar elements.
Example: In the Slavic birthday present problem, the model correctly identifies the need to add 1 to the smallest digit but fails to generalize that approach to all cases required for an optimal answer. Similarly, in the Black and White Cells problem, the model failed to construct a valid covering using cells of size k.
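For a concrete picture of the edge-case failure mode, the sketch below contrasts a naive implementation that crashes on empty input with a guarded version. It is an illustrative example, not code produced by the model during this evaluation.

```python
# Illustrative example of the edge-case failure mode; this is not model output
# from the evaluation, only a minimal demonstration of the pattern.
from typing import List

def average_naive(values: List[float]) -> float:
    # Raises ZeroDivisionError on an empty list -- the kind of unhandled
    # edge case that leads to failing test cases.
    return sum(values) / len(values)

def average_guarded(values: List[float]) -> float:
    # Handles the empty-input edge case explicitly before dividing.
    if not values:
        return 0.0
    return sum(values) / len(values)

assert average_guarded([]) == 0.0
assert average_guarded([2.0, 4.0]) == 3.0
```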
Secondary metrics
Readability score: 0.0
Toxicity score: 0.000
Ethics score: 0.000
Run this evaluation yourself
Stratix evaluates Claude Opus 4.1 continuously across 11+ benchmarks. To replicate this LiveCodeBench evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
Source: Stratix evaluation 690946bd228531518d5fcaf6. Updated 2025-11-04.