
Claude Opus 4.1 on LiveCodeBench: 62.8% accuracy
Author: The LayerLens Team
The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.
Summary
Claude Opus 4.1 from Anthropic scored 62.8% on LiveCodeBench, ranking 8th of 43 models and placing it in the top 10 for this benchmark. This puts the model in the competitive band for LiveCodeBench and above the cost-effective threshold for most production workloads. Pair it with a step-level evaluation harness for agent use cases.
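As a rough illustration of what a step-level evaluation harness for agent use cases could look like, the sketch below scores each agent step individually and aggregates a per-step pass rate. The class and method names are assumptions made for illustration only; this is not a Stratix or LiveCodeBench API.

```python
# Hypothetical sketch of a step-level evaluation harness for an agent loop;
# names and structure are assumptions, not a Stratix or Anthropic API.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class StepRecord:
    action: str
    passed: bool

@dataclass
class StepLevelHarness:
    check: Callable[[str], bool]                      # validates one agent step
    records: List[StepRecord] = field(default_factory=list)

    def record(self, action: str) -> bool:
        """Score a single step and keep it for later aggregation."""
        ok = self.check(action)
        self.records.append(StepRecord(action, ok))
        return ok

    def step_pass_rate(self) -> float:
        """Fraction of recorded steps that passed the check."""
        if not self.records:
            return 0.0
        return sum(r.passed for r in self.records) / len(self.records)

# Toy usage: flag any step whose action string is empty.
harness = StepLevelHarness(check=lambda action: bool(action.strip()))
harness.record("run_tests(solution.py)")
harness.record("")
print(harness.step_pass_rate())  # 0.5
```

In practice the per-step check would be replaced by whatever validation the agent workflow needs (tool-call schema checks, unit tests, rubric scoring); the point is that each step is scored individually rather than only the final answer.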
Model details
Provider: Anthropic
Model key: anthropic/claude-opus-4.1
Context length: 200,000 tokens
License: Proprietary
Open weights: no
Benchmark methodology
Benchmark goal: Evaluate the LLM's ability to understand and generate correct code for programming problems, focusing on logic, syntax, and problem-solving skills.
Scoring metrics:
Pass Rate: Percentage of problems for which the LLM generates a correct program that passes all test cases (a minimal scoring sketch follows this list).
Execution Time: Time taken by the LLM to generate the code, reflecting efficiency.
Code Readability: Qualitative assessment of the generated code based on clarity, structure, and use of meaningful variable names.
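To make the pass-rate metric concrete, the sketch below runs each generated program against its test suite and counts a problem as solved only when every case passes. This is a minimal, hypothetical harness; the function names and data layout are assumptions, not the actual LiveCodeBench implementation.

```python
# Hypothetical sketch of pass-rate scoring; not the actual LiveCodeBench harness.
from typing import Any, Callable, List, Tuple

def passes_all_cases(program: Callable[..., Any],
                     test_cases: List[Tuple[tuple, Any]]) -> bool:
    """Return True only if the program matches the expected output on every case."""
    for args, expected in test_cases:
        try:
            if program(*args) != expected:
                return False
        except Exception:
            # Runtime errors (e.g. division by zero) count as failures.
            return False
    return True

def pass_rate(programs: List[Callable[..., Any]],
              suites: List[List[Tuple[tuple, Any]]]) -> float:
    """Fraction of problems whose generated program passes all of its test cases."""
    if not programs:
        return 0.0
    solved = sum(passes_all_cases(p, s) for p, s in zip(programs, suites))
    return solved / len(programs)

# Toy usage: one correct and one buggy "generated" program.
correct = lambda a, b: a + b
buggy = lambda a, b: a - b
suite = [((1, 2), 3), ((0, 0), 0)]
print(pass_rate([correct, buggy], [suite, suite]))  # 0.5
```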
Analysis
Key takeaways:
Claude Opus 4.1 demonstrates solid code-completion capability, but its performance is bottlenecked by problem-solving ability.
Generated code is often syntactically correct but may contain implementation flaws or solve the problem only partially.
Further improvement is needed on problems that require complex mathematical insight.
Failure modes observed
Common failure modes:
Incorrect logic in implementation, leading to failing test cases.
Inability to handle edge cases, particularly division by zero or invalid array access (a minimal illustration follows the example below).
Errors in grouping/categorizing similar elements.
Example: In the Slavic birthday present problem, the model correctly identifies the need to add 1 to the smallest digit but fails to generalize that approach to all cases required for an optimal answer. Similarly, in the Black and White Cells problem, the model failed to construct a valid covering using cells of size k.
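For a concrete picture of the edge-case failure mode, the sketch below contrasts a naive implementation that crashes on empty input with a guarded version. It is an illustrative example, not code produced by the model during this evaluation.

```python
# Illustrative example of the edge-case failure mode; this is not model output
# from the evaluation, only a minimal demonstration of the pattern.
from typing import List

def average_naive(values: List[float]) -> float:
    # Raises ZeroDivisionError on an empty list -- the kind of unhandled
    # edge case that leads to failing test cases.
    return sum(values) / len(values)

def average_guarded(values: List[float]) -> float:
    # Handles the empty-input edge case explicitly before dividing.
    if not values:
        return 0.0
    return sum(values) / len(values)

assert average_guarded([]) == 0.0
assert average_guarded([2.0, 4.0]) == 3.0
```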
Secondary metrics
Readability score: 0.0
Toxicity score: 0.000
Ethics score: 0.000
Run this evaluation yourself
Stratix evaluates Claude Opus 4.1 continuously across 11+ benchmarks. To replicate this LiveCodeBench evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
Source: Stratix evaluation 690946bd228531518d5fcaf6. Updated 2025-11-04.