
Claude Opus 4.5 on LiveCodeBench: 76.8% accuracy
Author: The LayerLens Team
The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.
Summary
Claude Opus 4.5 from Anthropic scored 76.8% on LiveCodeBench, placing it third of 43 models on this benchmark. That result puts the model in the high-tier band for LiveCodeBench: production-deployable on this benchmark family, with margin for prompt and judge variance.
Model details
Provider: Anthropic
Model key: anthropic/claude-opus-4.5
Context length: 200,000 tokens
License: Proprietary
Open weights: no
Benchmark methodology
Benchmark goal: Evaluate the LLM's ability to solve competitive programming problems, requiring logical reasoning, algorithm design, and coding skills.
Scoring metrics:
Pass Rate: Percentage of problems for which the LLM generated code that passed all test cases.
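As a sketch of how this metric is defined (the function and data layout below are illustrative, not the actual Stratix pipeline), a pass rate over per-problem test outcomes could be computed like this:

```python
def pass_rate(results):
    """Compute a pass rate from per-problem test outcomes.

    `results` maps problem IDs to lists of booleans, one per test case.
    A problem counts as solved only if every one of its test cases passed.
    Returns a percentage in [0, 100].
    """
    if not results:
        return 0.0
    solved = sum(all(cases) for cases in results.values())
    return 100.0 * solved / len(results)

# Example: 2 of 3 problems fully solved.
rate = pass_rate({
    "p1": [True, True, True],   # all tests pass -> solved
    "p2": [True, False],        # one test fails -> not solved
    "p3": [True],               # solved
})
```

Note that a problem with nine of ten tests passing contributes nothing: the metric is all-or-nothing per problem, which is why small logic errors on edge cases are costly on this benchmark.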
Analysis
Key takeaways:
Claude Opus 4.5 exhibits strong performance on problems that can be solved with straightforward greedy, dynamic programming, or simulation approaches without overly complex edge-case interactions.
The model struggles with problems requiring highly optimized algorithms for very large constraints (e.g., N > 10^5) or those requiring deep theoretical insights.
Its ability to self-correct during problem-solving is evident in many successful completions, but the thought process sometimes goes astray when encountering novel or highly abstract problem types.
The model's performance indicates a need for improved complex combinatorial and number-theoretic reasoning.
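To make the first takeaway concrete, here is the kind of "straightforward dynamic programming" problem referred to; this is a generic illustration (maximum subarray sum via Kadane's algorithm), not a LiveCodeBench task:

```python
def max_subarray_sum(a):
    """Kadane's algorithm: O(N) DP over 'best sum of a subarray ending at i'.

    At each element, either extend the current run or restart at that
    element, and track the best value seen so far.
    """
    best = cur = a[0]
    for x in a[1:]:
        cur = max(x, cur + x)   # extend the run or restart at x
        best = max(best, cur)
    return best

# max_subarray_sum([-2, 1, -3, 4, -1, 2, 1, -5, 4]) -> 6 (subarray [4, -1, 2, 1])
```

Problems of this shape, with a single clean recurrence and no intricate edge-case interactions, sit squarely in the band where the model performs well.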
Failure modes observed
Common failure modes:
Misinterpretation of complex logical conditions.
Suboptimal algorithmic choice for problems with large constraints, leading to Time Limit Exceeded (TLE) or incorrect logic for specific edge cases.
Incorrect derivation of mathematical properties or recurrence relations for combinatorial problems.
Failure to account for all permutations or symmetries in counting problems.
Incorrectly applying or deriving properties of number theory.
Example: In solve(N, M, C, K, A), the model's approach to computing the modulo sum relies on a period-based computation. While conceptually sound, it misapplies the logic when K is very large relative to the period, producing an incorrect result.
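The original benchmark problem is not reproduced here, but the general pattern behind this failure can be sketched. When summing a periodic sequence over K terms, a correct solution combines the contribution of whole periods with a leftover partial prefix; the sketch below is an illustrative reconstruction under that assumption, not the actual task or the model's code:

```python
def periodic_prefix_sum(values, K):
    """Sum of values[i % P] for i in range(K), where P = len(values).

    For very large K this avoids iterating K times: multiply the sum of
    one full period by the number of complete periods, then add the
    leftover prefix. Getting the split between full periods and the
    remainder wrong is exactly the failure mode described above.
    """
    P = len(values)
    full, rem = divmod(K, P)
    return full * sum(values) + sum(values[:rem])

# Check against brute force for K much larger than the period.
vals = [3, 1, 4, 1, 5]
K = 10**6 + 3
assert periodic_prefix_sum(vals, K) == sum(vals[i % len(vals)] for i in range(K))
```

The closed-form path is O(P) regardless of K, which is why period-based reasoning is attractive on large-constraint problems and also why an off-by-one in the `divmod` split silently corrupts the answer only when K exceeds the period.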
Secondary metrics
Readability score: 0.0
Toxicity score: 0.000
Ethics score: 0.000
Run this evaluation yourself
Stratix evaluates Claude Opus 4.5 continuously across 11+ benchmarks. To replicate this LiveCodeBench evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
Source: Stratix evaluation 69259eb39171c00f3f0768e9. Updated 2025-11-25.