Claude Opus 4.5 on LiveCodeBench: 76.8% accuracy

Author: The LayerLens Team


The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.

Summary

Claude Opus 4.5 from Anthropic scored 76.8% on LiveCodeBench, placing it third of 43 models on this benchmark and in the high-tier band. At this level, the model is production-deployable on this benchmark family, with margin for prompt and judge variance.

Model details

  • Provider: Anthropic

  • Model key: anthropic/claude-opus-4.5

  • Context length: 200,000 tokens

  • License: Proprietary

  • Open weights: no

Benchmark methodology

Benchmark goal: Evaluate the LLM's ability to solve competitive programming problems, requiring logical reasoning, algorithm design, and coding skills.

Scoring metrics:

  • Pass Rate: Percentage of problems for which the LLM generated code that passed all test cases.
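As a hypothetical illustration of how the Pass Rate metric above is computed (the exact harness is not shown in this report), a problem counts as solved only when the generated code passes every test case:

```python
def pass_rate(results):
    """results: list of per-problem booleans (True = all test cases passed)."""
    if not results:
        return 0.0
    return 100.0 * sum(results) / len(results)

# 3 of 4 problems fully solved -> 75.0
print(pass_rate([True, True, False, True]))
```

A problem that passes most, but not all, of its test cases contributes nothing under this metric.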

Analysis

Key takeaways:

  • Claude Opus 4.5 exhibits strong performance on problems that can be solved with straightforward greedy, dynamic programming, or simulation approaches without overly complex edge-case interactions.

  • The model struggles with problems requiring highly optimized algorithms for very large constraints (e.g., N > 10^5) or those requiring deep theoretical insights.

  • Its ability to self-correct during problem-solving is evident in many successful completions, but its reasoning sometimes goes astray on novel or highly abstract problem types.

  • The model's performance indicates a need for improved complex combinatorial and number-theoretic reasoning.
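The large-constraint failure mode noted above (TLE for roughly N > 10^5) typically comes down to asymptotic complexity rather than raw code quality. As an illustrative example not drawn from the benchmark itself, counting equal-valued pairs in an array shows the gap between a quadratic and a linear approach:

```python
from collections import Counter

def count_equal_pairs_naive(a):
    # O(N^2): acceptable for small N, but times out once N approaches 10^5
    return sum(
        1
        for i in range(len(a))
        for j in range(i + 1, len(a))
        if a[i] == a[j]
    )

def count_equal_pairs_fast(a):
    # O(N): count occurrences per value, then sum c*(c-1)/2 pairs per value
    return sum(c * (c - 1) // 2 for c in Counter(a).values())
```

Both functions agree on small inputs; only the second survives competitive-programming time limits at large N.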

Failure modes observed

Common failure modes:

  • Misinterpretation of complex logical conditions.

  • Suboptimal algorithmic choices for problems with large constraints, leading to Time Limit Exceeded (TLE) verdicts or incorrect handling of specific edge cases.

  • Incorrect derivation of mathematical properties or recurrence relations for combinatorial problems.

  • Failure to account for all permutations or symmetries in counting problems.

  • Incorrectly applying or deriving properties of number theory.

Example: in solve(N, M, C, K, A), the model computes the modulo sum via a period-based decomposition. The approach is conceptually sound, but it misapplies the logic when K is very large relative to the period, producing an incorrect result.
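The original problem statement behind solve(N, M, C, K, A) is not given in this report, so the following is only a generic sketch of the period-based technique the model attempted: summing the first K terms of a periodic sequence by counting full cycles with divmod and then adding the leftover prefix. Getting the full-cycle/remainder split right is exactly the step the model misapplied for large K.

```python
def periodic_prefix_sum(pattern, K):
    """Sum of the first K terms of the infinite repetition of `pattern`.

    Splits K into full cycles of the period plus a remainder prefix,
    so the cost is O(len(pattern)) regardless of how large K is.
    """
    p = len(pattern)
    full_cycles, remainder = divmod(K, p)
    return full_cycles * sum(pattern) + sum(pattern[:remainder])

# First 7 terms of 1,2,3,1,2,3,1,... -> 13
print(periodic_prefix_sum([1, 2, 3], 7))
```

The common bug is handling the remainder terms inconsistently with the full cycles (e.g., off-by-one on the prefix length), which only surfaces when K far exceeds the period.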

Secondary metrics

  • Readability score: 0.0

  • Toxicity score: 0.000

  • Ethics score: 0.000

Run this evaluation yourself

Stratix evaluates Claude Opus 4.5 continuously across 11+ benchmarks. To replicate this LiveCodeBench evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.

Source: Stratix evaluation 69259eb39171c00f3f0768e9. Updated 2025-11-25.