Claude Opus 4.5 on SWE-bench Lite (SWE-agent): 49.3% accuracy

Author: The LayerLens Team

Last updated: 2025-11-26

The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.

Summary

Claude Opus 4.5 from Anthropic scored 49.3% accuracy on SWE-bench Lite (SWE-agent), placing it in the top 10 (rank 7 of 45) on this benchmark. This puts the model in the below-frontier band for SWE-bench Lite (SWE-agent): acceptable for cost-sensitive workloads or as part of a multi-model ensemble, but not a default choice for high-stakes routing.

Model details

  • Provider: Anthropic

  • Model key: anthropic/claude-opus-4.5

  • Context length: 200,000 tokens

  • License: Proprietary

  • Open weights: no

Benchmark methodology

Benchmark goal: SWE-bench Lite is a 300-instance subset of SWE-bench that evaluates whether a model can resolve real GitHub issues from popular open-source Python repositories. Each instance pairs an issue report with the repository at the state where the issue was filed; the model must produce a code patch, which counts as a resolution only if the repository's held-out tests pass. In the SWE-agent configuration, the model works through an agent scaffold that lets it navigate the codebase, view and edit files, and run commands, rather than answering from a single prompt.

Scoring metrics:

  • Accuracy (resolve rate): (Number of issues resolved / Total number of issues) * 100
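
As a concrete illustration of the metric, the sketch below computes a resolve rate from a list of per-instance outcomes. The results structure, field names, and instance IDs here are hypothetical, not the official SWE-bench harness output format.

```python
# Hypothetical sketch of computing a resolve rate from per-instance results.
# The `results` structure, field names, and instance IDs are illustrative;
# this is not the official SWE-bench harness format.

results = [
    {"instance_id": "sympy__sympy-0001", "resolved": True},
    {"instance_id": "django__django-0002", "resolved": False},
    {"instance_id": "matplotlib__matplotlib-0003", "resolved": True},
]

resolved = sum(1 for r in results if r["resolved"])
accuracy = resolved / len(results) * 100  # (issues resolved / total issues) * 100
print(f"Accuracy: {accuracy:.1f}%")       # Accuracy: 66.7%
```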

Analysis

Key takeaways:

  • The model demonstrates a foundational understanding of many tasks but struggles with deep, library-specific logic errors and edge cases, particularly in mathematical and code-generation contexts.

  • While it can identify and implement fixes for explicit issues, more abstract or complex errors related to internal library behavior remain challenging.

  • Frequent TypeError and AttributeError failures highlight a need for stronger handling of object types across different library versions and contexts (illustrated in the sketch after this list).

  • The model needs improvement in handling specific regex flags and interpreting complex number properties within mathematical simplification routines.
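
To make the TypeError/AttributeError takeaway concrete, here is a minimal sketch of the error class when an attribute is renamed between library versions. The classes and attribute names are invented for this sketch, not taken from any evaluated repository.

```python
# Hypothetical classes illustrating an attribute renamed between library
# versions; all names here are invented for this sketch.

class LegacyResult:            # older version exposes `.items`
    def __init__(self, data):
        self.items = data

class ModernResult:            # newer version renamed it to `.entries`
    def __init__(self, data):
        self.entries = data

def count_items(result):
    return len(result.items)  # breaks on ModernResult instances

def count_items_safe(result):
    # Version-tolerant lookup: check `is None` rather than truthiness so
    # an empty list is not mistaken for a missing attribute.
    data = getattr(result, "items", None)
    if data is None:
        data = getattr(result, "entries", [])
    return len(data)

# count_items(ModernResult([1, 2]))
# -> AttributeError: 'ModernResult' object has no attribute 'items'
print(count_items_safe(ModernResult([1, 2])))     # 2
print(count_items_safe(LegacyResult([1, 2, 3])))  # 3
```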

Failure modes observed

Common failure modes:

  • Incorrect handling of data types and object attributes in library-specific methods.

  • Misinterpretation of regex patterns and string-to-object conversions, leading to unexpected behavior (see the regex-flag example after this list).

  • Failure to propagate changes, or to correctly interpret interactions, across complex object hierarchies and in special cases such as empty lists or arrays.

  • Inconsistent behavior in simplification and printing routines.
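
The regex-flag sensitivity called out above (and in the key takeaways) is easy to demonstrate: a single flag changes what `^` anchors to, so a patch that drops or adds it silently changes matching behavior. This snippet illustrates the failure class; it is not an excerpt from an actual failing trace.

```python
import re

text = "alpha\nbeta"

# Without re.MULTILINE, ^ anchors only at the start of the whole string.
print(re.findall(r"^\w+", text))                # ['alpha']

# With re.MULTILINE, ^ also matches at the start of each line.
print(re.findall(r"^\w+", text, re.MULTILINE))  # ['alpha', 'beta']
```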

Example: One notable failure occurred when sympy.simplify.trigsimp encountered complex exponents, leading to a TypeError during implicit comparison operations within helper functions. The model was unable to resolve this complex number comparison issue in the simplification pipeline.
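
The underlying mechanism is easy to reproduce in isolation: Python defines no ordering for complex values, so any helper that implicitly compares a complex exponent raises a TypeError. The snippet below is a minimal reproduction of the error class, not the actual sympy code path.

```python
# Minimal reproduction of the error class, not the actual sympy code path.
exponent = complex(0, 2)  # stand-in for a complex exponent such as 2*I

try:
    exponent > 1          # implicit ordering comparison inside a helper
except TypeError as exc:
    print(exc)  # '>' not supported between instances of 'complex' and 'int'
```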

Secondary metrics

  • Readability score: 2.9

  • Toxicity score: 0.002

  • Ethics score: 0.000

Run this evaluation yourself

Stratix evaluates Claude Opus 4.5 continuously across 11+ benchmarks. To replicate this SWE-bench Lite (SWE-agent) evaluation with your own model, your own traces, or a different benchmark configuration, open the model in Stratix.

_Source: Stratix evaluation 6925c314a686aae24e71e36a. Updated 2025-11-26._