
Claude Opus 4.6 on SWE-bench Lite (SWE-agent): 62.7% accuracy
Author: The LayerLens Team
The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.
Summary
Claude Opus 4.6 from Anthropic scored 62.7% on SWE-bench Lite (SWE-agent), placing it first of the 45 models evaluated on this benchmark and squarely in the competitive band. The score clears the cost-effectiveness threshold for most production workloads. For agent use cases, pair the model with a step-level evaluation harness; a minimal sketch of what that means follows.
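As a concrete illustration, a step-level harness scores each agent action (search, edit, test run) rather than only the final patch. The sketch below is hypothetical: the trace schema and field names are illustrative and not part of Stratix or SWE-agent.

# Hypothetical step-level trace; a harness can score each StepRecord
# individually instead of only the final patch.
from dataclasses import dataclass, field

@dataclass
class StepRecord:
    index: int
    action: str        # e.g. "search", "edit", "run_tests"
    observation: str   # truncated tool output
    ok: bool           # whether the step advanced toward a fix

@dataclass
class Trace:
    instance_id: str
    steps: list[StepRecord] = field(default_factory=list)

trace = Trace("sympy__sympy-20154")
trace.steps.append(StepRecord(0, "search", "located partitions() in iterables.py", True))
step_accuracy = sum(s.ok for s in trace.steps) / len(trace.steps)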
Model details
Provider: Anthropic
Model key: anthropic/claude-opus-4.6
Context length: 200,000 tokens
License: Proprietary
Open weights: no
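For reference, a direct query against the model might look like the sketch below. The Anthropic SDK calls are real; the model id is inferred from the key above and should be verified against Anthropic's documentation.

import anthropic  # pip install anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-opus-4.6",  # assumed id, taken from the model key above
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize the risks of reusing mutable generator output."}],
)
print(response.content[0].text)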
Benchmark methodology
Benchmark goal: To evaluate the model's ability to identify and resolve bugs in existing codebases, understand code snippets, implement new features, and handle edge cases, focusing on Python libraries like AstroPy, Django, and SymPy.
Scoring metrics:
Pass/Fail Score: Binary metric indicating whether the model resolved the issue (1 for pass, 0 for fail); the headline accuracy is the mean of this score across instances (see the sketch after this list).
Readability: Measure of the clarity, conciseness, and comprehensibility of the generated code and text.
Toxicity: Measure of toxic content in the generated text.
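To make the aggregation concrete, here is a minimal sketch of how binary per-instance results roll up into the reported figure; the instance ids and values are illustrative.

# Per-instance pass/fail results (illustrative).
results = {
    "django__django-11099": 1,
    "sympy__sympy-20154": 0,
    "astropy__astropy-12907": 1,
}
accuracy = 100.0 * sum(results.values()) / len(results)
print(f"{accuracy:.1f}% resolved")
# SWE-bench Lite has 300 instances, so 62.7% corresponds to roughly 188 resolved tasks.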
Analysis
Key takeaways:
The model reliably reproduces bugs and applies surgical fixes for specific error scenarios, achieving a strong pass rate on individual bug-fixing tasks.
However, it struggles with tasks that require a deeper inferential leap or a fuller understanding of the underlying symbolic engine's behavior.
Overall, performance shows strength in straightforward bug fixes and minor feature implementations, but notable weakness in subtle symbolic logic, edge-case handling for data structures, and maintaining consistency across library components.
Failure modes observed
Common failure modes:
Misinterpreting internal library behavior, especially concerning evaluation rules and object types.
Failure to correctly apply type and condition checks where symbolic evaluation might lead to ambiguous or unhandled cases.
Incorrect propagation of assumptions, and shallow contextual understanding, in complex symbolic manipulation.
Issues with object equality and hashability leading to incorrect filtering or processing of collections (illustrated after this list).
Regressions in handling input types or edge cases that were previously covered.
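The equality/hashability mode is a generic Python pitfall, sketched below; the snippet is an illustration, not taken from the evaluated traces.

# Deduplicating with a set fails outright for unhashable items such as
# dicts, forcing a slower equality-based fallback that is easy to get wrong.
items = [{"n": 1}, {"n": 1}, {"n": 2}]
try:
    unique = set(items)            # raises TypeError: unhashable type: 'dict'
except TypeError:
    unique = []
    for item in items:
        if item not in unique:     # pairwise equality checks instead of hashing
            unique.append(item)
print(unique)                      # [{'n': 1}, {'n': 2}]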
Example: on the issue where partitions() reused its output dictionaries, the model generated a fix for the core problem of dictionary copying, but its second proposed change failed because the logic was implemented incorrectly. The split result highlights the model's uneven success at changes that depend on whether the original output structure must be preserved; a reproduction of the underlying pitfall follows.
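On SymPy versions affected by that issue, partitions() yields the same dictionary object on every iteration, so collecting results without copying produces aliased entries. This sketch shows the caller-side symptom and workaround, not the model's patch.

from sympy.utilities.iterables import partitions

raw = list(partitions(4))                    # aliased dicts on affected versions
copied = [p.copy() for p in partitions(4)]   # standard caller-side workaround

print(raw)     # affected versions: the final partition repeated five times
print(copied)  # all five distinct partitions of 4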
Secondary metrics
Readability score: 10.0
Toxicity score: 0.003
Ethics score: 0.000
Run this evaluation yourself
Stratix evaluates Claude Opus 4.6 continuously across 11+ benchmarks. To replicate this SWE-bench Lite (SWE-agent) evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
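Independent of Stratix, predictions can also be scored with the public SWE-bench harness. The sketch below assumes the swebench package's run_evaluation entry point and its documented predictions format; the instance id and patch are placeholders.

import json

# One prediction per SWE-bench instance; the harness applies model_patch
# inside the task's container and runs the repository's test suite.
predictions = [
    {
        "instance_id": "sympy__sympy-20154",
        "model_name_or_path": "claude-opus-4.6",   # free-form label
        "model_patch": "<unified diff produced by the agent>",
    }
]
with open("preds.json", "w") as f:
    json.dump(predictions, f)

# Then score (shell):
#   python -m swebench.harness.run_evaluation \
#       --dataset_name princeton-nlp/SWE-bench_Lite \
#       --predictions_path preds.json --max_workers 8 --run_id opus-check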
Source: Stratix evaluation 6984fce05a32e67148f2f6d1. Updated 2026-02-06.