GPT-5 (high) on SWE-bench Lite (SWE-agent): 51.7% accuracy

Author:

The LayerLens Team

The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.

Summary

GPT-5 (high) from OpenAI scored 51.7% on SWE-bench Lite (SWE-agent), placing it in the top 10 (rank 6 of 45) on this benchmark. Despite the top-ten rank, the score falls in the below-frontier band for SWE-bench Lite (SWE-agent): acceptable for cost-sensitive workloads or as part of a multi-model ensemble, but not a default choice for high-stakes routing.

Model details

  • Provider: OpenAI

  • Model key: openai/gpt-5-high

  • Context length: 400,000 tokens

  • License: Proprietary

  • Open weights: no

Benchmark methodology

Benchmark goal: SWE-bench Lite evaluates whether a Large Language Model (LLM) can resolve real-world software engineering tasks. Each task instance pairs a GitHub issue from a popular open-source Python repository with the repository snapshot at the time the issue was filed; the model, run here through the SWE-agent scaffold, must produce a code patch that resolves the issue.

Scoring metrics:

  • Accuracy: The proportion of task instances the model resolves, expressed as a percentage. An instance counts as resolved when the generated patch applies cleanly and the repository's tests pass, including the tests that originally reproduced the issue.
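The scoring rule above can be sketched as follows. The record fields and instance IDs are illustrative, not the evaluation harness's actual schema:

```python
from dataclasses import dataclass

@dataclass
class InstanceResult:
    instance_id: str
    patch_applied: bool     # did the generated patch apply cleanly?
    fail_to_pass_ok: bool   # do the previously failing tests now pass?
    pass_to_pass_ok: bool   # do the previously passing tests still pass?

def resolved(r: InstanceResult) -> bool:
    # An instance counts as resolved only if the patch applies
    # and both groups of tests succeed.
    return r.patch_applied and r.fail_to_pass_ok and r.pass_to_pass_ok

def accuracy(results: list[InstanceResult]) -> float:
    # Accuracy is the resolved fraction, expressed as a percentage.
    return 100.0 * sum(resolved(r) for r in results) / len(results)

results = [
    InstanceResult("django__django-11133", True, True, True),
    InstanceResult("sympy__sympy-13177", True, False, True),
    InstanceResult("astropy__astropy-6938", False, False, False),
]
print(f"{accuracy(results):.1f}%")  # 1 of 3 resolved → 33.3%
```

A model's headline score is simply this percentage over the full set of benchmark instances.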

Analysis

Key takeaways:

  • The model performs well on tasks requiring direct code modification and logical bug fixes within a single component.

  • It struggles with tasks that require a deeper understanding of system-wide implications, object lifecycle, or complex inter-component interactions.

  • Improvements are needed in handling diverse data types, especially when explicit type conversions or specific evaluation contexts are involved.

  • The model shows potential for feature implementation but needs refinement in anticipating and testing the full scope of requested changes.

Failure modes observed

Common failure modes:

  • Misinterpretation of problem context leading to partial or incorrect fixes.

  • Failure to fully trace side effects of code changes across interconnected systems.

  • Inadequate handling of edge cases or specific data types.

  • Difficulty in applying global logical changes that account for different states or configurations.

  • Incorrect understanding of internal object representations and how they are handled by printing or serialization functions.

Example: In the task involving HttpResponse and memoryview objects, the model initially failed to convert bytearray objects correctly, leading to incorrect byte representations instead of expected content.
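The pitfall in that task can be illustrated with a minimal sketch (this is not the model's actual patch). Coercing a bytearray through str() captures its repr rather than its contents, whereas bytes() yields the underlying data:

```python
# Sketch of the bytearray-conversion failure mode described above.
content = bytearray(b"My Content")

wrong = str(content).encode()  # repr leaks into the payload
right = bytes(content)         # the underlying bytes

print(wrong)  # b"bytearray(b'My Content')"
print(right)  # b'My Content'
```

A correct fix must branch on the content type (bytes, bytearray, memoryview, str) rather than funnel everything through a single str() coercion; the failure suggests the model did not trace how the response class serializes each type.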

Secondary metrics

  • Readability score: 14.9

  • Toxicity score: 0.002

  • Ethics score: 0.000

Run this evaluation yourself

Stratix evaluates GPT-5 (high) continuously across 11+ benchmarks. To replicate this SWE-bench Lite (SWE-agent) evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.

_Source: Stratix evaluation 6912046c61ef3abf3656c486. Updated 2025-11-11._