
GPT-5 (high) on SWE-bench Lite (SWE-agent): 51.7% accuracy
Author:
The LayerLens Team
The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.
Summary
GPT-5 (high) from OpenAI scored 51.7% accuracy on SWE-bench Lite (SWE-agent), ranking 6th of 45 models. That puts it in the top 10 on this benchmark, but still within the below-frontier band. The model is acceptable for cost-sensitive workloads or as part of a multi-model ensemble, but it is not a default choice for high-stakes routing.
Model details
Provider: OpenAI
Model key:
openai/gpt-5-high
Context length: 400,000 tokens
License: Proprietary
Open weights: no
Benchmark methodology
Benchmark goal: SWE-bench evaluates whether a model can resolve real-world GitHub issues by generating code patches against the issue's repository; an instance counts as resolved only if the generated patch passes the repository's tests. SWE-bench Lite is a curated 300-instance subset of SWE-bench, and SWE-agent is the agent scaffold the model uses to navigate the repository and produce the patch.
Scoring metrics:
Accuracy: The proportion of task instances the model resolves (i.e., its generated patch passes the repository's test suite), expressed as a percentage.
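The resolved-rate metric above can be sketched in a few lines of Python. The function name and the boolean-per-instance representation are illustrative, not the harness's actual schema:

```python
# Minimal sketch: accuracy as the fraction of resolved task instances,
# reported as a percentage. Field shapes here are illustrative only.
def accuracy(resolved: list[bool]) -> float:
    """Percentage of instances whose patch passed the tests."""
    if not resolved:
        return 0.0
    return 100.0 * sum(resolved) / len(resolved)

# Example: 155 of 300 SWE-bench Lite instances resolved -> 51.7%.
print(round(accuracy([True] * 155 + [False] * 145), 1))
```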
Analysis
Key takeaways:
The model performs well on tasks requiring direct code modification and logical bug fixes within a single component.
It struggles with tasks that require a deeper understanding of system-wide implications, object lifecycle, or complex inter-component interactions.
Improvements are needed in handling diverse data types, especially when explicit type conversions or specific evaluation contexts are involved.
The model shows potential for feature implementation but needs refinement in anticipating and testing the full scope of requested changes.
Failure modes observed
Common failure modes:
Misinterpretation of problem context leading to partial or incorrect fixes.
Failure to fully trace side effects of code changes across interconnected systems.
Inadequate handling of edge cases or specific data types.
Difficulty in applying global logical changes that account for different states or configurations.
Incorrect understanding of internal object representations and how they are handled by printing or serialization functions.
Example: In the task involving HttpResponse and memoryview objects, the model initially failed to convert memoryview and bytearray objects correctly, producing string representations of the objects instead of the expected raw content.
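The failure pattern can be illustrated with a small sketch. This mirrors the issue described above rather than Django's actual HttpResponse implementation: naively coercing content with str() mangles memoryview and bytearray values, while an explicit bytes() conversion preserves the underlying buffer:

```python
# Hypothetical sketch of the described failure, not Django's real code.
def make_bytes_naive(value):
    # Buggy: memoryview/bytearray fall through to str() and become
    # their repr (e.g. b'<memory at 0x...>') rather than their content.
    if isinstance(value, bytes):
        return value
    return str(value).encode("utf-8")

def make_bytes_fixed(value):
    # Correct: bytes() copies the raw buffer contents of binary types.
    if isinstance(value, (bytes, memoryview, bytearray)):
        return bytes(value)
    return str(value).encode("utf-8")

content = memoryview(b"My Content")
print(make_bytes_naive(content))  # b'<memory at 0x...>' (wrong)
print(make_bytes_fixed(content))  # b'My Content' (expected)
```

The fix is the kind of one-line type check the model reportedly missed: recognizing that memoryview and bytearray need an explicit bytes() conversion rather than string coercion.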
Secondary metrics
Readability score: 14.9
Toxicity score: 0.002
Ethics score: 0.000
Run this evaluation yourself
Stratix evaluates GPT-5 (high) continuously across 11+ benchmarks. To replicate this SWE-bench Lite (SWE-agent) evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
_Source: Stratix evaluation 6912046c61ef3abf3656c486. Updated 2025-11-11._