
GPT-5 on SWE-bench Lite (SWE-agent): 47.3% accuracy
Author: The LayerLens Team
The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.
Summary
GPT-5 from OpenAI scored 47.3% on SWE-bench Lite (SWE-agent), placing it in the top 10 (rank 8 of 45) on this benchmark. That score still falls in the below-frontier band for SWE-bench Lite (SWE-agent): acceptable for cost-sensitive workloads or as part of a multi-model ensemble, but not a default choice for high-stakes routing.
Model details
Provider: OpenAI
Model key: openai/gpt-5
Context length: 400,000 tokens
License: Proprietary
Open weights: no
Benchmark methodology
Benchmark goal: Assess the performance of the LLM (GPT-5) at identifying, understanding, and fixing real bugs in open-source Python repositories. SWE-bench Lite draws its task instances from popular projects such as SymPy (symbolic mathematics) and Astropy (astronomy).
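For readers who want to look at the underlying tasks, the SWE-bench Lite dataset is public on Hugging Face and can be loaded in a few lines. A minimal sketch (field names follow the dataset's published schema; the printed instance is just an example):

```python
from datasets import load_dataset

# SWE-bench Lite's test split: 300 real GitHub issues with reference patches.
lite = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")

example = lite[0]
print(example["instance_id"])              # e.g. "astropy__astropy-12907"
print(example["repo"])                     # repository the issue was filed against
print(example["problem_statement"][:300])  # issue text the agent must resolve
```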
Scoring metrics:
Score: Binary metric (0 or 1) indicating whether the LLM successfully fixed the bug: 1 means the patch fully resolved the issue, 0 means failure (see the scoring sketch after this list).
Duration: Time taken by the LLM to complete the task, measured in seconds.
Toxicity: A measure of the toxicity of the model's generations.
Readability: Readability of the model's generations, calculated using Flesch Reading Ease.
Hallucination: Not yet calculated; this field is planned for a future update.
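To illustrate how the two primary fields are produced, here is a minimal per-instance scoring sketch. It is a reconstruction, not Stratix's actual harness: the run_agent and run_tests callbacks are hypothetical, and the readability helper implements the standard published Flesch Reading Ease formula.

```python
import time

def flesch_reading_ease(n_words: int, n_sentences: int, n_syllables: int) -> float:
    # Standard Flesch Reading Ease formula; higher scores read more easily.
    return (206.835
            - 1.015 * (n_words / n_sentences)
            - 84.6 * (n_syllables / n_words))

def score_instance(run_agent, run_tests, instance) -> dict:
    """Hypothetical per-instance scoring: time the agent while it produces a
    patch, then rerun the repository's tests to get the binary score."""
    start = time.monotonic()
    model_patch = run_agent(instance)          # hypothetical: one agent run
    duration = time.monotonic() - start        # seconds the LLM spent on the task
    passed = run_tests(instance, model_patch)  # hypothetical: True iff every
                                               # relevant test passes post-patch
    return {"score": 1 if passed else 0, "duration_s": duration}
```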
Analysis
Key takeaways:
GPT-5 is effective on specific SymPy code problems: it is an excellent code generator at the file level but a weaker code editor.
GPT-5 has major reliability issues on larger problems that require a file-editing tool.
The identified weaknesses point to concrete areas for improvement: better handling of file paths and tool parameters.
Failure modes observed
Common failure modes:
Incorrect path specifications when editing files, causing the attempted action to fail.
Inability to handle directory-based operations, and missing view ranges when modifying files.
Logic errors in generated code, such as incorrect replacement patterns in '/testbed/astropy/modeling/separable.py'.
Failure to persist file writes when they fall outside the immediate scope of operations described in the prompt.
Example: On the 'default' subset problem 'Modeling's separability_matrix does not compute separability correctly for nested CompoundModels', GPT-5 fails because of argument errors and misconfiguration when editing '/testbed/astropy/modeling/separable.py'. A defensive-validation sketch for these tool-call errors follows below.
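Notably, most of these failures are malformed tool calls rather than bad reasoning about the code itself. Below is a minimal sketch of the kind of pre-flight argument validation that would surface such errors early; it is illustrative scaffolding under assumed parameter names (path, old_str, view_range), not SWE-agent's actual editor interface.

```python
from pathlib import Path

class ToolCallError(ValueError):
    """Raised when a file-edit tool call is malformed and should be retried."""

def validate_edit_call(path: str, old_str: str, view_range=None) -> Path:
    """Pre-flight checks mirroring the failure modes above: bad paths,
    directory targets, ambiguous patterns, and malformed view ranges."""
    target = Path(path)
    if not target.exists():
        raise ToolCallError(f"path does not exist: {path}")  # bad path spec
    if target.is_dir():
        # directory-based operation where a file was required
        raise ToolCallError(f"path is a directory, not a file: {path}")
    text = target.read_text()
    if old_str and text.count(old_str) != 1:
        # incorrect replacement pattern: zero or ambiguous matches
        raise ToolCallError("replacement pattern must match exactly once")
    if view_range is not None and view_range[0] > view_range[1]:
        raise ToolCallError(f"invalid view range: {view_range}")  # missing/garbled range
    return target
```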
Secondary metrics
Readability score: 21.3
Toxicity score: 0.001
Ethics score: 0.000
Run this evaluation yourself
Stratix evaluates GPT-5 continuously across 11+ benchmarks. To replicate this SWE-bench Lite (SWE-agent) evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
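Outside Stratix, the binary pass/fail half of this evaluation can also be reproduced with the open-source SWE-bench harness. A rough sketch, assuming the swebench package and its documented JSONL predictions format (verify the flag names against the current SWE-bench README):

```python
import json
import subprocess

# One prediction per task instance, in the JSONL format the SWE-bench
# harness documents: instance id, model name, and the model's patch.
predictions = [
    {
        "instance_id": "astropy__astropy-12907",  # illustrative instance id
        "model_name_or_path": "openai/gpt-5",
        "model_patch": "diff --git a/... (unified diff produced by the agent)",
    },
]
with open("predictions.jsonl", "w") as f:
    for pred in predictions:
        f.write(json.dumps(pred) + "\n")

# Invoke the harness; flag names follow the SWE-bench repository's README.
subprocess.run(
    [
        "python", "-m", "swebench.harness.run_evaluation",
        "--dataset_name", "princeton-nlp/SWE-bench_Lite",
        "--predictions_path", "predictions.jsonl",
        "--max_workers", "4",
        "--run_id", "gpt5-lite-repro",
    ],
    check=True,
)
```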
_Source: Stratix evaluation 6903d3a6448b49fea170d5d8. Updated 2025-10-31._