GPT-5 on SWE-bench Lite (SWE-agent): 47.3% accuracy

Author:

The LayerLens Team

Last updated:

2025-10-31

The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.

Summary

GPT-5 from OpenAI scored 47.3% on SWE-bench Lite (SWE-agent), ranking 8th of 45 models (top 10) on this benchmark. That places the model in the below-frontier band for SWE-bench Lite (SWE-agent): acceptable for cost-sensitive workloads or as part of a multi-model ensemble, but not a default choice for high-stakes routing.

Model details

  • Provider: OpenAI

  • Model key: openai/gpt-5

  • Context length: 400,000 tokens

  • License: Proprietary

  • Open weights: no

Benchmark methodology

Benchmark goal: Assess the performance of the LLM (GPT-5) in identifying, understanding, and fixing real bugs drawn from open-source Python repositories such as SymPy (symbolic mathematics) and Astropy (astronomy), using the SWE-agent scaffold to navigate and edit each codebase.
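
Harnesses in the SWE-bench family typically score a model by applying its generated patch to the target repository and running the task's hidden tests. A minimal sketch of what one prediction record could look like, assuming the public SWE-bench predictions format (the instance ID and patch content here are illustrative, not taken from this evaluation's traces):

```python
# One SWE-bench-style prediction record: the agent's proposed fix,
# expressed as a unified diff against the target repository.
# Field names assume the public SWE-bench predictions format;
# the instance ID and patch body are illustrative placeholders.
prediction = {
    "instance_id": "astropy__astropy-12907",  # illustrative task identifier
    "model_name_or_path": "openai/gpt-5",
    "model_patch": (
        "diff --git a/astropy/modeling/separable.py "
        "b/astropy/modeling/separable.py\n"
        "--- a/astropy/modeling/separable.py\n"
        "+++ b/astropy/modeling/separable.py\n"
        "@@ ... @@\n"
        "-    cright[-right.shape[0]:, -right.shape[1]:] = 1\n"
        "+    cright[-right.shape[0]:, -right.shape[1]:] = right\n"
    ),
}
```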

Scoring metrics:

  • Score: Binary metric (0 or 1) indicating whether the LLM successfully fixed the bug; 1 indicates complete success, 0 indicates failure. The headline accuracy is the mean of these per-task scores (see the sketch after this list).

  • Duration: Time taken by the LLM to complete the task, measured in seconds.

  • Toxicity: A measure of toxicity in the model's generated outputs.

  • Readability: Readability of the model's output, calculated using the Flesch Reading Ease score.

  • Hallucination: Not calculated yet; support is planned for a future release.
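
Because the primary metric is binary per task, the headline score is simply the mean of these 0/1 outcomes over all instances. A minimal sketch with hypothetical per-task records:

```python
# Aggregate per-task binary scores into a benchmark-level accuracy.
# `results` is hypothetical: one record per SWE-bench Lite instance.
results = [
    {"instance_id": "sympy__sympy-0001", "score": 1, "duration_s": 312.4},
    {"instance_id": "sympy__sympy-0002", "score": 0, "duration_s": 845.1},
    {"instance_id": "astropy__astropy-0003", "score": 1, "duration_s": 120.9},
]

accuracy = sum(r["score"] for r in results) / len(results)
mean_duration = sum(r["duration_s"] for r in results) / len(results)

print(f"accuracy: {accuracy:.1%}")           # 66.7% on this toy sample
print(f"mean duration: {mean_duration:.1f}s")
```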

Analysis

Key takeaways:

  • GPT-5 is effective on specific SymPy code problems: it excels at generating code at the file level but is weaker as an in-place code editor.

  • GPT-5 shows major reliability issues on larger problems that require the file-editing tool.

  • The identified weaknesses point to room for improvement in handling file paths and tool parameters.

Failure modes observed

Common failure modes:

  • Incorrect path specifications when editing files, causing the edit action to fail.

  • Inability to handle directory-based operations, and missing view ranges when modifying files.

  • Logic errors in generated code, such as incorrect replacement patterns in '/testbed/astropy/modeling/separable.py'.

  • Failure to persist file writes that fall outside the immediate scope of operations defined in the prompt.

Example: In the 'default' subset, on the problem 'Modeling's separability_matrix does not compute separability correctly for nested CompoundModels', GPT-5 fails due to argument errors and misconfigurations when editing '/testbed/astropy/modeling/separable.py'.
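
These failures suggest the agent often issues edit-tool calls without first validating its own arguments. Below is a minimal sketch of the pre-flight checks a scaffold could run before a string-replacement edit; the tool semantics are assumed (modeled loosely on SWE-agent-style editors), not taken from the Stratix pipeline:

```python
from pathlib import Path

def validate_str_replace(path: str, old: str) -> str | None:
    """Return an error message if a string-replacement edit would fail,
    else None. Checks mirror the failure modes above: bad paths,
    directory targets, and non-matching replacement patterns."""
    p = Path(path)
    if not p.exists():
        return f"path does not exist: {path}"  # incorrect path specification
    if p.is_dir():
        return f"target is a directory, not a file: {path}"  # directory-based op
    text = p.read_text()
    count = text.count(old)
    if count == 0:
        return "old string not found; replacement pattern does not match"
    if count > 1:
        return f"old string matches {count} locations; edit is ambiguous"
    return None
```

Running a check like this before invoking the edit tool turns a silently failed action into an error message the agent can act on.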

Secondary metrics

  • Readability score: 21.3 (Flesch Reading Ease; lower values are harder to read; a computation sketch follows this list)

  • Toxicity score: 0.001

  • Ethics score: 0.000
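
For context, Flesch Reading Ease is computed as 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words); higher is easier to read, and 21.3 sits in the "very difficult" range typical of dense technical prose. A minimal sketch, with a rough vowel-group heuristic standing in for a real syllable counter (an approximation, not necessarily the tokenization Stratix uses):

```python
import re

def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease:
    206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words).
    Syllables are estimated by counting vowel groups, a rough heuristic."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    syllables = sum(
        max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words
    )
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (syllables / n_words)

print(round(flesch_reading_ease("The separability matrix is computed here."), 1))
```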

Run this evaluation yourself

Stratix evaluates GPT-5 continuously across 11+ benchmarks. To replicate this SWE-bench Lite (SWE-agent) evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.

_Source: Stratix evaluation 6903d3a6448b49fea170d5d8. Updated 2025-10-31._