GPT-5 (high) on SWE-bench Lite (SWE-agent): 51.7% accuracy

Author:

The LayerLens Team

The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.

Summary

GPT-5 (high) from OpenAI scored 51.7% on SWE-bench Lite (SWE-agent), placing it in the top 10 (rank 6 of 45) on this benchmark. Despite the top-ten rank, the score falls in the below-frontier band for SWE-bench Lite (SWE-agent): acceptable for cost-sensitive workloads or as part of a multi-model ensemble, but not a default choice for high-stakes routing.

Model details

  • Provider: OpenAI

  • Model key: openai/gpt-5-high

  • Context length: 400,000 tokens

  • License: Proprietary

  • Open weights: no

Benchmark methodology

Benchmark goal: SWE-bench Lite evaluates whether a Large Language Model (LLM) can resolve real-world software engineering tasks. Each task instance pairs a GitHub issue from a popular open-source Python repository with the repository snapshot at the time the issue was filed; the model, run here through the SWE-agent scaffold, must produce a code patch that resolves the issue.

Scoring metrics:

  • Accuracy: The proportion of task instances the model resolves, expressed as a percentage. An instance counts as resolved when the generated patch applies cleanly and the repository's tests pass, including the tests that originally reproduced the issue.
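The scoring rule above can be sketched as follows. The record fields and instance IDs are illustrative, not the evaluation harness's actual schema:

```python
from dataclasses import dataclass

@dataclass
class InstanceResult:
    instance_id: str
    patch_applied: bool     # did the generated patch apply cleanly?
    fail_to_pass_ok: bool   # do the previously failing tests now pass?
    pass_to_pass_ok: bool   # do the previously passing tests still pass?

def resolved(r: InstanceResult) -> bool:
    # An instance counts as resolved only if the patch applies
    # and both groups of tests succeed.
    return r.patch_applied and r.fail_to_pass_ok and r.pass_to_pass_ok

def accuracy(results: list[InstanceResult]) -> float:
    # Accuracy is the resolved fraction, expressed as a percentage.
    return 100.0 * sum(resolved(r) for r in results) / len(results)

results = [
    InstanceResult("django__django-11133", True, True, True),
    InstanceResult("sympy__sympy-13177", True, False, True),
    InstanceResult("astropy__astropy-6938", False, False, False),
]
print(f"{accuracy(results):.1f}%")  # 1 of 3 resolved → 33.3%
```

A model's headline score is simply this percentage over the full set of benchmark instances.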

Analysis

Key takeaways:

  • The model performs well on tasks requiring direct code modification and logical bug fixes within a single component.

  • It struggles with tasks that require a deeper understanding of system-wide implications, object lifecycle, or complex inter-component interactions.

  • Improvements are needed in handling diverse data types, especially when explicit type conversions or specific evaluation contexts are involved.

  • The model shows potential for feature implementation but needs refinement in anticipating and testing the full scope of requested changes.

Failure modes observed

Common failure modes:

  • Misinterpretation of problem context leading to partial or incorrect fixes.

  • Failure to fully trace side effects of code changes across interconnected systems.

  • Inadequate handling of edge cases or specific data types.

  • Difficulty in applying global logical changes that account for different states or configurations.

  • Incorrect understanding of internal object representations and how they are handled by printing or serialization functions.

Example: In the task involving HttpResponse and memoryview objects, the model initially failed to convert bytearray objects correctly, leading to incorrect byte representations instead of expected content.
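The pitfall in that task can be illustrated with a minimal sketch (this is not the model's actual patch). Coercing a bytearray through str() captures its repr rather than its contents, whereas bytes() yields the underlying data:

```python
# Sketch of the bytearray-conversion failure mode described above.
content = bytearray(b"My Content")

wrong = str(content).encode()  # repr leaks into the payload
right = bytes(content)         # the underlying bytes

print(wrong)  # b"bytearray(b'My Content')"
print(right)  # b'My Content'
```

A correct fix must branch on the content type (bytes, bytearray, memoryview, str) rather than funnel everything through a single str() coercion; the failure suggests the model did not trace how the response class serializes each type.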

Secondary metrics

  • Readability score: 14.9

  • Toxicity score: 0.002

  • Ethics score: 0.000

Run this evaluation yourself

Stratix evaluates GPT-5 (high) continuously across 11+ benchmarks. To replicate this SWE-bench Lite (SWE-agent) evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.

_Source: Stratix evaluation 6912046c61ef3abf3656c486. Updated 2025-11-11._