
GPT-5 (high) on Terminal-Bench (Terminus-2): 42.5% accuracy
Author:
The LayerLens Team
The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.
Summary
GPT-5 (high) from OpenAI scored 42.5% on Terminal-Bench (Terminus-2), ranking 10th of 21 models on this benchmark. This places the model in the below-frontier band for Terminal-Bench (Terminus-2). It is acceptable for cost-sensitive workloads or as part of a multi-model ensemble, but not a default choice for high-stakes routing.
Model details
Provider: OpenAI
Model key: openai/gpt-5-high
Context length: 400,000 tokens
License: Proprietary
Open weights: no
Benchmark methodology
Benchmark goal: Evaluate the LLM's ability to solve a diverse set of programming and reasoning tasks.
Scoring metrics:
Score: Binary success (1) or failure (0) based on task completion and correctness.
Duration: Time taken to complete the task (in seconds).
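Under this scoring scheme, the headline accuracy is simply the fraction of tasks with a score of 1. A minimal sketch of that aggregation, assuming per-task records of binary score and duration (the task names and numbers below are hypothetical, loosely echoing the failure examples later in this report):

```python
# Hypothetical per-task results: each task gets a binary score
# (1 = completed correctly, 0 = failed) and a duration in seconds.
task_results = [
    {"task": "setup-git-server", "score": 0, "duration_s": 312.4},
    {"task": "fix-fasttext-install", "score": 0, "duration_s": 128.9},
    {"task": "write-web-scraper", "score": 1, "duration_s": 254.1},
    {"task": "build-linux-kernel", "score": 0, "duration_s": 601.7},
]

# Accuracy = fraction of tasks fully completed.
accuracy = sum(r["score"] for r in task_results) / len(task_results)

# Mean time per task, a secondary cost signal.
mean_duration = sum(r["duration_s"] for r in task_results) / len(task_results)

print(f"accuracy: {accuracy:.1%}")
print(f"mean duration: {mean_duration:.1f}s")
```

Because each task is all-or-nothing, partial progress (e.g., installing dependencies but failing the final build) contributes nothing to the score.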
Analysis
Key takeaways:
The LLM demonstrates proficiency in completing individual programming and environment-setup tasks, but struggles with complex, multi-stage problems.
Planning and robustness in adapting to unseen dependencies remain significant challenges.
The model shows inconsistent performance on code generation tasks, failing to develop robust algorithms or adapt to environmental constraints.
Failure modes observed
Common failure modes:
Failure to follow complex instructions involving multiple tools and configurations.
Incomplete or incorrect code generation, leading to non-functional solutions.
Inability to plan multi-step tasks toward the expected goal (e.g., creating a website profile from info.md).
Inability to re-engineer binaries with identical behavior.
Dependency errors, such as failing to fix a broken fasttext installation.
Example: Failing to build a complete end-to-end system, such as setting up a Git server with deployment or creating a fully functional web scraper. When building the Linux kernel, the agent fails to apply the custom configuration.
Secondary metrics
Readability score: 0.000
Toxicity score: 0.000
Ethics score: 0.000
Run this evaluation yourself
Stratix evaluates GPT-5 (high) continuously across 11+ benchmarks. To replicate this Terminal-Bench (Terminus-2) evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
Source: Stratix evaluation 68fff48651dd2ed6199046b2. Updated 2025-10-28.