GPT-5 (high) on Terminal-Bench (Terminus-2): 42.5% accuracy

Author:

The LayerLens Team


The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.

Summary

GPT-5 (high) from OpenAI scored 42.5% on Terminal-Bench (Terminus-2), ranking 10th of 21 models on this benchmark. This places the model in the below-frontier band for Terminal-Bench (Terminus-2): acceptable for cost-sensitive workloads or as part of a multi-model ensemble, but not a default choice for high-stakes routing.

Model details

  • Provider: OpenAI

  • Model key: openai/gpt-5-high

  • Context length: 400,000 tokens

  • License: Proprietary

  • Open weights: no

Benchmark methodology

Benchmark goal: Evaluate the LLM's ability to solve a diverse set of programming and reasoning tasks.

Scoring metrics:

  • Score: Binary success (1) or failure (0) based on task completion and correctness.

  • Duration: Time taken to complete the task (in seconds).
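With binary per-task scores, the benchmark-level accuracy is simply the mean of the task scores expressed as a percentage. A minimal sketch of that aggregation (the `TaskResult` type and its field names are illustrative, not Stratix's actual schema):

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    task_id: str
    score: int       # binary: 1 = success, 0 = failure
    duration: float  # seconds taken to complete the task

def benchmark_accuracy(results: list[TaskResult]) -> float:
    """Mean of binary task scores, reported as a percentage."""
    if not results:
        return 0.0
    return 100.0 * sum(r.score for r in results) / len(results)

# e.g. 17 of 40 tasks solved yields 42.5% accuracy
results = [TaskResult(f"task-{i}", 1 if i < 17 else 0, 30.0) for i in range(40)]
print(f"{benchmark_accuracy(results):.1f}%")  # 42.5%
```

Duration is tracked alongside the score but does not affect accuracy; it is reported separately per task.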

Analysis

Key takeaways:

  • The LLM demonstrates proficiency in completing individual programming and environment-setup tasks but struggles with complex, multi-stage problems.

  • Planning and robustness in adapting to unseen dependencies remain significant challenges.

  • The model shows inconsistent performance in code generation tasks, failing to develop robust algorithms or adapt to environmental constraints.

Failure modes observed

Common failure modes:

  • Failure to follow complex instructions involving multiple tools and configurations.

  • Incomplete or incorrect code generation leading to non-functional solutions.

  • Inability to plan multi-step tasks toward the expected goal (e.g., creating a website profile from info.md).

  • Inability to re-engineer binaries with exactly the same behavior.

  • Dependency errors, such as failing to fix a broken fasttext installation.

Example: Failing to build a complete end-to-end system, such as setting up a Git server with deployment or creating a fully functional web scraper. In the Linux kernel build task, the agent fails to follow the custom configuration.

Secondary metrics

  • Readability score: 0.0

  • Toxicity score: 0.000

  • Ethics score: 0.000

Run this evaluation yourself

Stratix evaluates GPT-5 (high) continuously across 11+ benchmarks. To replicate this Terminal-Bench (Terminus-2) evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.

Source: Stratix evaluation 68fff48651dd2ed6199046b2. Updated 2025-10-28.