GPT-5 (high) on Terminal-Bench (Terminus-2): 42.5% accuracy

Author:

The LayerLens Team


The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.

Summary

GPT-5 (high) from OpenAI scored 42.5% on Terminal-Bench (Terminus-2), ranking 10th of 21 models on this benchmark. This places the model in the below-frontier band for Terminal-Bench (Terminus-2): acceptable for cost-sensitive workloads or as part of a multi-model ensemble, but not a default choice for high-stakes routing.

Model details

  • Provider: OpenAI

  • Model key: openai/gpt-5-high

  • Context length: 400,000 tokens

  • License: Proprietary

  • Open weights: no

Benchmark methodology

Benchmark goal: Evaluate the LLM's ability to solve a diverse set of programming and reasoning tasks.

Scoring metrics:

  • Score: Binary success (1) or failure (0) based on task completion and correctness.

  • Duration: Time taken to complete the task (in seconds).
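With binary per-task scores, the benchmark-level accuracy is simply the mean of the task scores expressed as a percentage. A minimal sketch of that aggregation (the `TaskResult` type and its field names are illustrative, not Stratix's actual schema):

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    task_id: str
    score: int       # binary: 1 = success, 0 = failure
    duration: float  # seconds taken to complete the task

def benchmark_accuracy(results: list[TaskResult]) -> float:
    """Mean of binary task scores, reported as a percentage."""
    if not results:
        return 0.0
    return 100.0 * sum(r.score for r in results) / len(results)

# e.g. 17 of 40 tasks solved yields 42.5% accuracy
results = [TaskResult(f"task-{i}", 1 if i < 17 else 0, 30.0) for i in range(40)]
print(f"{benchmark_accuracy(results):.1f}%")  # 42.5%
```

Duration is tracked alongside the score but does not affect accuracy; it is reported separately per task.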

Analysis

Key takeaways:

  • The LLM demonstrates proficiency in completing individual programming and environment-setup tasks but struggles with complex, multi-stage problems.

  • Planning and robustness in adapting to unseen dependencies remain significant challenges.

  • The model shows inconsistent performance in code generation tasks, failing to develop robust algorithms or adapt to environmental constraints.

Failure modes observed

Common failure modes:

  • Failure to follow complex instructions involving multiple tools and configurations.

  • Incomplete or incorrect code generation leading to non-functional solutions.

  • Inability to plan multi-step tasks toward the expected goal (e.g., creating a website profile from info.md).

  • Inability to re-engineer binaries with exactly the same behavior.

  • Dependency errors, such as failing to fix a broken fasttext installation.

Example: Failing to build a complete end-to-end system, such as setting up a Git server with deployment or creating a fully functional web scraper. In the Linux kernel build task, the agent fails to follow the custom configuration.

Secondary metrics

  • Readability score: 0.0

  • Toxicity score: 0.000

  • Ethics score: 0.000

Run this evaluation yourself

Stratix evaluates GPT-5 (high) continuously across 11+ benchmarks. To replicate this Terminal-Bench (Terminus-2) evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.

Source: Stratix evaluation 68fff48651dd2ed6199046b2. Updated 2025-10-28.