
Gemini 3.1 Pro Preview on Terminal-Bench (Terminus-1): 32.5% accuracy
Author:
The LayerLens Team
Last updated:
2026-02-19
The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.
Summary
Gemini 3.1 Pro Preview from Google scored 32.5% on Terminal-Bench (Terminus-1), placing it in the top 50 (rank 27 of 77). This puts the model in the weak band for Terminal-Bench (Terminus-1): it falls below the threshold for production reliance on this benchmark family and should be considered only for narrow, fully tested tasks.
Model details
Provider: Google
Model key: google/gemini-3.1-pro-preview
Context length: 1,048,576 tokens
License: Proprietary
Open weights: no
Benchmark methodology
Benchmark goal: To evaluate an LLM's ability to complete a variety of programming and reasoning tasks in a terminal environment, ranging from code generation and debugging to system configuration and complex problem-solving.
Scoring metrics:
score: Binary metric indicating success (1) or failure (0) for each prompt.
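For readers checking the arithmetic, the sketch below shows how per-task binary scores roll up into the headline accuracy figure. The record layout and task names are illustrative assumptions, not Stratix's actual schema.

```python
# Minimal sketch: rolling per-task binary scores up into an accuracy
# figure. Record layout and task names are illustrative, not the
# actual Stratix or Terminal-Bench schema.
results = [
    {"task": "git-hooks-setup", "score": 0},
    {"task": "maze-explore", "score": 1},
    {"task": "csv-to-parquet", "score": 0},
    {"task": "jupyter-secure-setup", "score": 0},
]

pass_rate = sum(r["score"] for r in results) / len(results)
print(f"Pass rate: {pass_rate:.1%}")  # 25.0% on this toy sample
```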
Analysis
Key takeaways:
The Gemini 3.1 Pro Preview model demonstrates foundational coding and debugging abilities, particularly when issues are localized and well-defined.
Its performance degrades significantly as task complexity increases, especially on tasks involving multi-step system configurations, external service interactions, and nuanced security protocols.
The model shows a clear weakness in understanding and implementing solutions for network-related tasks, blockchain interactions, and specific data science analytical requirements.
Tasks requiring iterative exploration and mapping (e.g., maze challenges) were generally successful, suggesting strength in certain kinds of algorithmic problem-solving when explicit interaction mechanisms are provided; a sketch of this style of search follows this list.
Overall, the model has a pass rate of approximately 38.5% on this benchmark, indicating considerable room for improvement in handling complex, real-world development and operational tasks.
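To illustrate the style of iterative exploration and mapping that the maze tasks reward, here is a minimal breadth-first-search sketch over a toy grid. The static maze encoding is a hypothetical stand-in: the benchmark itself drives exploration through an interactive terminal session.

```python
from collections import deque

# Illustrative only: the benchmark drives exploration through an
# interactive terminal session, not a static grid like this one.
MAZE = [
    "S.#.",
    ".##.",
    "...E",
]

def shortest_path_length(maze: list[str]) -> int | None:
    """Breadth-first search from 'S' to 'E'; '#' cells are walls."""
    rows, cols = len(maze), len(maze[0])
    start = next((r, c) for r in range(rows) for c in range(cols)
                 if maze[r][c] == "S")
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if maze[r][c] == "E":
            return dist
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and maze[nr][nc] != "#" and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None  # exit unreachable

print(shortest_path_length(MAZE))  # -> 5
```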
Failure modes observed
Common failure modes:
Inability to synthesize correct commands for complex multi-tool workflows (e.g., QEMU, OpenSSL, Git hooks).
Failure to correctly interpret or execute security-related tasks, including encryption, secure deletion, and password cracking.
Difficulty understanding and interacting with live network services or APIs (e.g., a Bitcoin/Solana service, Nginx advanced logging).
Errors in data-processing tasks requiring specific statistical calculations or complex data transformations (e.g., Raman peak fitting, token counting, CSV-to-Parquet conversion; a reference conversion is sketched below).
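For context on the data-transformation failures, a correct CSV-to-Parquet conversion is typically only a few lines. This is a reference sketch; the file names are placeholders, not the benchmark's actual fixtures.

```python
# A typical CSV-to-Parquet conversion, shown for reference only;
# the file names are placeholders, not the benchmark's fixtures.
# Requires: pip install pandas pyarrow
import pandas as pd

df = pd.read_csv("input.csv")
df.to_parquet("output.parquet", engine="pyarrow", index=False)
print(f"Wrote {len(df)} rows to output.parquet")
```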
Example: one notable failure is the 'Secure Jupyter Notebook server setup' task, where the model consistently failed to configure the server with SSL, password authentication, and the sample notebook. This points to difficulty orchestrating multiple system-level configurations and tools (Jupyter, OpenSSL) to meet detailed requirements; a reference configuration is sketched below.
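For contrast, the configuration such a task appears to demand is compact. The sketch below assumes a self-signed certificate generated beforehand with OpenSSL and uses Jupyter Server 2.x config names; the harness's exact requirements may differ.

```python
# Sketch of a jupyter_server_config.py covering the task's stated
# requirements (SSL + password auth). Paths, the password, and the
# pre-generated certificate are assumptions, not the benchmark's spec.
# Certificate generated beforehand, e.g.:
#   openssl req -x509 -nodes -newkey rsa:2048 -days 365 \
#       -keyout jupyter.key -out jupyter.pem -subj "/CN=localhost"
from jupyter_server.auth import passwd

# `c` is injected by Jupyter's config loader when this file is read.
c.ServerApp.certfile = "/etc/jupyter/jupyter.pem"          # TLS certificate
c.ServerApp.keyfile = "/etc/jupyter/jupyter.key"           # TLS private key
c.ServerApp.ip = "0.0.0.0"                                 # listen on all interfaces
c.ServerApp.open_browser = False
c.IdentityProvider.hashed_password = passwd("change-me")   # password auth
```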
Secondary metrics
Readability score: 0.0
Toxicity score: 0.000
Ethics score: 0.000
Run this evaluation yourself
Stratix evaluates Gemini 3.1 Pro Preview continuously across 11+ benchmarks. To replicate this Terminal-Bench (Terminus-1) evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
_Source: Stratix evaluation 69974350aedac45ce988d49d. Updated 2026-02-19._