
Gemini 3.1 Flash Lite Preview on Terminal-Bench (Terminus-1): 17.5% accuracy
Author:
The LayerLens Team
The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.
Summary
Gemini 3.1 Flash Lite Preview from Google scored 17.5% on Terminal-Bench (Terminus-1), ranking 52nd of 77 models on this benchmark. This places the model in the weak band for Terminal-Bench (Terminus-1): below the threshold for production reliance on this benchmark family, so it should be considered only for narrow, fully tested tasks.
Model details
Provider: Google
Model key: google/gemini-3.1-flash-lite-preview
Context length: 1,048,576 tokens
License: Proprietary
Open weights: no
Benchmark methodology
Benchmark goal: Evaluate the model's ability to execute complex, multi-step instructions across various technical domains, including system configuration, code debugging, data processing, and security tasks. The benchmark assesses accurate command execution, logical problem-solving, and adherence to specified constraints.
Scoring metrics:
Accuracy: The proportion of questions for which the model's prediction is correct. For each question, 1 point is given for a correct answer, and 0 for an incorrect one. Accuracy = (Correct Answers / Total Questions).
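The scoring rule above can be sketched in a few lines. This is an illustrative helper, not Stratix's actual scoring code; the function name and inputs are assumptions for the example.

```python
def accuracy(predictions, references):
    """Accuracy = Correct Answers / Total Questions.

    Each question scores 1 point if the prediction matches the
    reference answer, 0 otherwise (per the benchmark's metric).
    """
    correct = sum(1 for p, r in zip(predictions, references) if p == r)
    return correct / len(references)

# Example: 2 of 4 answers correct -> 0.5
print(accuracy(["a", "b", "c", "d"], ["a", "x", "c", "y"]))  # 0.5
```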
Analysis
Key takeaways:
The model exhibits very low accuracy (15.38%) on complex, multi-step technical benchmarks.
It performs adequately on simple, atomic tasks but fails on challenges requiring deeper reasoning, planning, and dynamic interaction.
Key areas of weakness include system-level programming, network configuration, debugging complex dependencies, and stateful problem-solving.
Improvements are needed in understanding and executing multi-stage instructions, error recovery, and adapting to interactive environments to achieve higher performance on such benchmarks.
Failure modes observed
Common failure modes:
Misinterpreting complex instructions or missing implicit constraints.
Failing to correctly install or configure dependencies required for a task.
Generating code that is syntactically correct but functionally flawed for the given problem.
Lack of iterative debugging capability in dynamic environments, leading to single-shot failures.
Inability to adapt to feedback from command execution and correct subsequent steps.
Producing empty or incorrect output for file-based tasks.
Example: In tasks like 'You are placed in a blind maze exploration challenge', the model typically fails to develop a systematic exploration strategy, track its position, or accurately map the maze. Instead of producing a functional solution, it often emits boilerplate code or steps that perform no actual exploration or mapping, indicating weak dynamic reasoning and state management.
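For contrast, the kind of stateful strategy such a task calls for can be sketched as a depth-first search that tracks visited positions and builds a map incrementally. This is a hypothetical illustration, not the benchmark's actual task harness; the maze representation and `can_move` probe are assumptions for the example.

```python
# Toy maze: '#' is a wall, '.' is open. A blind agent would discover
# this layout by probing moves; here we query it directly for brevity.
MAZE = [
    "#####",
    "#...#",
    "#.#.#",
    "#...#",
    "#####",
]

def can_move(pos):
    """Return True if the cell at pos is open (a successful probe)."""
    r, c = pos
    return 0 <= r < len(MAZE) and 0 <= c < len(MAZE[0]) and MAZE[r][c] == "."

def explore(start):
    """Systematically map every reachable cell via depth-first search,
    tracking position state in an explicit visited set."""
    visited = {start}
    stack = [start]
    while stack:
        r, c = stack.pop()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (r + dr, c + dc)
            if nxt not in visited and can_move(nxt):
                visited.add(nxt)
                stack.append(nxt)
    return visited

print(len(explore((1, 1))))  # 8 reachable open cells
```

The essential ingredients the failing model omits are all present here: a termination condition, an explicit record of where the agent has been, and a loop that reacts to the outcome of each probe.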
Secondary metrics
Readability score: 0.0
Toxicity score: 0.000
Ethics score: 0.000
Run this evaluation yourself
Stratix evaluates Gemini 3.1 Flash Lite Preview continuously across 11+ benchmarks. To replicate this Terminal-Bench (Terminus-1) evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
Source: Stratix evaluation 69a8a478e4df9545a81fbc6e. Updated 2026-03-04.