
Gemini 3.1 Flash Lite Preview on Terminal-Bench (Terminus-2): 17.5% accuracy
Author:
The LayerLens Team
The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.
Summary
Gemini 3.1 Flash Lite Preview from Google scored 17.5% on Terminal-Bench (Terminus-2), ranking 19th of 21 models on this benchmark. This places the model in the weak band for Terminal-Bench (Terminus-2): below the threshold for production reliance on this benchmark family. Consider it only for narrow, fully tested tasks.
Model details
Provider: Google
Model key:
google/gemini-3.1-flash-lite-preview
Context length: 1,048,576 tokens
License: Proprietary
Open weights: no
Benchmark methodology
Benchmark goal: To evaluate the model's ability to understand and execute complex coding and system administration tasks, debug issues, and generate accurate solutions across various domains.
Scoring metrics:
Score: Binary score (1 for successful completion, 0 for failure) for each task.
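Under this metric, the benchmark-level accuracy is simply the mean of the per-task binary outcomes. A minimal sketch of that aggregation (task names and results are illustrative, not taken from the actual evaluation run):

```python
# Hypothetical sketch: aggregating binary per-task scores into an accuracy.
# Task names and outcomes are illustrative, not from the real Terminal-Bench run.
task_results = {
    "nginx-setup": 0,     # 1 = task completed successfully, 0 = failure
    "git-server": 0,
    "fix-regex-bug": 1,
    "maze-explore": 0,
}

# Benchmark accuracy is the fraction of tasks scored 1, expressed as a percent.
accuracy = 100 * sum(task_results.values()) / len(task_results)
print(f"Accuracy: {accuracy:.1f}%")  # 1 success out of 4 tasks -> 25.0%
```

With 21 models scored this way, a 17.5% accuracy means the model completed roughly one in six tasks end to end.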
Analysis
Key takeaways:
The Gemini 3.1 Flash Lite Preview demonstrates limited capability in complex, multi-step technical problem-solving and code generation tasks.
Its performance is notably better on diagnostic or single-step code modification tasks than on proactive system setup or algorithmic implementations.
Significant improvements are needed in understanding and interacting with operating system environments, debugging complex codebases, and generating robust, functional solutions.
Failure modes observed
Common failure modes:
Inability to generate executable code that achieves the desired outcome.
Lack of understanding of shell environments and command execution flow, leading to incorrect or incomplete commands.
Failure to correctly debug and fix provided code snippets, often missing the core issue or proposing ineffective solutions.
Issues with file system interactions, such as creating files in specified locations or with specific content.
Poor performance on tasks requiring continuous interaction or state tracking, like the maze exploration tasks.
Example: One recurring failure is seen in tasks involving system configuration (e.g., Git server, Nginx, Jupyter setup). The model often provides plausible-looking commands but they either contain subtle errors, are not executed in the correct sequence, or fail to account for environmental specifics, leading to non-functional setups.
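The sequencing failures described above, where commands only work if every earlier step succeeded, can be illustrated with a minimal, hypothetical harness. The step names below are stand-ins (modeled loosely on an Nginx-style validate-before-reload flow), not commands from the evaluation traces:

```python
# Hypothetical sketch of why command ordering matters in setup tasks.
# Step names are illustrative, not taken from the evaluation traces.
def run_setup(steps):
    """Execute setup steps in order; abort at the first failure.

    Each step is a (name, action) pair where action() returns True on success.
    This mirrors the discipline these tasks require: a later step (e.g. a
    service reload) must not run if an earlier one (e.g. config validation)
    failed.
    """
    completed = []
    for name, action in steps:
        if not action():
            return completed, False  # stop: later steps depend on this one
        completed.append(name)
    return completed, True

# Example: the "reload" step must not run because validation fails.
steps = [
    ("write-config", lambda: True),
    ("validate-config", lambda: False),  # simulated 'nginx -t' failure
    ("reload-service", lambda: True),
]
done, ok = run_setup(steps)
print(done, ok)  # ['write-config'] False
```

A model that emits all three commands unconditionally, as the failure traces suggest, produces a non-functional setup even when each individual command looks plausible.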
Secondary metrics
Readability score: 0.0
Toxicity score: 0.000
Ethics score: 0.000
Run this evaluation yourself
Stratix evaluates Gemini 3.1 Flash Lite Preview continuously across 11+ benchmarks. To replicate this Terminal-Bench (Terminus-2) evaluation on your own model or traces, or with a different benchmark configuration, open the model in Stratix.
Source: Stratix evaluation 69a743435a24fc3a34525ba5. Updated 2026-03-03.