Gemini 3.1 Flash Lite Preview on Terminal-Bench (Terminus-1): 17.5% accuracy

Author:

The LayerLens Team

The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.

Summary

Gemini 3.1 Flash Lite Preview from Google scored 17.5% on Terminal-Bench (Terminus-1), ranking 52nd of 77 models on this benchmark. That result falls in the weak band for Terminal-Bench (Terminus-1), below the threshold for production reliance on this benchmark family; consider the model only for narrow, fully tested tasks.

Model details

  • Provider: Google

  • Model key: google/gemini-3.1-flash-lite-preview

  • Context length: 1,048,576 tokens

  • License: Proprietary

  • Open weights: no

Benchmark methodology

Benchmark goal: Evaluate the model's ability to execute complex, multi-step instructions across various technical domains, including system configuration, code debugging, data processing, and security tasks. The benchmark assesses accurate command execution, logical problem-solving, and adherence to specified constraints.
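
To make the task format concrete, the sketch below shows the kind of check such a benchmark might run: the model's shell commands are executed in an isolated working directory and the resulting file state is verified against the task's constraints. The task, file names, and checker logic here are illustrative assumptions, not actual Terminal-Bench (Terminus-1) tasks.

```python
import subprocess
import tempfile
from pathlib import Path

# Hypothetical multi-step task: extract the ERROR lines from a log and write
# them, sorted, to errors.txt. The first command stands in for pre-seeded
# environment state; none of this mirrors a real Terminal-Bench task.
MODEL_COMMANDS = [
    "printf 'INFO ok\\nERROR disk full\\nERROR timeout\\n' > app.log",
    "grep '^ERROR' app.log | sort > errors.txt",
]

def run_task(commands: list[str]) -> bool:
    """Execute the commands in a scratch directory and verify the final state."""
    with tempfile.TemporaryDirectory() as workdir:
        for cmd in commands:
            result = subprocess.run(cmd, shell=True, cwd=workdir,
                                    capture_output=True, text=True)
            if result.returncode != 0:
                return False  # any failed step fails the whole task
        expected = "ERROR disk full\nERROR timeout\n"
        return (Path(workdir) / "errors.txt").read_text() == expected

print(run_task(MODEL_COMMANDS))  # True only if every step ran and the output matches
```

Scoring is binary per task: a run either satisfies the end-state check or it does not, which is what feeds the accuracy metric below.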

Scoring metrics:

  • Accuracy: The proportion of questions for which the model's prediction is correct. For each question, 1 point is given for a correct answer, and 0 for an incorrect one. Accuracy = (Correct Answers / Total Questions).
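
In code, that rule is simply the mean of per-task pass/fail outcomes. The sketch below assumes results arrive as a plain list of booleans rather than the benchmark's actual result format, and the 7-of-40 split is a hypothetical illustration of the arithmetic, not the real task counts.

```python
def accuracy(results: list[bool]) -> float:
    """Accuracy = correct answers / total questions, one point per solved task."""
    return sum(results) / len(results) if results else 0.0

# Hypothetical split: 7 solved tasks out of 40 gives 0.175, i.e. 17.5%.
print(accuracy([True] * 7 + [False] * 33))
```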

Analysis

Key takeaways:

  • The model exhibits very low accuracy (15.38%) on complex, multi-step technical benchmarks.

  • It performs adequately on simple, atomic tasks but fails on challenges requiring deeper reasoning, planning, and dynamic interaction.

  • Key areas of weakness include system-level programming, network configuration, debugging complex dependencies, and stateful problem-solving.

  • Improvements are needed in understanding and executing multi-stage instructions, error recovery, and adapting to interactive environments to achieve higher performance on such benchmarks.
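
As a rough illustration of what "error recovery" and "adapting to interactive environments" mean in practice, the loop below feeds a failed command's stderr back into a revision step and retries, rather than failing single-shot. The propose_fix hook is a hypothetical stand-in for a model call; it is not part of any published harness.

```python
import subprocess

def propose_fix(cmd: str, stderr: str) -> str | None:
    """Hypothetical hook: ask the model for a revised command given the failure output."""
    ...  # stub; returns None, i.e. no revision available

def run_with_recovery(commands: list[str], max_retries: int = 2) -> bool:
    """Run steps in order, retrying a failed step with model-proposed corrections."""
    for cmd in commands:
        for _ in range(max_retries + 1):
            result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
            if result.returncode == 0:
                break  # step succeeded, move to the next one
            revised = propose_fix(cmd, result.stderr)
            if revised is None:
                return False  # nothing left to try, give up on the task
            cmd = revised
        else:
            return False  # retries exhausted without success
    return True
```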

Failure modes observed

Common failure modes:

  • Misinterpreting complex instructions or missing implicit constraints.

  • Failing to correctly install or configure dependencies required for a task.

  • Generating code that is syntactically correct but functionally flawed for the given problem.

  • Lack of iterative debugging capability in dynamic environments, leading to single-shot failures.

  • Inability to adapt to feedback from command execution and correct subsequent steps.

  • Producing empty or incorrect output for file-based tasks.

Example: In tasks like 'You are placed in a blind maze exploration challenge', the model typically fails to develop a systematic exploration strategy, track its position, or accurately map the maze. Instead of producing a functional solution, it often outputs boilerplate code or steps that do not lead to exploration or mapping, indicating a lack of dynamic reasoning and state management.
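
For contrast, a workable approach to that kind of task has to maintain exactly the state the model drops: a current position, a set of visited cells, and a systematic traversal order. The sketch below is a generic depth-first exploration over a hypothetical can_move(cell, direction) probe; the interface and the free-probing assumption are simplifications for illustration, not the challenge's real API.

```python
def explore(start: tuple[int, int], can_move) -> set[tuple[int, int]]:
    """Depth-first exploration that tracks position and accumulates a map of cells.

    Assumes `can_move(cell, direction)` can be queried from any visited cell; a
    truly blind agent would also have to physically backtrack along its path.
    """
    directions = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
    visited = {start}
    stack = [start]
    while stack:
        x, y = stack.pop()
        for name, (dx, dy) in directions.items():
            nxt = (x + dx, y + dy)
            if nxt not in visited and can_move((x, y), name):
                visited.add(nxt)
                stack.append(nxt)
    return visited  # every cell reached: the explored portion of the maze
```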

Secondary metrics

  • Readability score: 0.0

  • Toxicity score: 0.000

  • Ethics score: 0.000

Run this evaluation yourself

Stratix evaluates Gemini 3.1 Flash Lite Preview continuously across 11+ benchmarks. To replicate this Terminal-Bench (Terminus-1) evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.

Source: Stratix evaluation 69a8a478e4df9545a81fbc6e. Updated 2026-03-04.