Gemini 3.1 Flash Lite Preview on Humanity's Last Exam: 8.5% accuracy

Author:

The LayerLens Team


The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.

Summary

Gemini 3.1 Flash Lite Preview from Google scored 8.5% on Humanity's Last Exam, placing it in the top 50 (rank 34 of 97) on this benchmark. That puts the model in the weak band for Humanity's Last Exam: below the threshold for production reliance on this benchmark family. Consider it only for narrow, fully tested tasks.

Model details

  • Provider: Google

  • Model key: google/gemini-3.1-flash-lite-preview

  • Context length: 1,048,576 tokens

  • License: Proprietary

  • Open weights: no

Benchmark methodology

Benchmark goal: Humanity's Last Exam evaluates frontier-level factual accuracy and reasoning in large language models using expert-written, closed-ended questions spanning mathematics, the natural sciences, the humanities, and other academic domains.

Analysis

Key takeaways:

  • Gemini 3.1 Flash Lite Preview exhibits strong explanatory capabilities but falls short on factual accuracy and precise problem-solving across diverse scientific domains.

  • The model's confidence scores do not reliably correlate with correctness, necessitating external verification for critical applications.

  • Significant improvements are needed in its ability to follow detailed instructions and perform accurate multi-step calculations.
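The confidence-calibration gap described above can be quantified with a standard check such as expected calibration error (ECE). The sketch below is illustrative only: the sample data is hypothetical and not drawn from the Stratix evaluation.

```python
# Illustrative sketch: does stated confidence track correctness?
# Sample data is hypothetical, not from the Stratix evaluation.

def expected_calibration_error(results, n_bins=5):
    """results: list of (confidence in [0, 1], correct as bool)."""
    bins = [[] for _ in range(n_bins)]
    for conf, correct in results:
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))
    total = len(results)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        # Weight each bin's confidence/accuracy gap by its share of samples.
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

# An overconfident model (high confidence, often wrong) yields a high ECE;
# a well-calibrated model yields an ECE near zero.
sample = [(0.95, False), (0.9, False), (0.9, True), (0.6, True), (0.3, False)]
print(round(expected_calibration_error(sample), 3))
```

A high ECE on a model's own confidence scores is exactly the situation where, as noted above, external verification is needed for critical applications.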

Failure modes observed

Common failure modes:

  • Factual errors and hallucinations despite confident explanations.

  • Misinterpretation of complex problem constraints or scenarios.

  • Inability to perform accurate numerical calculations when multiple parameters are involved.

Secondary metrics

  • Readability score: 0.0

  • Toxicity score: 0.000

  • Ethics score: 0.000

Run this evaluation yourself

Stratix evaluates Gemini 3.1 Flash Lite Preview continuously across 11+ benchmarks. To replicate this Humanity's Last Exam evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
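For intuition, turning per-question grades into a headline accuracy figure like the 8.5% above reduces to a simple ratio. The sketch below uses placeholder data; an actual run would use the Stratix harness and the real HLE question set.

```python
# Minimal sketch of converting per-question grades into a benchmark
# accuracy figure. The grades below are placeholders, not real results.

def accuracy(grades):
    """grades: iterable of booleans, one per exam question."""
    grades = list(grades)
    return 100.0 * sum(grades) / len(grades) if grades else 0.0

# Hypothetical run: 17 correct answers out of 200 questions.
grades = [True] * 17 + [False] * 183
print(f"{accuracy(grades):.1f}%")  # prints 8.5%
```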

_Source: Stratix evaluation 69a8a473e4df9545a81fbc6b. Updated 2026-03-04._