
Gemini 3.1 Flash Lite Preview on Humanity's Last Exam: 8.5% accuracy
Author: The LayerLens Team
The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.
Summary
Gemini 3.1 Flash Lite Preview from Google scored 8.5% on Humanity's Last Exam, placing it in the top 50% (rank 34 of 97) on this benchmark. This places the model in the weak band for Humanity's Last Exam, below the threshold for production reliance on this benchmark family. Consider it only for narrow, fully tested tasks.
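To put the rank in context, the snippet below converts the reported position (34 of 97) into a leaderboard fraction. It is a minimal sketch: the helper name `rank_to_top_fraction` is hypothetical and not part of Stratix, and it assumes rank 1 is the best score.

```python
def rank_to_top_fraction(rank: int, total: int) -> float:
    """Return the share of the leaderboard at or above `rank` (assumes 1 = best)."""
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    return rank / total


# Values taken from the summary: rank 34 of 97.
top_fraction = rank_to_top_fraction(34, 97)
print(f"Top {top_fraction:.1%} of ranked entries")  # prints "Top 35.1% of ranked entries"
```

A rank in roughly the top third can still correspond to a low absolute score, as it does here, so the rank and the accuracy figure are best read together.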
Model details
Provider: Google
Model key: google/gemini-3.1-flash-lite-preview
Context length: 1,048,576 tokens
License: Proprietary
Open weights: no
Benchmark methodology
Benchmark goal: The benchmark evaluates the factual accuracy, reasoning capabilities, and instruction-following ability of large language models when answering expert-level questions across a broad range of academic subjects.
Analysis
Key takeaways:
Gemini 3.1 Flash Lite Preview exhibits strong explanatory capabilities but falls short on factual accuracy and precise problem-solving across diverse scientific domains.
The model's confidence scores do not reliably correlate with correctness, necessitating external verification for critical applications.
Significant improvements are needed in its ability to follow detailed instructions and perform accurate multi-step calculations.
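One common way to check whether confidence tracks correctness is a binned expected calibration error (ECE). The sketch below is illustrative only: it is not the metric Stratix necessarily uses, the function name and bin count are assumptions, and the sample data is made up rather than taken from this evaluation.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Binned ECE: weighted average gap between mean confidence and accuracy."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")

    # Group predictions into equal-width confidence bins over [0, 1].
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidence values must lie in [0, 1]")
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_conf - accuracy)
    return ece


# Hypothetical data: four answers given at 90% confidence, only one correct.
sample_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, False, False, False])
print(f"ECE: {sample_ece:.2f}")
```

A value near 0 means stated confidence matches observed accuracy. A large value, as in this made-up overconfident sample, is the pattern that calls for external verification.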
Failure modes observed
Common failure modes:
Factual errors and hallucinations despite confident explanations.
Misinterpretation of complex problem constraints or scenarios.
Inability to perform accurate numerical calculations when multiple parameters are involved.
Secondary metrics
Readability score: 0.0
Toxicity score: 0.000
Ethics score: 0.000
Run this evaluation yourself
Stratix evaluates Gemini 3.1 Flash Lite Preview continuously across 11+ benchmarks. To replicate this Humanity's Last Exam evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
_Source: Stratix evaluation 69a8a473e4df9545a81fbc6b. Updated 2026-03-04._