Gemini 3.1 Flash Lite Preview on Humanity's Last Exam: 8.5% accuracy

Author:

The LayerLens Team


The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.

Summary

Gemini 3.1 Flash Lite Preview from Google scored 8.5% on Humanity's Last Exam, placing it in the top 50 (rank 34 of 97) on this benchmark. That puts the model in the weak band for Humanity's Last Exam: below the threshold for production reliance on this benchmark family. Consider it only for narrow, fully tested tasks.

Model details

  • Provider: Google

  • Model key: google/gemini-3.1-flash-lite-preview

  • Context length: 1,048,576 tokens

  • License: Proprietary

  • Open weights: no

Benchmark methodology

Benchmark goal: Humanity's Last Exam evaluates frontier-level factual accuracy and reasoning in large language models using expert-written, closed-ended questions spanning mathematics, the natural sciences, the humanities, and other academic domains.

Analysis

Key takeaways:

  • Gemini 3.1 Flash Lite Preview exhibits strong explanatory capabilities but falls short on factual accuracy and precise problem-solving across diverse scientific domains.

  • The model's confidence scores do not reliably correlate with correctness, necessitating external verification for critical applications.

  • Significant improvements are needed in its ability to follow detailed instructions and perform accurate multi-step calculations.
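The confidence-calibration gap described above can be quantified with a standard check such as expected calibration error (ECE). The sketch below is illustrative only: the sample data is hypothetical and not drawn from the Stratix evaluation.

```python
# Illustrative sketch: does stated confidence track correctness?
# Sample data is hypothetical, not from the Stratix evaluation.

def expected_calibration_error(results, n_bins=5):
    """results: list of (confidence in [0, 1], correct as bool)."""
    bins = [[] for _ in range(n_bins)]
    for conf, correct in results:
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))
    total = len(results)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        # Weight each bin's confidence/accuracy gap by its share of samples.
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

# An overconfident model (high confidence, often wrong) yields a high ECE;
# a well-calibrated model yields an ECE near zero.
sample = [(0.95, False), (0.9, False), (0.9, True), (0.6, True), (0.3, False)]
print(round(expected_calibration_error(sample), 3))
```

A high ECE on a model's own confidence scores is exactly the situation where, as noted above, external verification is needed for critical applications.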

Failure modes observed

Common failure modes:

  • Factual errors and hallucinations despite confident explanations.

  • Misinterpretation of complex problem constraints or scenarios.

  • Inability to perform accurate numerical calculations when multiple parameters are involved.

Secondary metrics

  • Readability score: 0.0

  • Toxicity score: 0.000

  • Ethics score: 0.000

Run this evaluation yourself

Stratix evaluates Gemini 3.1 Flash Lite Preview continuously across 11+ benchmarks. To replicate this Humanity's Last Exam evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
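For intuition, turning per-question grades into a headline accuracy figure like the 8.5% above reduces to a simple ratio. The sketch below uses placeholder data; an actual run would use the Stratix harness and the real HLE question set.

```python
# Minimal sketch of converting per-question grades into a benchmark
# accuracy figure. The grades below are placeholders, not real results.

def accuracy(grades):
    """grades: iterable of booleans, one per exam question."""
    grades = list(grades)
    return 100.0 * sum(grades) / len(grades) if grades else 0.0

# Hypothetical run: 17 correct answers out of 200 questions.
grades = [True] * 17 + [False] * 183
print(f"{accuracy(grades):.1f}%")  # prints 8.5%
```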

_Source: Stratix evaluation 69a8a473e4df9545a81fbc6b. Updated 2026-03-04._