Gemini 3.1 Pro Preview on Humanity's Last Exam: 40.6% accuracy

Author: The LayerLens Team

The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.

Summary

Gemini 3.1 Pro Preview from Google scored 40.6% on Humanity's Last Exam, placing it first of the 97 models evaluated on this benchmark. In absolute terms, that accuracy still falls in the below-frontier band for Humanity's Last Exam: acceptable for cost-sensitive workloads or as part of a multi-model ensemble, but not a default choice for high-stakes routing.
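
The routing caveat is easier to see with a concrete pattern. Below is a minimal sketch of confidence-threshold fallback, one common way to use a model as part of an ensemble rather than as the sole handler for high-stakes requests; `primary`, `fallback`, and the 75.0 threshold are hypothetical placeholders, not part of any Google or Stratix API.

```python
def route(question: str, primary, fallback, min_confidence: float = 75.0) -> str:
    """Ask the primary model first; escalate when its self-reported confidence is low."""
    answer, confidence = primary(question)   # hypothetical client returning (answer, confidence 0-100)
    if confidence < min_confidence:
        answer, _ = fallback(question)       # defer to a second model for low-confidence cases
    return answer
```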

Model details

  • Provider: Google

  • Model key: google/gemini-3.1-pro-preview

  • Context length: 1,048,576 tokens

  • License: Proprietary

  • Open weights: no

Benchmark methodology

Benchmark goal: evaluate the model's ability to perform complex reasoning across a diverse set of biomedical and mathematical tasks that demand deep understanding and precise problem solving.

Scoring metrics:

  • Score: Binary (1 for correct, 0 for incorrect) assessment of each answer.

  • Confidence: Model's self-assessed confidence (0-100%) for its answer.

  • Duration: Time taken by the model to generate the response.
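
To make these metric definitions concrete, here is a minimal sketch of how per-item records with these three fields could be rolled up into the headline accuracy figure; the `ItemResult` shape and the `aggregate` helper are illustrative assumptions, not the Stratix implementation.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class ItemResult:
    """One graded exam item: binary correctness, self-reported confidence, response time."""
    score: int          # 1 if the answer was judged correct, 0 otherwise
    confidence: float   # model's self-assessed confidence, 0-100
    duration_s: float   # seconds taken to generate the response

def aggregate(results: list[ItemResult]) -> dict[str, float]:
    """Roll per-item records up into benchmark-level summary statistics."""
    return {
        "accuracy_pct": 100.0 * mean(r.score for r in results),
        "mean_confidence": mean(r.confidence for r in results),
        "mean_duration_s": mean(r.duration_s for r in results),
    }

# Example: one correct and one incorrect item -> 50.0% accuracy
print(aggregate([ItemResult(1, 92.0, 14.2), ItemResult(0, 61.0, 22.8)]))
```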

Analysis

Key takeaways:

  • Gemini 3.1 Pro Preview demonstrates advanced symbolic reasoning, particularly in complex mathematical domains, but is highly sensitive to the exact wording of a problem and to numerical precision.

  • The model exhibits a strong understanding of foundational concepts across scientific disciplines, often providing detailed and accurate explanations.

  • The current pass rate indicates a need for enhanced robustness in handling complex constraint satisfaction problems.

  • Further development should focus on improving visual data processing, disambiguating subtle linguistic traps, and ensuring computational accuracy in multi-digit numerical calculations.

Failure modes observed

Common failure modes:

  • Misinterpretation of visual data leading to incorrect classification or identification from images.

  • Typographical/transcription errors in problem formulation, where the model often correctly solves the problem as written but the provided ground truth implicitly assumes a corrected version.

  • Direct numerical calculation errors in multi-step math problems, even when the method is sound (see the verification sketch after this list).

  • Overly literal deduction from specific problem constraints that may be intended as distractors rather than genuine information.

  • The model sometimes correctly identifies a family of solutions but fails to select the precise value requested.
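
Because calculation errors show up even when the method is sound, one mitigation when grading is to compare the model's final number against the ground truth using exact arithmetic plus a small relative tolerance, rather than string equality. The sketch below is illustrative; the parsing heuristic and tolerance are assumptions, not the grading logic used in this evaluation.

```python
from fractions import Fraction
import math
import re

def extract_number(text: str) -> Fraction | None:
    """Pull the last decimal or fractional literal out of a free-form answer (illustrative heuristic)."""
    matches = re.findall(r"-?\d+(?:/\d+)?(?:\.\d+)?", text.replace(",", ""))
    if not matches:
        return None
    token = matches[-1]
    if "/" in token:
        num, den = token.split("/")
        return Fraction(int(num), int(den))
    return Fraction(token)

def numerically_correct(model_answer: str, ground_truth: str, rel_tol: float = 1e-6) -> bool:
    """Compare final numeric values instead of raw strings, absorbing formatting differences."""
    got, want = extract_number(model_answer), extract_number(ground_truth)
    if got is None or want is None:
        return False
    return got == want or math.isclose(float(got), float(want), rel_tol=rel_tol)

print(numerically_correct("The total is 1,024.50", "1024.5"))   # True: formatting differs, value matches
print(numerically_correct("Answer: 3/7", "0.428571428571"))     # True within tolerance
```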

Secondary metrics

  • Readability score: 0.0

  • Toxicity score: 0.000

  • Ethics score: 0.000

Run this evaluation yourself

Stratix evaluates Gemini 3.1 Pro Preview continuously across 11+ benchmarks. To replicate this Humanity's Last Exam evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
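
For readers who prefer to script a replication outside the Stratix UI, here is a minimal sketch of the evaluation loop, assuming you already have the benchmark items plus a `query_model` callable for your endpoint and a `judge` grading function; all three are placeholders for this illustration, not part of the Stratix API.

```python
import time

def run_eval(items, query_model, judge):
    """Query the model on each item, grade the answer, and record score, confidence, and duration."""
    results = []
    for item in items:
        start = time.perf_counter()
        answer, confidence = query_model(item["question"])      # placeholder for your client code
        results.append({
            "id": item["id"],
            "score": int(judge(answer, item["ground_truth"])),  # binary: 1 correct, 0 incorrect
            "confidence": confidence,                           # model's self-reported 0-100
            "duration_s": time.perf_counter() - start,
        })
    accuracy_pct = 100.0 * sum(r["score"] for r in results) / len(results)
    return results, accuracy_pct
```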

Source: Stratix evaluation 699743445af7e0aa15943bfe. Updated 2026-02-19.