Gemini 3.1 Flash Lite Preview on AIME 2025: 30.0% accuracy

Author:

The LayerLens Team


The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.

Summary

Gemini 3.1 Flash Lite Preview from Google scored 30.0 on AIME 2025, ranking 88th of 140 on this benchmark. This score falls in the weak band for AIME 2025, below the threshold for production reliance on this benchmark family; consider the model only for narrow, fully tested tasks.

Model details

  • Provider: Google

  • Model key: google/gemini-3.1-flash-lite-preview

  • Context length: 1,048,576 tokens

  • License: Proprietary

  • Open weights: no

Benchmark methodology

Benchmark goal: Correctly solve advanced mathematical problems in a contest setting. Models must strictly follow the benchmark's answer-formatting rules so that outputs remain parseable and scorable.
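AIME answers are integers from 0 to 999, so scoring typically reduces to extracting a final integer from the model's output. Stratix's exact parsing rules are not published here, so the function name and regex below are illustrative; this is a minimal sketch of what such an extractor might look like:

```python
import re

def extract_answer(output: str):
    """Return the model's final integer answer, or None if nothing parses.

    Prefers the last \\boxed{...} expression (a common answer format),
    then falls back to the last standalone 1-3 digit integer.
    """
    boxed = re.findall(r"\\boxed\{(\d{1,3})\}", output)
    if boxed:
        return int(boxed[-1])
    numbers = re.findall(r"\b\d{1,3}\b", output)
    return int(numbers[-1]) if numbers else None
```

An output that fails both patterns scores 0 regardless of the underlying reasoning, which is why strict answer formatting matters for this benchmark.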

Scoring metrics:

  • Score: Binary indicator of whether the model's answer matches the ground truth (1 for correct, 0 for incorrect).

  • Duration: Time taken by the model to generate the response (in seconds).

  • Accuracy: (Number of correctly solved problems / Total number of problems) * 100
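The accuracy metric above is a simple aggregation of the per-problem binary scores. A minimal sketch (the helper name is ours, not Stratix's):

```python
def accuracy(scores):
    """Accuracy as a percentage, given per-problem binary scores (1 or 0)."""
    if not scores:
        return 0.0
    return 100.0 * sum(scores) / len(scores)

# Example: 5 correct out of 19 problems.
print(round(accuracy([1] * 5 + [0] * 14), 2))  # 26.32
```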

Analysis

Key takeaways:

  • The Gemini 3.1 Flash Lite Preview model achieved a low accuracy of 26.32% on this advanced mathematics benchmark, correctly solving only 5 out of 19 problems.

  • The model performs better on problems that can be solved through direct application of formulas or relatively simple algebraic isolations.

  • Its performance significantly degrades with increased problem complexity, interdisciplinary requirements, or sophisticated combinatorial counting.

  • There is a clear need for improvement in handling multi-step reasoning, precise application of mathematical theorems, and preventing cascading calculation errors for these types of challenges.

Failure modes observed

Common failure modes:

  • Incorrect intermediate calculations, leading to errors in the final answer.

  • Misinterpretation of problem constraints or geometric descriptions.

  • Algebraic errors, especially when dealing with complex expressions or multiple variables.

  • Over-simplification of complex counting or probability scenarios.

Secondary metrics

  • Readability score: 0.0

  • Toxicity score: 0.000

  • Ethics score: 0.000

Run this evaluation yourself

Stratix evaluates Gemini 3.1 Flash Lite Preview continuously across 11+ benchmarks. To replicate this AIME 2025 evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.

_Source: Stratix evaluation 69a743435a24fc3a34525ba6. Updated 2026-03-03._