
Gemini 3.1 Pro Preview on AIME 2025: 93.3% accuracy
Author:
The LayerLens Team
The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.
Summary
Gemini 3.1 Pro Preview from Google scored 93.3% on AIME 2025, ranking 5th of 140 models and placing it in the top 10 on this benchmark. That score falls in the saturated band for AIME 2025: most frontier models cluster near this ceiling, so cross-benchmark behavior matters more than the headline number for production decisions.
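Assuming AIME 2025 follows the standard 30-problem format (two 15-problem sessions), the 93.3% figure corresponds to 28 correct final answers; a minimal sanity check:

```python
# Assumption: the benchmark uses the standard 30-problem AIME set.
correct, total = 28, 30
print(round(100 * correct / total, 1))  # 93.3
```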
Model details
Provider: Google
Model key: google/gemini-3.1-pro-preview
Context length: 1,048,576 tokens
License: Proprietary
Open weights: no
Benchmark methodology
Benchmark goal: Evaluate the model's ability to solve complex mathematical problems typical of the American Invitational Mathematics Examination (AIME), requiring deep understanding of various mathematical concepts, logical deduction, and precise calculations.
Scoring metrics:
Score: Binary metric (0 or 1) indicating whether the final answer is correct.
Duration: Time taken by the model to generate the response (in seconds).
Input Tokens: Number of tokens in the input prompt.
Output Tokens: Number of tokens in the generated response.
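The per-problem metrics above roll up into the headline accuracy by simple averaging of the binary scores. The sketch below illustrates that aggregation; it is not Stratix's actual code, and the record fields and example values are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class EvalRecord:
    score: int           # binary metric: 1 if the final answer is correct, else 0
    duration_s: float    # time to generate the response, in seconds
    input_tokens: int    # tokens in the input prompt
    output_tokens: int   # tokens in the generated response

def accuracy(records: list[EvalRecord]) -> float:
    """Mean of the binary scores, reported as a percentage."""
    return 100.0 * sum(r.score for r in records) / len(records)

# Hypothetical run: 28 correct answers out of 30 AIME-style problems.
records = ([EvalRecord(1, 40.0, 600, 2500)] * 28
           + [EvalRecord(0, 55.0, 600, 3100)] * 2)
print(round(accuracy(records), 1))  # 93.3
```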
Analysis
Key takeaways:
The Gemini 3.1 Pro Preview model demonstrates a strong aptitude for solving advanced mathematical problems, achieving a high success rate on a challenging AIME-level benchmark.
Its comprehensive, well-structured derivations are a significant strength; even when the final answer is incorrect, the underlying reasoning is typically sound apart from minor errors.
Key areas for improvement involve more robust interpretation of complex geometric conditions, preventing minor calculation errors, and ensuring correct application of specialized mathematical theorems under specific problem constraints.
Failure modes observed
Common failure modes:
Misinterpretation of geometric conditions or properties.
Incorrect application of trigonometric identities or angle relationships when dealing with complex geometric setups.
Small calculation errors in the final steps after extensive correct work.
Misunderstanding of graph properties or specialized mathematical concepts under specific problem conditions.
Secondary metrics
Readability score: 0.0
Toxicity score: 0.000
Ethics score: 0.000
Run this evaluation yourself
Stratix evaluates Gemini 3.1 Pro Preview continuously across 11+ benchmarks. To replicate this AIME 2025 evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
Source: Stratix evaluation 699743335af7e0aa15943bfb. Updated 2026-02-19.