Gemini 3.1 Pro Preview on AIME 2025: 93.3% accuracy

Author: The LayerLens Team

The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.

Summary

Gemini 3.1 Pro Preview from Google scored 93.3% on AIME 2025, ranking 5th of 140 models on this benchmark and placing it in the saturated band. Most frontier models cluster near this ceiling, so cross-benchmark behavior matters more than the headline score for production decisions.
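For context on what 93.3% means in raw counts, a quick sketch, assuming the run covers the standard 30 problems of AIME 2025 (15 from each of the two exams) — an assumption, since the exact item count of this harness is not stated here:

```python
# Assumed item count: 30 problems (AIME 2025 I + II, 15 each).
total_problems = 30
reported_accuracy = 0.933

correct = round(reported_accuracy * total_problems)  # nearest whole answer count
print(correct, correct / total_problems)             # 28 correct -> ~93.3%
```

Under that assumption, the score corresponds to 28 of 30 final answers matching exactly.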

Model details

  • Provider: Google

  • Model key: google/gemini-3.1-pro-preview

  • Context length: 1,048,576 tokens

  • License: Proprietary

  • Open weights: no

Benchmark methodology

Benchmark goal: Evaluate the model's ability to solve complex mathematical problems typical of the American Invitational Mathematics Examination (AIME), requiring deep understanding of various mathematical concepts, logical deduction, and precise calculations.

Scoring metrics:

  • Score: Binary metric (0 or 1) indicating whether the final answer is correct.

  • Duration: Time taken by the model to generate the response (in seconds).

  • Input Tokens: Number of tokens in the input prompt.

  • Output Tokens: Number of tokens in the generated response.
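The per-item metrics above aggregate into the headline numbers in the obvious way: the binary scores average to accuracy, and duration and token counts average into cost/latency profiles. A minimal sketch, using an illustrative record layout (not Stratix's actual trace format):

```python
# Hypothetical per-problem evaluation records; field names are illustrative.
records = [
    {"score": 1, "duration_s": 42.0, "input_tokens": 310, "output_tokens": 5200},
    {"score": 1, "duration_s": 55.5, "input_tokens": 295, "output_tokens": 6100},
    {"score": 0, "duration_s": 61.2, "input_tokens": 330, "output_tokens": 7400},
]

# Accuracy is the mean of the 0/1 exact-match scores.
accuracy = sum(r["score"] for r in records) / len(records)
mean_duration = sum(r["duration_s"] for r in records) / len(records)
mean_output = sum(r["output_tokens"] for r in records) / len(records)

print(f"accuracy={accuracy:.1%} "
      f"mean_duration={mean_duration:.1f}s "
      f"mean_output_tokens={mean_output:.0f}")
```

Because each item is scored 0 or 1 on the final answer only, a derivation that is correct until a last-step arithmetic slip scores the same as a blank response — which is why the failure-mode analysis below distinguishes those cases even though the metric does not.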

Analysis

Key takeaways:

  • The Gemini 3.1 Pro Preview model demonstrates a strong aptitude for solving advanced mathematical problems, achieving a high success rate on a challenging AIME-level benchmark.

  • Its comprehensive, well-structured derivations are a significant strength, even on problems where a minor error makes the final answer incorrect.

  • Key areas for improvement involve more robust interpretation of complex geometric conditions, preventing minor calculation errors, and ensuring correct application of specialized mathematical theorems under specific problem constraints.

Failure modes observed

Common failure modes:

  • Misinterpretation of geometric conditions or properties.

  • Incorrect application of trigonometric identities or angle relationships when dealing with complex geometric setups.

  • Small calculation errors in the final steps after extensive correct work.

  • Misunderstanding of graph properties or other specialized mathematical concepts under specific problem conditions.

Secondary metrics

  • Readability score: 0.0

  • Toxicity score: 0.000

  • Ethics score: 0.000

Run this evaluation yourself

Stratix evaluates Gemini 3.1 Pro Preview continuously across 11+ benchmarks. To replicate this AIME 2025 evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.

Source: Stratix evaluation 699743335af7e0aa15943bfb. Updated 2026-02-19.