
GPT-5 on AIME 2025: 96.7% accuracy
Author: The LayerLens Team
The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.
Summary
GPT-5 from OpenAI scored 96.7% on AIME 2025, ranking first of the 140 models evaluated on this benchmark and placing it in the saturated band for AIME 2025. Most frontier models cluster near this ceiling, so cross-benchmark behavior matters more than the headline score for production decisions.
Model details
Provider: OpenAI
Model key: openai/gpt-5
Context length: 400,000 tokens
License: Proprietary
Open weights: no
Benchmark methodology
Benchmark goal: Evaluate the model's ability to solve advanced high school mathematics competition problems from the AIME 2025 exam.
Scoring metrics:
Score: Binary indicator of whether the model's answer matches the ground truth. 1 for correct, 0 for incorrect.
Toxicity: The measured toxicity score of the response.
Duration: The time taken in seconds to produce the answer.
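For concreteness, here is a minimal sketch of how such a binary exact-match score can be computed. The `extract_final_answer` and `grade` helpers below are our own illustrations, not Stratix internals; AIME answers are integers from 0 to 999, which keeps the extraction simple.

```python
import re

def extract_final_answer(response: str) -> str:
    """Pull the last integer-like token from a model response.

    AIME answers are integers from 0 to 999, so for illustration a
    simple regex over the response text is often enough.
    """
    matches = re.findall(r"\d+", response)
    return matches[-1] if matches else ""

def grade(response: str, ground_truth: str) -> int:
    """Binary score: 1 if the extracted answer matches the truth exactly."""
    return int(extract_final_answer(response) == ground_truth.strip())

# A correct answer scores 1; anything else scores 0.
assert grade("The answer is 336.", "336") == 1
assert grade("The answer is 335.", "336") == 0
```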
Analysis
Key takeaways:
The model demonstrates strong mathematical reasoning capabilities on the AIME 2025 dataset.
Overall accuracy is high, but there is room for improvement in adhering to specific output-format requirements (seen in one case below) and in minimizing toxicity in code explanations.
The model performs consistently well across different AIME problem types and difficulty levels, showing no major weaknesses.
Failure modes observed
Common failure modes:
Minor formatting error leading to an incorrect evaluation, even when the underlying answer was conceptually correct.
Code-explanation outputs can carry a non-zero, though very low, toxicity score.
Example: In one instance, the model answered '336' while the ground truth was '336 degrees', producing a score of 0 even though the numeric value was correct. This highlights the model's sensitivity to strict output-format requirements under exact-match grading.
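A common mitigation for this failure mode is to normalize both the model's answer and the ground truth before comparison. The sketch below is illustrative only; the `normalize` helper and its unit list are our assumptions, not the grader used in this run.

```python
import re

UNIT_WORDS = {"degrees", "degree", "cm", "units"}  # illustrative unit list

def normalize(answer: str) -> str:
    """Lowercase, drop known unit words, and keep only the numeric core."""
    tokens = [t for t in answer.lower().split() if t not in UNIT_WORDS]
    core = " ".join(tokens)
    match = re.search(r"-?\d+(?:\.\d+)?", core)
    return match.group(0) if match else core.strip()

# '336' and '336 degrees' now compare equal, avoiding the spurious 0 score.
assert normalize("336") == normalize("336 degrees")
```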
Secondary metrics
Readability score: 2.7
Toxicity score: 0.000
Ethics score: 0.000
Run this evaluation yourself
Stratix evaluates GPT-5 continuously across 11+ benchmarks. To replicate this AIME 2025 evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
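If you would rather script a quick replication outside Stratix, a minimal harness might look like the following. The dataset file name, its 'problem' and 'answer' fields, and the prompt wording are assumptions for illustration; the OpenAI Python SDK calls are standard, but Stratix's own harness will differ.

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def solve(problem: str) -> str:
    """Ask the model for an AIME answer; the prompt format is illustrative."""
    reply = client.chat.completions.create(
        model="gpt-5",
        messages=[{
            "role": "user",
            "content": f"{problem}\n\nGive only the final integer answer.",
        }],
    )
    return reply.choices[0].message.content.strip()

# Hypothetical dataset file: one JSON object per line with
# 'problem' and 'answer' fields.
with open("aime_2025.jsonl") as f:
    items = [json.loads(line) for line in f]

correct = sum(solve(item["problem"]) == item["answer"] for item in items)
print(f"Accuracy: {correct / len(items):.1%}")
```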
Source: Stratix evaluation 689a1102d0c5ccd366b87d03. Updated 2025-08-11.