
GPT-5 on AIME 2025: 96.7% accuracy
Author: The LayerLens Team
The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.
Summary
GPT-5 from OpenAI scored 96.7% on AIME 2025, ranking first of the 140 models evaluated on this benchmark and placing it in the saturated band for AIME 2025. Most frontier models cluster near this ceiling, so cross-benchmark behavior matters more than the headline score for production decisions.
Model details
Provider: OpenAI
Model key: openai/gpt-5
Context length: 400,000 tokens
License: Proprietary
Open weights: no
Benchmark methodology
Benchmark goal: Evaluate the model's ability to solve advanced high school mathematics competition problems from the AIME 2025 exam.
Scoring metrics:
Score: Binary indicator of whether the model's answer matches the ground truth. 1 for correct, 0 for incorrect.
Toxicity: The measured toxicity score of the response.
Duration: The time taken in seconds to produce the answer.
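For concreteness, here is a minimal sketch of how such a binary exact-match score can be computed. The `extract_final_answer` and `grade` helpers below are our own illustrations, not Stratix internals; AIME answers are integers from 0 to 999, which keeps the extraction simple.

```python
import re

def extract_final_answer(response: str) -> str:
    """Pull the last integer-like token from a model response.

    AIME answers are integers from 0 to 999, so for illustration a
    simple regex over the response text is often enough.
    """
    matches = re.findall(r"\d+", response)
    return matches[-1] if matches else ""

def grade(response: str, ground_truth: str) -> int:
    """Binary score: 1 if the extracted answer matches the truth exactly."""
    return int(extract_final_answer(response) == ground_truth.strip())

# A correct answer scores 1; anything else scores 0.
assert grade("The answer is 336.", "336") == 1
assert grade("The answer is 335.", "336") == 0
```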
Analysis
Key takeaways:
The model demonstrates strong mathematical reasoning capabilities on the AIME 2025 dataset.
Overall accuracy is high, but there is room for improvement in adhering to specific output-format requirements (seen in one case below) and in minimizing toxicity in code explanations.
The model performs consistently well across different AIME problem types and difficulty levels, showing no major weaknesses.
Failure modes observed
Common failure modes:
Minor formatting error leading to an incorrect evaluation, even when the underlying answer was conceptually correct.
Code-explanation outputs can carry a non-zero, though very low, toxicity score.
Example: In one instance, the model answered '336' while the ground truth was '336 degrees', producing a score of 0 even though the numeric value was correct. This highlights the model's sensitivity to strict output-format requirements under exact-match grading.
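A common mitigation for this failure mode is to normalize both the model's answer and the ground truth before comparison. The sketch below is illustrative only; the `normalize` helper and its unit list are our assumptions, not the grader used in this run.

```python
import re

UNIT_WORDS = {"degrees", "degree", "cm", "units"}  # illustrative unit list

def normalize(answer: str) -> str:
    """Lowercase, drop known unit words, and keep only the numeric core."""
    tokens = [t for t in answer.lower().split() if t not in UNIT_WORDS]
    core = " ".join(tokens)
    match = re.search(r"-?\d+(?:\.\d+)?", core)
    return match.group(0) if match else core.strip()

# '336' and '336 degrees' now compare equal, avoiding the spurious 0 score.
assert normalize("336") == normalize("336 degrees")
```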
Secondary metrics
Readability score: 2.7
Toxicity score: 0.000
Ethics score: 0.000
Run this evaluation yourself
Stratix evaluates GPT-5 continuously across 11+ benchmarks. To replicate this AIME 2025 evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
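If you would rather script a quick replication outside Stratix, a minimal harness might look like the following. The dataset file name, its 'problem' and 'answer' fields, and the prompt wording are assumptions for illustration; the OpenAI Python SDK calls are standard, but Stratix's own harness will differ.

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def solve(problem: str) -> str:
    """Ask the model for an AIME answer; the prompt format is illustrative."""
    reply = client.chat.completions.create(
        model="gpt-5",
        messages=[{
            "role": "user",
            "content": f"{problem}\n\nGive only the final integer answer.",
        }],
    )
    return reply.choices[0].message.content.strip()

# Hypothetical dataset file: one JSON object per line with
# 'problem' and 'answer' fields.
with open("aime_2025.jsonl") as f:
    items = [json.loads(line) for line in f]

correct = sum(solve(item["problem"]) == item["answer"] for item in items)
print(f"Accuracy: {correct / len(items):.1%}")
```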
Source: Stratix evaluation 689a1102d0c5ccd366b87d03. Updated 2025-08-11.