
GPT-5 (high) on AIME 2025: 90.0% accuracy
Author: The LayerLens Team
The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.
Summary
GPT-5 (high) from OpenAI scored 90.0% accuracy on AIME 2025, placing it in the top 25 (rank 16 of 140) on this benchmark. This puts the model in the saturated band for AIME 2025: most frontier models cluster near this ceiling, so cross-benchmark behavior matters more than the headline score for production decisions.
Model details
Provider: OpenAI
Model key: openai/gpt-5-high
Context length: 400,000 tokens
License: Proprietary
Open weights: no
Benchmark methodology
Benchmark goal: Assess the problem-solving capabilities of LLMs on advanced mathematics problems.
Scoring metrics:
Accuracy: Percentage of questions answered correctly, matching the ground truth.
Toxicity, Readability, Hallucination: Quality and safety metrics for generated responses; no data was recorded for this run.
Duration: Time taken to generate a response, in seconds. Not available for this run.
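The accuracy metric above reduces to exact matching: AIME answers are integers from 0 to 999, so a prediction counts only if it parses to exactly the ground-truth integer. A minimal sketch of such a scorer is below; the function name and interface are illustrative, not part of the Stratix harness.

```python
def grade_aime(predictions, ground_truth):
    """Exact-match accuracy (%) for AIME-style answers.

    AIME answers are integers in [0, 999]; a prediction scores
    only if it parses to exactly the ground-truth integer.
    """
    assert len(predictions) == len(ground_truth)
    correct = 0
    for pred, truth in zip(predictions, ground_truth):
        try:
            if int(str(pred).strip()) == int(truth):
                correct += 1
        except ValueError:
            pass  # unparseable answers count as wrong
    return 100.0 * correct / len(ground_truth)

# 3 of 4 exact matches ("oops" fails to parse) -> 75.0
acc = grade_aime(["042", "7", "oops", "5"], [42, 7, 100, 5])
```

Note that leading zeros are tolerated ("042" matches 42) because grading compares parsed integers, not raw strings.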
Analysis
Key takeaways:
GPT-5 (high) achieves high accuracy on AIME problems and reliably follows the benchmark's strict answer-format rules.
Its main weaknesses are Geometry and Trigonometry questions, where errors stem from lapses in logical reasoning rather than formatting.
Failure modes observed
Common failure modes:
Incorrect application of geometrical theorems.
Errors in combinatorial calculations, such as overcounting or undercounting.
Misinterpretation of problem constraints.
Example: In one geometry problem (Question 14, AIME 2025 Part I), the model gave an incorrect answer after miscalculating lengths and angles following a polygon rotation. In a trigonometry problem (Question 6, AIME 2025 Part II), the answer was not expressed in the requested format (degrees).
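Format failures like the trigonometry example above can be caught mechanically: AIME requires a bare integer between 0 and 999, so any answer carrying units or extra text fails validation. The regex check below is a hypothetical illustration of such a validator, not the actual Stratix grading code.

```python
import re

# A valid AIME answer is a bare 1-3 digit integer (0-999),
# optionally surrounded by whitespace.
AIME_ANSWER = re.compile(r"^\s*(\d{1,3})\s*$")

def extract_aime_answer(text: str):
    """Return the answer as an int if it is a bare integer 0-999, else None."""
    m = AIME_ANSWER.match(text)
    if not m:
        return None  # e.g. "45 degrees" is rejected: extra text after the number
    return int(m.group(1))

print(extract_aime_answer("204"))         # accepted
print(extract_aime_answer("45 degrees"))  # rejected (trailing units)
```

Because `\d{1,3}` caps the match at three digits, out-of-range answers such as "1000" are rejected along with non-numeric output.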
Secondary metrics
Readability score: 0.0
Toxicity score: 0.000
Ethics score: 0.000
Run this evaluation yourself
Stratix evaluates GPT-5 (high) continuously across 11+ benchmarks. To replicate this AIME 2025 evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
Source: Stratix evaluation 68fce3b2d22669fc16613a8a. Updated 2025-10-25.