
GPT-5 (high) on AIME 2025: 90.0% accuracy
Author: The LayerLens Team
The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.
Summary
GPT-5 (high) from OpenAI scored 90.0% accuracy on AIME 2025, placing it in the top 25 (rank 16 of 140) on this benchmark. This puts the model in the saturated band for AIME 2025: most frontier models cluster near this ceiling, so cross-benchmark behavior matters more than the headline score for production decisions.
Model details
Provider: OpenAI
Model key: openai/gpt-5-high
Context length: 400,000 tokens
License: Proprietary
Open weights: no
Benchmark methodology
Benchmark goal: Assess the problem-solving capabilities of LLMs on advanced mathematics problems.
Scoring metrics:
Accuracy: Percentage of questions answered correctly, matching the ground truth.
Toxicity, Readability, Hallucination: Quality and safety metrics for generated responses; no data was recorded for this run.
Duration: Time taken to generate a response, in seconds. Not available for this run.
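The accuracy metric above reduces to exact matching: AIME answers are integers from 0 to 999, so a prediction counts only if it parses to exactly the ground-truth integer. A minimal sketch of such a scorer is below; the function name and interface are illustrative, not part of the Stratix harness.

```python
def grade_aime(predictions, ground_truth):
    """Exact-match accuracy (%) for AIME-style answers.

    AIME answers are integers in [0, 999]; a prediction scores
    only if it parses to exactly the ground-truth integer.
    """
    assert len(predictions) == len(ground_truth)
    correct = 0
    for pred, truth in zip(predictions, ground_truth):
        try:
            if int(str(pred).strip()) == int(truth):
                correct += 1
        except ValueError:
            pass  # unparseable answers count as wrong
    return 100.0 * correct / len(ground_truth)

# 3 of 4 exact matches ("oops" fails to parse) -> 75.0
acc = grade_aime(["042", "7", "oops", "5"], [42, 7, 100, 5])
```

Note that leading zeros are tolerated ("042" matches 42) because grading compares parsed integers, not raw strings.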
Analysis
Key takeaways:
GPT-5 (high) achieves high accuracy on AIME problems and reliably follows the benchmark's strict answer-format rules.
Its main weaknesses are Geometry and Trigonometry questions, where errors stem from lapses in logical reasoning rather than formatting.
Failure modes observed
Common failure modes:
Incorrect application of geometrical theorems.
Errors in combinatorial calculations, such as overcounting or undercounting.
Misinterpretation of problem constraints.
Example: In one geometry problem (Question 14, AIME 2025 Part I), the model gave an incorrect answer after miscalculating lengths and angles following a polygon rotation. In a trigonometry problem (Question 6, AIME 2025 Part II), the answer was not expressed in the requested format (degrees).
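Format failures like the trigonometry example above can be caught mechanically: AIME requires a bare integer between 0 and 999, so any answer carrying units or extra text fails validation. The regex check below is a hypothetical illustration of such a validator, not the actual Stratix grading code.

```python
import re

# A valid AIME answer is a bare 1-3 digit integer (0-999),
# optionally surrounded by whitespace.
AIME_ANSWER = re.compile(r"^\s*(\d{1,3})\s*$")

def extract_aime_answer(text: str):
    """Return the answer as an int if it is a bare integer 0-999, else None."""
    m = AIME_ANSWER.match(text)
    if not m:
        return None  # e.g. "45 degrees" is rejected: extra text after the number
    return int(m.group(1))

print(extract_aime_answer("204"))         # accepted
print(extract_aime_answer("45 degrees"))  # rejected (trailing units)
```

Because `\d{1,3}` caps the match at three digits, out-of-range answers such as "1000" are rejected along with non-numeric output.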
Secondary metrics
Readability score: 0.0
Toxicity score: 0.000
Ethics score: 0.000
Run this evaluation yourself
Stratix evaluates GPT-5 (high) continuously across 11+ benchmarks. To replicate this AIME 2025 evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
Source: Stratix evaluation 68fce3b2d22669fc16613a8a. Updated 2025-10-25.