GLM 5.1 on AIME 2025: 90.0% accuracy

Author: The LayerLens Team

The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.

Summary

GLM 5.1 from z-ai scored 90.0% on AIME 2025, ranking 19th of 140 models (top 25) on this benchmark. That places the model in the saturated band for AIME 2025: most frontier models cluster near this ceiling, so cross-benchmark behavior matters more than the headline score for production decisions.

Model details

  • Provider: z-ai

  • Model key: z-ai/glm-5.1

  • Context length: 202,752 tokens

  • License: MIT

  • Open weights: yes

Benchmark methodology

Benchmark goal: AIME 2025 evaluates single-shot problem solving in advanced competition mathematics, specifically algebra, geometry, number theory, and combinatorics.

Scoring metrics:

  • Accuracy: A model receives credit only if it produces the exact integer answer to a problem (AIME answers are integers from 0 to 999); a minimal grading sketch follows this list.
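
A rough illustration of what exact-match grading can look like in practice, in Python. The function name and the last-integer extraction heuristic are assumptions for this sketch, not Stratix's actual implementation:

    import re

    def grade_aime_answer(model_output: str, expected: int) -> bool:
        """Exact-match grading: credit only when the model's final
        answer equals the expected integer (AIME answers are 0-999)."""
        # Heuristic: treat the last integer in the output as the answer.
        matches = re.findall(r"\d+", model_output)
        return bool(matches) and int(matches[-1]) == expected

    # The trailing "204" is extracted and compared exactly.
    assert grade_aime_answer("Careful case analysis gives 204.", 204)
    assert not grade_aime_answer("The answer is 205.", 204)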

Analysis

Key takeaways:

  • GLM 5.1 achieved 90.0% accuracy on the AIME 2025 benchmark, solving the large majority of its advanced mathematics problems correctly.

  • The model excels in problems requiring direct computation and formula application in algebra, geometry, and number theory.

  • Areas for improvement include complex combinatorial counting and precise handling of cyclic patterns in number theory problems involving large exponents or sequences (see the modular-cycle sketch after this list).

  • The model's detailed step-by-step reasoning is often sound, but specific calculation errors or oversight of subtle conditions can lead to incorrect final answers.
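
The cyclic-pattern weakness concerns problems where a huge exponent must be reduced via the periodicity of powers modulo some base. The sketch below (a made-up exercise, not a problem from the actual AIME 2025 set) finds such a cycle explicitly and checks the result against Python's built-in modular exponentiation:

    def power_mod_by_cycle(a: int, n: int, m: int) -> int:
        """Compute a**n mod m by locating the cycle in successive
        powers of a modulo m, the periodicity these problems hinge on."""
        seen = {}  # residue -> exponent at which it first appeared
        value, exp = a % m, 1
        while value not in seen:
            seen[value] = exp
            value = (value * a) % m
            exp += 1
        cycle_start = seen[value]
        cycle_len = exp - cycle_start
        if n < cycle_start:
            return pow(a, n, m)
        # Fold the exponent back into the periodic part of the sequence.
        reduced = cycle_start + (n - cycle_start) % cycle_len
        return pow(a, reduced, m)

    # Powers of 7 mod 100 cycle with period 4 (07, 49, 43, 01), and
    # 2025 = 4 * 506 + 1, so 7**2025 ends in 07.
    assert power_mod_by_cycle(7, 2025, 100) == pow(7, 2025, 100) == 7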

Failure modes observed

Common failure modes:

  • Incorrect calculation in combinatorial problems, leading to a wrong count of permutations or combinations (a brute-force cross-check sketch follows this list).

  • Misinterpretation of problem constraints or conditions.

  • Numerical errors or missteps in complex arithmetic or geometric constructions, despite correct initial setup.
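
One way such counting slips surface is a closed-form count that disagrees with brute-force enumeration. A small illustrative cross-check (again a made-up exercise, not taken from the benchmark):

    from itertools import permutations
    from math import factorial

    # Count permutations of "ABCDE" in which A appears before B.
    # Closed form: 5!/2, since A-before-B and B-before-A are symmetric.
    closed_form = factorial(5) // 2

    brute_force = sum(
        1 for p in permutations("ABCDE") if p.index("A") < p.index("B")
    )

    # A mismatch here is exactly the kind of counting error flagged above.
    assert closed_form == brute_force == 60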

Secondary metrics

  • Readability score: 0.0

  • Toxicity score: 0.000

  • Ethics score: 0.000

Run this evaluation yourself

Stratix evaluates GLM 5.1 continuously across 11+ benchmarks. To replicate this AIME 2025 evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
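
Stratix's own harness isn't reproduced here, but the core loop of a single-shot, exact-match AIME evaluation is small. In this sketch every name (query_model, the problem dict shape, final_integer) is a placeholder assumption, not Stratix's API:

    import re

    def final_integer(text: str) -> int | None:
        """Heuristic answer extraction: the last integer in the output."""
        matches = re.findall(r"\d+", text)
        return int(matches[-1]) if matches else None

    def evaluate(problems, query_model) -> float:
        """Single-shot exact-match accuracy over problems shaped like
        {"question": str, "answer": int}."""
        correct = sum(
            1 for p in problems
            if final_integer(query_model(p["question"])) == p["answer"]
        )
        return correct / len(problems)

    # Smoke test with a stub model that always answers 42.
    demo = [{"question": "What is 6 * 7?", "answer": 42}]
    print(evaluate(demo, lambda q: "6 * 7 = 42, so the answer is 42."))  # 1.0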

Source: Stratix evaluation 69d54f99588ff89e9700d81f. Updated 2026-04-07.