GLM 5.1 on AIME 2025: 90.0% accuracy

Author: The LayerLens Team

The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.

Summary

GLM 5.1 from z-ai scored 90.0% on AIME 2025, ranking 19th of 140 models (top 25) on this benchmark. That places the model in the saturated band for AIME 2025: most frontier models cluster near this ceiling, so cross-benchmark behavior matters more than the headline score for production decisions.

Model details

  • Provider: z-ai

  • Model key: z-ai/glm-5.1

  • Context length: 202,752 tokens

  • License: MIT

  • Open weights: yes

Benchmark methodology

Benchmark goal: AIME 2025 evaluates single-shot problem solving in advanced competition mathematics, specifically algebra, geometry, number theory, and combinatorics.

Scoring metrics:

  • Accuracy: A model receives credit only if it produces the exact integer answer to a problem (AIME answers are integers from 0 to 999); a minimal grading sketch follows this list.
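
A rough illustration of what exact-match grading can look like in practice, in Python. The function name and the last-integer extraction heuristic are assumptions for this sketch, not Stratix's actual implementation:

    import re

    def grade_aime_answer(model_output: str, expected: int) -> bool:
        """Exact-match grading: credit only when the model's final
        answer equals the expected integer (AIME answers are 0-999)."""
        # Heuristic: treat the last integer in the output as the answer.
        matches = re.findall(r"\d+", model_output)
        return bool(matches) and int(matches[-1]) == expected

    # The trailing "204" is extracted and compared exactly.
    assert grade_aime_answer("Careful case analysis gives 204.", 204)
    assert not grade_aime_answer("The answer is 205.", 204)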

Analysis

Key takeaways:

  • GLM 5.1 achieved 90.0% accuracy on the AIME 2025 benchmark, solving the large majority of its advanced mathematics problems correctly.

  • The model excels in problems requiring direct computation and formula application in algebra, geometry, and number theory.

  • Areas for improvement include complex combinatorial counting and precise handling of cyclic patterns in number theory problems involving large exponents or sequences (see the modular-cycle sketch after this list).

  • The model's detailed step-by-step reasoning is often sound, but specific calculation errors or oversight of subtle conditions can lead to incorrect final answers.
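
The cyclic-pattern weakness concerns problems where a huge exponent must be reduced via the periodicity of powers modulo some base. The sketch below (a made-up exercise, not a problem from the actual AIME 2025 set) finds such a cycle explicitly and checks the result against Python's built-in modular exponentiation:

    def power_mod_by_cycle(a: int, n: int, m: int) -> int:
        """Compute a**n mod m by locating the cycle in successive
        powers of a modulo m, the periodicity these problems hinge on."""
        seen = {}  # residue -> exponent at which it first appeared
        value, exp = a % m, 1
        while value not in seen:
            seen[value] = exp
            value = (value * a) % m
            exp += 1
        cycle_start = seen[value]
        cycle_len = exp - cycle_start
        if n < cycle_start:
            return pow(a, n, m)
        # Fold the exponent back into the periodic part of the sequence.
        reduced = cycle_start + (n - cycle_start) % cycle_len
        return pow(a, reduced, m)

    # Powers of 7 mod 100 cycle with period 4 (07, 49, 43, 01), and
    # 2025 = 4 * 506 + 1, so 7**2025 ends in 07.
    assert power_mod_by_cycle(7, 2025, 100) == pow(7, 2025, 100) == 7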

Failure modes observed

Common failure modes:

  • Incorrect calculation in combinatorial problems, leading to a wrong count of permutations or combinations (a brute-force cross-check sketch follows this list).

  • Misinterpretation of problem constraints or conditions.

  • Numerical errors or missteps in complex arithmetic or geometric constructions, despite correct initial setup.
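
One way such counting slips surface is a closed-form count that disagrees with brute-force enumeration. A small illustrative cross-check (again a made-up exercise, not taken from the benchmark):

    from itertools import permutations
    from math import factorial

    # Count permutations of "ABCDE" in which A appears before B.
    # Closed form: 5!/2, since A-before-B and B-before-A are symmetric.
    closed_form = factorial(5) // 2

    brute_force = sum(
        1 for p in permutations("ABCDE") if p.index("A") < p.index("B")
    )

    # A mismatch here is exactly the kind of counting error flagged above.
    assert closed_form == brute_force == 60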

Secondary metrics

  • Readability score: 0.0

  • Toxicity score: 0.000

  • Ethics score: 0.000

Run this evaluation yourself

Stratix evaluates GLM 5.1 continuously across 11+ benchmarks. To replicate this AIME 2025 evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
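
Stratix's own harness isn't reproduced here, but the core loop of a single-shot, exact-match AIME evaluation is small. In this sketch every name (query_model, the problem dict shape, final_integer) is a placeholder assumption, not Stratix's API:

    import re

    def final_integer(text: str) -> int | None:
        """Heuristic answer extraction: the last integer in the output."""
        matches = re.findall(r"\d+", text)
        return int(matches[-1]) if matches else None

    def evaluate(problems, query_model) -> float:
        """Single-shot exact-match accuracy over problems shaped like
        {"question": str, "answer": int}."""
        correct = sum(
            1 for p in problems
            if final_integer(query_model(p["question"])) == p["answer"]
        )
        return correct / len(problems)

    # Smoke test with a stub model that always answers 42.
    demo = [{"question": "What is 6 * 7?", "answer": 42}]
    print(evaluate(demo, lambda q: "6 * 7 = 42, so the answer is 42."))  # 1.0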

Source: Stratix evaluation 69d54f99588ff89e9700d81f. Updated 2026-04-07.