GLM 5.1 on AIME 2026: 93.3% accuracy

Author: The LayerLens Team

The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.

Summary

GLM 5.1 from z-ai scored 93.3% accuracy on AIME 2026, ranking third of 14 models and landing in the saturated band for this benchmark. Most frontier models cluster near this ceiling, so cross-benchmark behavior matters more than the headline score for production decisions.

Model details

  • Provider: z-ai

  • Model key: z-ai/glm-5.1

  • Context length: 202,752 tokens

  • License: MIT

  • Open weights: yes

Benchmark methodology

Benchmark goal: AIME 2026 evaluates single-shot mathematical problem solving by LLMs across a range of advanced topics.

Scoring metrics:

  • Accuracy: (number of problems where the model's final answer exactly matches the reference answer, an integer from 0 to 999 / total number of problems) × 100. A minimal scoring sketch appears below.
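
As a concrete illustration, here is a minimal sketch of this exact-match scoring rule in Python. The function names and the answer-extraction pattern are assumptions for illustration, not Stratix's actual implementation.

```python
import re


def extract_answer(completion: str) -> int | None:
    """Pull the final integer answer (0-999) from a model completion.

    Assumption: the model was prompted to end with 'Answer: <integer>'.
    """
    match = re.search(r"Answer:\s*(\d{1,3})\b", completion)
    if match is None:
        return None
    value = int(match.group(1))
    return value if 0 <= value <= 999 else None


def accuracy(completions: list[str], references: list[int]) -> float:
    """Exact-match accuracy: (correct / total) * 100."""
    correct = sum(
        extract_answer(c) == ref for c, ref in zip(completions, references)
    )
    return 100.0 * correct / len(references)


# Example: 28 of 30 problems correct -> 93.3%.
print(accuracy(["... Answer: 42"], [42]))  # 100.0 on this one-item toy set
```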

Analysis

Key takeaways:

  • GLM 5.1 achieved an accuracy of 93.3% on the AIME 2026 benchmark, indicating strong mathematical problem-solving capability.

  • The model performed well across a wide range of advanced topics, including number theory, probability, geometry, and algebra.

  • The main area for improvement is highly complex combinatorial counting, where subtle constraints in the problem statement can push the final count slightly off.

Failure modes observed

Common failure modes:

  • Miscalculation in complex combinatorial counting problems.

  • Possible misinterpretation of how nested structures or specific rules affect the count of valid configurations.

Example: On the 'partition a 10x10 grid of cells into 5 cell loops' problem, the model answered 81 against a reference answer of 83. This points to a gap in handling complex recursive or nested counting, where a few valid arrangements were missed or overcounted; under exact-match scoring, that near-miss earns no credit.

Secondary metrics

  • Readability score: 0.0

  • Toxicity score: 0.000

  • Ethics score: 0.000

Run this evaluation yourself

Stratix evaluates GLM 5.1 continuously across 11+ benchmarks. To replicate this AIME 2026 evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
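
For a quick scripted replication outside Stratix, a loop like the one below works against any OpenAI-compatible endpoint serving the model. The endpoint URL, environment variables, and dataset file format are assumptions for illustration; only the model key z-ai/glm-5.1 comes from this report. It reuses accuracy() from the scoring sketch above.

```python
import json
import os

from openai import OpenAI  # any OpenAI-compatible client works here


# Assumption: an OpenAI-compatible endpoint serving z-ai/glm-5.1, plus a local
# JSONL file of {"problem": ..., "answer": ...} records (not Stratix's format).
client = OpenAI(
    base_url=os.environ.get("LLM_BASE_URL", "https://api.example.com/v1"),
    api_key=os.environ["LLM_API_KEY"],
)


def solve(problem: str) -> str:
    """Ask the model for a single-shot solution ending in 'Answer: <integer>'."""
    response = client.chat.completions.create(
        model="z-ai/glm-5.1",
        messages=[
            {"role": "system",
             "content": "Solve the problem. End with 'Answer: <integer 0-999>'."},
            {"role": "user", "content": problem},
        ],
    )
    return response.choices[0].message.content


with open("aime_2026.jsonl") as f:
    records = [json.loads(line) for line in f]

completions = [solve(r["problem"]) for r in records]
references = [int(r["answer"]) for r in records]
print(f"Accuracy: {accuracy(completions, references):.1f}%")  # from the sketch above
```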

Source: Stratix evaluation 69d58412af2dc1e607ea0f03. Updated 2026-04-07.