Qwen2.5 72B Instruct on AIME 2025: 6.7% accuracy

Author: The LayerLens Team

Last updated: 2025-05-20

The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.

Summary

Qwen2.5 72B Instruct from Qwen scored 6.7% on AIME 2025, ranking 121st of 140 models on this benchmark. That places the model in the weak band for AIME 2025: below the threshold for production reliance on this benchmark family, and suitable only for narrow, fully tested tasks.

Model details

  • Provider: Qwen

  • Model key: qwen/qwen-2.5-72b-instruct

  • Context length: 32,000 tokens

  • License: Commercial

  • Open weights: no

Benchmark methodology

Benchmark goal: Assess the mathematical reasoning and problem-solving abilities of large language models on challenging problems from the American Invitational Mathematics Examination (AIME).

Scoring metrics:

  • Score: Binary score (0 or 1) indicating whether the model's answer exactly matches the ground truth (see the sketch after this list).

  • Duration: Time taken by the model to generate the answer.

  • Toxicity score: A score indicating the toxicity level of the generated output.

  • Readability score: A score indicating the readability of the generated output.
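
For concreteness, here is a minimal sketch of how a binary exact-match score can be computed for AIME-style answers (integers from 0 to 999). The function names and normalization rules are illustrative assumptions, not Stratix's actual grader.

```python
def normalize_answer(text: str) -> str:
    """Strip whitespace and leading zeros so '042' and '42' compare equal."""
    text = text.strip()
    return str(int(text)) if text.isdigit() else text

def score_answer(model_answer: str, ground_truth: str) -> int:
    """Return 1 if the normalized answers match exactly, else 0."""
    return int(normalize_answer(model_answer) == normalize_answer(ground_truth))

assert score_answer(" 588 ", "588") == 1  # whitespace does not matter
assert score_answer("48", "588") == 0     # wrong answer scores zero
```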

Analysis

Key takeaways:

  • The model demonstrates proficiency on a narrow subset of mathematical problem-solving tasks, notably those requiring base conversions.

  • The model struggles with complex multi-step problems requiring geometric insights, combinatorial reasoning, and deep understanding of mathematical concepts.

  • Formatting constraints significantly degrade performance, suggesting sensitivity to prompt structure.

Failure modes observed

Common failure modes:

  • Incorrect application of formulas.

  • Misunderstanding of problem constraints.

  • Arithmetic errors in multi-step calculations.

  • Failure to adhere to strict formatting requirements (illustrated in the sketch after this list).

  • Inability to perform correct geometric reasoning.

Example: In the heptagon problem, the model hallucinates coordinates for points, misinterprets the geometric configuration, and arrives at an area of 48 when the correct answer is 588.
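
To make the formatting failure mode concrete, the sketch below shows how a strict extraction rule can zero out a correct answer stated in prose. The \boxed{...} convention is a common assumption for AIME harnesses, not a confirmed detail of this evaluation.

```python
import re

def extract_boxed(completion: str) -> str | None:
    """Return the contents of the last \\boxed{...} in the completion, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None

# A boxed answer is extracted and graded; the same answer in plain prose
# yields None and is scored 0, even though it is mathematically correct.
print(extract_boxed(r"Thus the area is \boxed{588}."))  # '588'
print(extract_boxed("Thus the area is 588."))           # None
```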

Secondary metrics

  • Readability score: 60.3 (see the sketch after this list)

  • Toxicity score: 0.005

  • Ethics score: 0.000
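
A readability score of 60.3 is consistent with the Flesch Reading Ease scale (0 to 100, higher meaning easier to read), though the exact formula Stratix uses is not documented here. A rough sketch under that assumption:

```python
import re

def count_syllables(word: str) -> int:
    """Approximate syllables as runs of vowels (crude but common heuristic)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    """206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)."""
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    n_sentences = max(1, len(re.findall(r"[.!?]+", text)))
    n_syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)

print(round(flesch_reading_ease("The area of the heptagon is 588. We verify this below."), 1))
```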

Run this evaluation yourself

Stratix evaluates Qwen2.5 72B Instruct continuously across 11+ benchmarks. To replicate this AIME 2025 evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
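
As a starting point, the sketch below replicates the evaluation loop against an OpenAI-compatible endpoint. The model key matches the one listed above; the client configuration, prompt, and extraction rule are assumptions about a typical setup, not Stratix's exact configuration.

```python
import re
from openai import OpenAI

client = OpenAI()  # point base_url / api_key at your provider of choice

# Placeholder: substitute the real AIME 2025 problems and reference answers.
problems = [
    {"question": "<AIME problem statement>", "answer": "<integer 0-999>"},
]

correct = 0
for p in problems:
    resp = client.chat.completions.create(
        model="qwen/qwen-2.5-72b-instruct",
        messages=[{
            "role": "user",
            "content": p["question"] + "\n\nGive the final answer as \\boxed{N}.",
        }],
    )
    # Grade with the same strict extraction rule sketched earlier.
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", resp.choices[0].message.content)
    correct += int(bool(boxed) and boxed[-1].strip() == p["answer"])

print(f"accuracy: {correct / len(problems):.1%}")
```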

_Source: Stratix evaluation 682be3767f28405d0b51d71e. Updated 2025-05-20._