
Qwen2.5 72B Instruct on AIME 2025: 6.7% accuracy
Author:
The LayerLens Team
The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.
Summary
Qwen2.5 72B Instruct from Qwen scored 6.7% on AIME 2025, placing it at rank 121 of 140 and in the weak band for this benchmark. This is below the threshold for production reliance on this benchmark family; consider the model only for narrow, fully tested tasks.
Model details
Provider: Qwen
Model key: qwen/qwen-2.5-72b-instruct
Context length: 32,000 tokens
License: Commercial
Open weights: no
Benchmark methodology
Benchmark goal: Assess the mathematical reasoning and problem-solving abilities of large language models on challenging problems from the American Invitational Mathematics Examination (AIME).
Scoring metrics:
Score: Binary score (0 or 1) indicating whether the model's answer matches the ground truth.
Duration: Time taken by the model to generate the answer.
Toxicity score: A score indicating the toxicity level of the generated output.
Readability score: A score indicating the readability of the generated output.
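The binary score can be sketched as a simple exact-match check. This is an illustrative reconstruction, not the actual Stratix harness: the function name, the last-integer extraction heuristic, and the 0-999 answer range (standard for AIME) are assumptions.

```python
import re

def score_aime_answer(model_output: str, ground_truth: int) -> int:
    """Binary scoring sketch: 1 if the extracted final answer equals
    the ground truth, else 0. AIME answers are integers from 0 to 999."""
    # Heuristic: treat the last 1-3 digit integer in the output as
    # the model's final answer.
    matches = re.findall(r"\b\d{1,3}\b", model_output)
    if not matches:
        return 0
    return 1 if int(matches[-1]) == ground_truth else 0
```

Under this metric there is no partial credit: a solution with one arithmetic slip in the final step scores the same as a blank response.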
Analysis
Key takeaways:
The model demonstrates proficiency in some mathematical problem-solving tasks, specifically those requiring base conversions.
The model struggles with complex multi-step problems requiring geometric insights, combinatorial reasoning, and deep understanding of mathematical concepts.
Formatting constraints significantly impact performance, suggesting a sensitivity to prompt structure.
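The formatting sensitivity noted above can be illustrated with a strict answer extractor. The template and function below are hypothetical, shown only to demonstrate how a correct solution phrased outside the required format can still score zero:

```python
import re
from typing import Optional

def extract_strict(output: str) -> Optional[int]:
    """Hypothetical strict parser that only accepts 'Answer: <integer>'.
    Any deviation from the template yields no answer, which is scored 0
    regardless of whether the reasoning was correct."""
    m = re.search(r"Answer:\s*(\d{1,3})\b", output)
    return int(m.group(1)) if m else None
```

A model that writes "The final result is 588." rather than "Answer: 588" would fail such a check despite solving the problem, which is one way prompt-structure sensitivity depresses measured accuracy.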
Failure modes observed
Common failure modes:
Incorrect application of formulas.
Misunderstanding of problem constraints.
Arithmetic errors in multi-step calculations.
Failure to adhere to strict formatting requirements.
Inability to perform correct geometric reasoning.
Example: In the heptagon problem, the model hallucinates coordinates for points, misinterprets the geometric configuration, and arrives at an area of 48 when the correct answer is 588.
Secondary metrics
Readability score: 60.3
Toxicity score: 0.005
Ethics score: 0.000
Run this evaluation yourself
Stratix evaluates Qwen2.5 72B Instruct continuously across 11+ benchmarks. To replicate this AIME 2025 evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
_Source: Stratix evaluation 682be3767f28405d0b51d71e. Updated 2025-05-20._