
Qwen2.5 72B Instruct on Humanity's Last Exam: 3.7% accuracy
Author:
The LayerLens Team
The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.
Summary
Qwen2.5 72B Instruct from Qwen scored 3.7% on Humanity's Last Exam, ranking 98 of 97 on this benchmark. This places the model in the weak band for Humanity's Last Exam, below the threshold for production reliance on this benchmark family; consider it only for narrow, fully tested tasks.
Model details
Provider: Qwen
Model key: qwen/qwen-2.5-72b-instruct
Context length: 32,000 tokens
License: Commercial
Open weights: no
Benchmark methodology
Benchmark goal: Assess the LLM's proficiency in question answering across various domains, focusing on reasoning, format adherence, and knowledge retrieval.
Scoring metrics:
Score: Binary score indicating correctness of the answer (0 for incorrect, 1 for correct).
Toxicity score: Float between 0 and 1 representing the toxicity of the generated text; higher scores indicate more toxic content.
Readability score: Numerical score indicating the readability of the generated text.
Duration: Time in seconds the model takes to answer the question.
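The scoring above can be sketched in a few lines: per-question records with binary correctness scores are averaged into a benchmark-level accuracy percentage. The field names (`score`, `duration`) are illustrative and not Stratix's actual result schema.

```python
def aggregate(results):
    """Average binary correctness scores into an accuracy percentage,
    and average per-question answer time in seconds."""
    n = len(results)
    accuracy = 100.0 * sum(r["score"] for r in results) / n
    avg_duration = sum(r["duration"] for r in results) / n
    return {"accuracy_pct": round(accuracy, 1),
            "avg_duration_s": round(avg_duration, 2)}

# Hypothetical run: 27 correct answers out of 730 questions works out
# to roughly 3.7% accuracy under this scheme.
results = ([{"score": 1, "duration": 4.0}] * 27
           + [{"score": 0, "duration": 4.0}] * 703)
print(aggregate(results))
```

Because each answer earns either 0 or 1, the headline accuracy is simply the fraction of fully correct answers; partial credit is not possible under this metric.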
Analysis
Key takeaways:
Qwen2.5 72B Instruct shows mixed performance across the benchmark.
The model exhibits reasonable text generation capabilities (toxicity, readability).
Scores were either 0 or 1, indicating a lack of nuanced, partially correct answers.
Failure modes observed
Common failure modes:
Incorrect answers due to complex reasoning requirements.
Failure to adhere to the requested output format.
Inability to handle tasks requiring external knowledge (e.g., identifying items in an unseen image).
Hallucinations and confabulations.
Example: The model often fails questions involving image analysis or real-time external data access (e.g., identifying objects in an image when no image is provided, or calculating exact values that depend on external information).
Secondary metrics
Readability score: 46.7
Toxicity score: 0.002
Ethics score: 0.000
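The report does not specify which readability formula Stratix applies; one common choice is Flesch reading ease, where a score near 46.7 corresponds to fairly difficult, college-level text. A minimal sketch of that formula, assuming a naive vowel-group syllable counter, looks like this:

```python
import re

def count_syllables(word):
    """Approximate syllables as runs of vowels (a rough heuristic)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch reading ease: higher scores mean easier text.
    206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) \
        - 84.6 * (syllables / len(words))

print(flesch_reading_ease("The cat sat on the mat."))
```

Short, monosyllabic sentences score above 100; dense technical prose with long words and long sentences drifts toward the 40s, which is the band this model's outputs fall into.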
Run this evaluation yourself
Stratix evaluates Qwen2.5 72B Instruct continuously across 11+ benchmarks. To replicate this Humanity's Last Exam evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
_Source: Stratix evaluation 682bef057a279d418223e23f. Updated 2025-05-20._