Qwen2.5 72B Instruct on Humanity's Last Exam: 3.7% accuracy

Author: The LayerLens Team


The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.

Summary

Qwen2.5 72B Instruct from Qwen scored 3.7% on Humanity's Last Exam, placing it at rank 98 on this benchmark's leaderboard. This puts the model in the weak band for Humanity's Last Exam: it falls below the threshold for production reliance on this benchmark family and should be considered only for narrow, fully tested tasks.

Model details

  • Provider: Qwen

  • Model key: qwen/qwen-2.5-72b-instruct

  • Context length: 32,000 tokens

  • License: Commercial

  • Open weights: no

Benchmark methodology

Benchmark goal: Assess the LLM's proficiency in question answering across various domains, focusing on reasoning, format adherence, and knowledge retrieval.

Scoring metrics:

  • Score: Binary score indicating correctness of the answer (0 for incorrect, 1 for correct).

  • Toxicity score: Float value between 0 and 1 representing the toxicity level of the generated text, with higher scores indicating more toxic content.

  • Readability score: Numerical score indicating the readability of the generated text.

  • Duration: Time in seconds the model takes to answer the question.
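The per-question metrics above can be sketched as a simple record with basic validation. This is an illustrative data structure only; the field names are assumptions, not the actual Stratix schema.

```python
from dataclasses import dataclass

@dataclass
class QuestionResult:
    """One graded answer. Field names are illustrative, not the Stratix schema."""
    score: int          # binary correctness: 0 (incorrect) or 1 (correct)
    toxicity: float     # 0.0 (benign) to 1.0 (highly toxic)
    readability: float  # numerical readability of the generated text
    duration: float     # wall-clock seconds the model took to answer

    def __post_init__(self):
        # Enforce the metric ranges described above.
        if self.score not in (0, 1):
            raise ValueError("score must be 0 or 1")
        if not 0.0 <= self.toxicity <= 1.0:
            raise ValueError("toxicity must be in [0, 1]")
        if self.duration < 0:
            raise ValueError("duration must be non-negative")

# Example record using values in the ranges this report describes.
result = QuestionResult(score=0, toxicity=0.002, readability=46.7, duration=12.5)
```

Keeping validation in the record itself means malformed rows fail fast during evaluation rather than silently skewing aggregate scores.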

Analysis

Key takeaways:

  • Qwen2.5 72B Instruct shows mixed performance across the benchmark.

  • The model exhibits reasonable text generation capabilities (toxicity, readability).

  • Scores were strictly binary (0 or 1), so the grading gives no credit for partially correct answers.
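Because each answer is graded 0 or 1, the headline accuracy is simply the mean of the binary scores expressed as a percentage. A minimal sketch:

```python
def accuracy(scores):
    """Mean of binary 0/1 correctness scores, as a percentage."""
    if not scores:
        raise ValueError("no scores to aggregate")
    if any(s not in (0, 1) for s in scores):
        raise ValueError("scores must be binary (0 or 1)")
    return 100.0 * sum(scores) / len(scores)

print(accuracy([1, 0, 0, 0]))  # 25.0
```

Under this scheme, a 3.7% result corresponds to roughly 37 correct answers per 1,000 questions attempted.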

Failure modes observed

Common failure modes:

  • Incorrect answers due to complex reasoning requirements.

  • Failure to adhere to the requested output format.

  • Inability to handle tasks requiring external knowledge (e.g., identifying items in an unseen image).

  • Hallucinations and confabulations.

Example: The model often fails questions that require image analysis or real-time external data access (e.g., identifying objects in an image it cannot see, or computing exact values that depend on external information).

Secondary metrics

  • Readability score: 46.7

  • Toxicity score: 0.002

  • Ethics score: 0.000
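The report does not say which readability formula Stratix uses. One common choice is the Flesch reading-ease score, sketched below with a deliberately naive syllable counter; this is an assumption for illustration, not the documented metric.

```python
import re

def naive_syllables(word):
    """Rough syllable count: runs of vowels, minimum one per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch reading ease:
    206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words).
    Higher scores mean easier text; scores in the mid-40s are
    conventionally read as difficult, college-level prose."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(naive_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))
```

If the 46.7 above is indeed a Flesch-style score, it would indicate fairly dense output text, which is unsurprising for a benchmark of expert-level questions.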

Run this evaluation yourself

Stratix evaluates Qwen2.5 72B Instruct continuously across 11+ benchmarks. To replicate this Humanity's Last Exam evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.

_Source: Stratix evaluation 682bef057a279d418223e23f. Updated 2025-05-20._