Mar 4, 2025
As reasoning models continue to advance, new benchmarking datasets have become essential for assessing their capabilities. Many older benchmarks, like MMLU, are now outdated—frontier models routinely score above 90%, making them less useful for distinguishing real progress. In response, AI safety researchers and infrastructure providers are developing harder, more practical datasets to better reflect the evolving landscape of AI capabilities.
One such dataset is Humanity’s Last Exam (HLE), created by Scale AI and the Center for AI Safety (CAIS). Designed to resist saturation by frontier models, HLE spans mathematics, reasoning, history, and science. Unlike traditional datasets built by small teams of experts, HLE is crowdsourced, allowing people from diverse backgrounds to submit complex and creative questions that push AI reasoning to its limits.
Initial Results: Testing Frontier AI Models on HLE
This week, we present several example prompts from HLE, taken directly from evaluations run on the LayerLens Atlas Platform.
The left column shows the question posed to the model, while the result and truth columns compare the model's answer against the ground truth, indicating whether the model answered correctly.

Anthropic's Claude 3.7 Sonnet, performance on the first question

Google Gemini 1.5 answering a question correctly
These questions, even when static or multiple-choice in structure, are extremely complex and demand high-end reasoning to solve correctly. Most models struggle with the benchmark: frontier models such as OpenAI o1, DeepSeek R1, and Claude 3.7 Sonnet all score below 10%.
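To make that scoring concrete, here is a minimal, hypothetical sketch of how a harness might grade static or multiple-choice items by exact match against ground truth and report overall accuracy. The `EvalItem`, `grade_item`, and `accuracy` names are illustrative only and do not reflect the actual Atlas Platform or HLE grading code.

```python
from dataclasses import dataclass

@dataclass
class EvalItem:
    question: str    # prompt shown to the model
    truth: str       # ground-truth answer from the dataset
    prediction: str  # answer returned by the model

def grade_item(item: EvalItem) -> bool:
    """Exact-match grading after light normalization (illustrative only)."""
    return item.prediction.strip().lower() == item.truth.strip().lower()

def accuracy(items: list[EvalItem]) -> float:
    """Fraction of items the model answered correctly."""
    if not items:
        return 0.0
    return sum(grade_item(i) for i in items) / len(items)

# Hypothetical example: one correct answer out of three items -> ~33% accuracy.
items = [
    EvalItem("Q1: ...", truth="B", prediction="B"),
    EvalItem("Q2: ...", truth="C", prediction="A"),
    EvalItem("Q3: ...", truth="42", prediction="7"),
]
print(f"Accuracy: {accuracy(items):.1%}")
```

In practice, free-form HLE answers would need a more forgiving grader (for example, an LLM-based judge), but the aggregate metric reported for frontier models is the same simple fraction of correct answers.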
The Limitations of HLE: How Useful Are Extremely Hard Benchmarks?
HLE is undeniably a difficult benchmark, but its practical value remains uncertain. Some of its questions are far removed from how LLMs are typically used by consumers and enterprises, raising the question: are we prioritizing difficulty over real-world relevance? While challenging datasets are necessary to push AI forward, they must also reflect the practical needs of those who use these models.
For instance, HLE contains complex queries requiring specialized knowledge or intricate reasoning—scenarios that may not align with the everyday applications of LLMs. While these questions test the upper limits of model reasoning, they do not necessarily measure real-world effectiveness.

Example: A mathematics question from Humanity’s Last Exam
This misalignment between benchmark tasks and practical applications risks distorting our understanding of a model’s utility. A system that excels at answering obscure, highly technical HLE questions might not be the best at providing clear, reliable insights in real-world contexts.
Striking the right balance between difficulty and applicability is critical. AI models should be evaluated on their ability to handle both extreme reasoning tasks and practical challenges that matter to users.
HLE serves as a valuable stress test for LLMs, but recognizing its limitations is essential. Ensuring that AI evaluation remains both rigorous and relevant will help bridge the gap between theoretical advancements and meaningful, real-world impact.