
GPT-5 on Humanity's Last Exam: 21.7% accuracy
Author:
The LayerLens Team
The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.
Summary
GPT-5 from OpenAI scored 21.7% accuracy on Humanity's Last Exam, ranking 8th of 97 models (top 10) on this benchmark. Despite the high relative ranking, the absolute score places the model in the weak band for Humanity's Last Exam: below the threshold for production reliance on this benchmark family. Consider it only for narrow, fully tested tasks.
Model details
Provider: OpenAI
Model key: openai/gpt-5
Context length: 400,000 tokens
License: Proprietary
Open weights: no
Benchmark methodology
Benchmark goal: Assess the LLM's ability to answer complex, knowledge-intensive questions in biology and medicine, requiring reasoning, synthesis of information, understanding of experimental design, and quantitative skills.
Scoring metrics:
Accuracy: Percentage of questions answered correctly.
Toxicity: Average toxicity score of the generated responses, as measured by a toxicity classifier.
Readability: Average Flesch-Kincaid readability score of the generated responses.
Hallucination: A binary variable indicating whether the answer contains hallucinated content.
Duration: Time the model takes to respond to a question.
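As a rough illustration of how the readability metric above can be computed, here is a minimal sketch of the Flesch Reading Ease formula using a naive vowel-group syllable counter. This is an assumption-laden approximation: production scorers use dictionary-based syllabification, and Stratix's exact implementation may differ.

```python
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: count runs of vowels as syllables.
    # Real readability tools use dictionary-based syllabification.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text: str) -> float:
    # Flesch Reading Ease: higher scores mean easier text.
    # 206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (syllables / n_words)
```

Short, common-word sentences score high; dense biomedical prose of the kind this benchmark elicits scores far lower, which is consistent with a mid-range average like the 51.0 reported below.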
Analysis
Key takeaways:
While GPT-5 demonstrates knowledgeable and sophisticated natural language generation in the biology/medicine domain, its ability to reason accurately and apply expert knowledge across diverse problem types, including both multiple-choice and open-ended questions, remains limited.
The model's performance is uneven across difficulty levels, and it tends to hallucinate information in an effort to fully answer the prompt; this overconfidence shows up both in the generated responses and in the confidence probabilities reported for each question.
Significant shortfalls in quantitative problem solving, especially on problems requiring multi-step reasoning, careful setup, and precise calculation, limit practical applicability.
The model needs better image interpretation: image-based questions were handled markedly worse than text-based questions.
Further work is needed to improve factual recall, logical reasoning, quantitative skills, visual interpretation, and calibration of confidence scores.
Failure modes observed
Common failure modes:
Inaccurate recall of specific biology/medicine facts, leading to incorrect answer choices or explanations.
Failure to correctly perform multi-step reasoning required to arrive at the correct answer, even with factual knowledge.
Inability to accurately abstract and set up quantitative problems, leading to calculation errors and incorrect final numerical answers.
Hallucinating steps or relationships in quantitative problems leading to incorrect setups and erroneous results.
Difficulty in applying knowledge across different contexts.
Poor or incorrect image recognition.
Overconfidence in incorrect answers, reflected both in high confidence scores despite errors and in confident, lengthy rationales for inaccurate answers.
Example: In the prompt on kidney disease histopathology, the model correctly identifies FSGS and nodular glomerulosclerosis but misapplies these diagnoses to the provided images, arriving at the wrong answer. Despite this, the model expresses 70% confidence, showcasing a breakdown in applying recalled knowledge to specific contextual details.
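The calibration gap described above can be quantified with expected calibration error (ECE), which bins answers by reported confidence and compares average confidence to empirical accuracy within each bin. The sketch below is a generic illustration of the technique, not Stratix's implementation:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: confidence-weighted gap between stated confidence and accuracy.

    confidences: per-question confidence in [0, 1]
    correct: per-question 0/1 correctness flags
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into top bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece
```

Under this definition, a model that reports 70% confidence on questions it answers correctly only 20% of the time contributes a 0.5 gap in that bin, matching the overconfidence pattern seen in the histopathology example.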
Secondary metrics
Readability score: 51.0
Toxicity score: 0.001
Ethics score: 0.000
Run this evaluation yourself
Stratix evaluates GPT-5 continuously across 11+ benchmarks. To replicate this Humanity's Last Exam evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
Source: Stratix evaluation 69004af5bf80df5bf71435b0. Updated 2025-11-02.