GPT-5 on Humanity's Last Exam: 21.7% accuracy

Author:

The LayerLens Team


The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.

Summary

GPT-5 from OpenAI scored 21.7% on Humanity's Last Exam, placing it in the top 10 (rank 8 of 97) on this benchmark. Despite the high ranking, this score falls in the weak band for Humanity's Last Exam and below the threshold for production reliance on this benchmark family; consider the model only for narrow, fully tested tasks.

Model details

  • Provider: OpenAI

  • Model key: openai/gpt-5

  • Context length: 400,000 tokens

  • License: Proprietary

  • Open weights: no

Benchmark methodology

Benchmark goal: Assess the LLM's ability to answer complex, knowledge-intensive questions in biology and medicine, requiring reasoning, synthesis of information, understanding of experimental design, and quantitative skills.

Scoring metrics:

  • Accuracy: Percentage of questions answered correctly.

  • Toxicity: Average toxicity score of the generated responses, as measured by a toxicity classifier.

  • Readability: Average Flesch-Kincaid readability score of the generated responses.

  • Hallucination: A binary flag indicating whether the answer contains hallucinated content.

  • Duration: Time the model takes to respond to a given question.
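For illustration, the accuracy and readability metrics above can be sketched in Python. This is a minimal sketch: the naive syllable counter and the function names are assumptions, not Stratix's actual implementation, though the Flesch reading-ease formula itself is standard.

```python
import re

def accuracy(predictions, references):
    """Fraction of questions answered correctly."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

def count_syllables(word):
    """Naive vowel-group syllable estimate (an assumption for this
    sketch, not a real linguistic syllabifier)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch reading ease:
    206.835 - 1.015*(words/sentences) - 84.6*(syllables/words).
    Higher scores indicate easier-to-read text."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / sentences)
            - 84.6 * (syllables / len(words)))

print(accuracy(["A", "C", "B"], ["A", "B", "B"]))  # 2 of 3 correct
print(flesch_reading_ease("The cat sat."))         # short words score high
```

Note that a reported score around 51 corresponds to fairly dense prose on the reading-ease scale; production tooling typically uses a library implementation rather than a hand-rolled syllable counter.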

Analysis

Key takeaways:

  • While GPT-5 demonstrates knowledgeable, sophisticated natural language generation in the biology/medicine domain, its ability to reason accurately and apply expert knowledge across diverse problem types, both multiple-choice and open-ended, remains limited.

  • The model's performance is uneven across difficulty levels, and it tends to hallucinate information in an effort to answer the prompt fully, producing overconfidence both in the generated responses and in the final probability assigned to each question.

  • Significant shortfalls in quantitative problem solving, especially where multi-step reasoning, problem setup, and precise calculation are required, limit practical applicability.

  • Image interpretation needs improvement: image-based questions were handled markedly worse than text-based ones.

  • Further work is needed to improve factual recall, logical reasoning, quantitative skills, visual interpretation, and calibration of confidence scores.
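The calibration issue flagged above can be quantified with expected calibration error (ECE). The following is a minimal sketch under the assumption that per-question confidences and correctness flags are available from the evaluation traces; the bin count and function name are illustrative, not Stratix's metric definition.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: per confidence bin, the gap |mean confidence - accuracy|,
    weighted by bin size. 0.0 means perfectly calibrated."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf=1.0 into top bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / total) * abs(avg_conf - acc)
    return ece

# An overconfident model: ~90% stated confidence, 25% actual accuracy.
print(expected_calibration_error([0.9, 0.95, 0.9, 0.85], [1, 0, 0, 0]))
```

A well-calibrated model would drive this toward zero; large values indicate exactly the pattern described above, where stated probabilities exceed realized accuracy.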

Failure modes observed

Common failure modes:

  • Inaccurate recall of specific biology/medicine facts, leading to incorrect answer choices or explanations.

  • Failure to correctly perform multi-step reasoning required to arrive at the correct answer, even with factual knowledge.

  • Inability to accurately abstract and set up quantitative problems, leading to calculation errors and incorrect final numerical answers.

  • Hallucinating steps or relationships in quantitative problems, leading to incorrect setups and erroneous results.

  • Difficulty in applying knowledge across different contexts.

  • Poor or incorrect image recognition.

  • Overconfidence in incorrect answers: high confidence scores despite errors, and confident, lengthy rationales generated for inaccurate answers.

Example: In the prompt on kidney disease histopathology, the model correctly identifies FSGS and nodular glomerulosclerosis but misapplies them to the provided images, arriving at the wrong answer. Despite this, the model expresses 70% confidence, showcasing a breakdown in applying recalled knowledge to specific contextual details.
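For a single question, the Brier score makes the cost of such overconfidence concrete: a wrong answer stated with 70% confidence contributes (0.7 − 0)² = 0.49 toward the worst-case value of 1.0. This is a hedged sketch; Stratix's actual calibration metric may differ.

```python
def brier_score(confidences, correct):
    """Mean squared gap between stated confidence and the 0/1 outcome;
    lower is better (0.0 = perfect, 1.0 = maximally miscalibrated)."""
    return sum((c - o) ** 2 for c, o in zip(confidences, correct)) / len(correct)

# The kidney-histopathology example: 70% confidence, wrong answer.
print(brier_score([0.7], [0]))  # ≈ 0.49
```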

Secondary metrics

  • Readability score: 51.0

  • Toxicity score: 0.001

  • Ethics score: 0.000

Run this evaluation yourself

Stratix evaluates GPT-5 continuously across 11+ benchmarks. To replicate this Humanity's Last Exam evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.

Source: Stratix evaluation 69004af5bf80df5bf71435b0. Updated 2025-11-02.