Llama 4 Scout on Humanity's Last Exam: 4.3% accuracy

Author: The LayerLens Team

The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.

Summary

Llama 4 Scout from Meta scored 4.3% on Humanity's Last Exam, ranking 76th of 97 models on this benchmark and landing in the weak band. That is below the threshold for production reliance on this benchmark family; consider the model only for narrow, fully tested tasks.

Model details

  • Provider: Meta

  • Model key: meta-llama/llama-4-scout

  • Context length: 172,000 tokens

  • License: Llama 4 Community License

  • Open weights: yes

Benchmark methodology

Benchmark goal: Evaluate the model's ability to perform tasks requiring expert-level knowledge and reasoning, including question answering, text generation, and logical inference.

Scoring metrics (the first two are sketched in code after this list):

  • Exact Match Accuracy: Percentage of questions answered correctly based on exact match to a reference answer.

  • Semantic Similarity Score: Evaluates the semantic similarity between the generated text and the reference answer, accounting for paraphrasing and near-synonyms.

  • Relevance Score: Assesses the relevance and coherence of generated text, penalizing extraneous information or inconsistencies.
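
Stratix does not publish its scorer implementations, so the snippet below is only a minimal sketch of how the first two metrics are commonly computed; the normalization rules and the embed function are assumptions on our part, not the production code.

```python
# Minimal sketch of plausible scorers; NOT the Stratix implementation.
# Normalization rules and the `embed` encoder are assumptions.
import math
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Share of predictions that equal the reference after normalization."""
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


def semantic_similarity(prediction: str, reference: str, embed) -> float:
    """Cosine similarity between embeddings; `embed` is any text -> vector encoder."""
    a, b = embed(prediction), embed(reference)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```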

Analysis

Key takeaways:

  • The model demonstrates a baseline level of understanding across diverse task types.

  • Error analysis indicates a tendency to generate overly verbose responses and to struggle with ambiguous prompts.

  • The results suggest the model is suited for tasks requiring broad knowledge recall, but it will likely need further fine-tuning to answer precisely while minimizing superfluous content.

Failure modes observed

Common failure modes:

  • Overly verbose and/or rambling answer generation.

  • Misinterpretation of ambiguous prompts or questions.

  • Minor irrelevancies in text generation tasks.

Example: For the prompt 'Explain the water cycle', the model generated a response that opened with an unnecessary introduction about the importance of water to living beings.
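
A simple length-ratio heuristic is one way to surface this verbosity failure mode in your own traces. The 3x threshold below is an arbitrary illustrative choice, not a documented Stratix rule.

```python
def is_overly_verbose(prediction: str, reference: str, ratio: float = 3.0) -> bool:
    """Flag answers whose word count far exceeds the reference's.

    The 3x threshold is an illustrative assumption, not a Stratix rule.
    """
    return len(prediction.split()) > ratio * max(len(reference.split()), 1)


# A rambling water-cycle answer like the example above would be flagged:
rambling = "Water is vital to every living being on Earth. " * 10 + \
           "It evaporates, condenses, and precipitates."
concise = "Water evaporates, condenses into clouds, and returns as precipitation."
print(is_overly_verbose(rambling, concise))  # True
```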

Secondary metrics

  • Readability score: 43.4

  • Toxicity score: 0.002

  • Ethics score: 0.000
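
Stratix does not document the scales for these secondary metrics. If the readability figure is a Flesch Reading Ease score (an assumption on our part), 43.4 corresponds to fairly dense, college-level prose. One way to compute that metric with the textstat package:

```python
# Assumes the readability score is Flesch Reading Ease computed via textstat --
# an assumption, since Stratix does not document the metric or its scale.
import textstat

sample = ("Evaporation lifts water into the atmosphere, where it condenses "
          "into clouds and returns to the surface as precipitation.")
# Flesch Reading Ease: higher = easier; scores in the low 40s read as college-level.
print(textstat.flesch_reading_ease(sample))
```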

Run this evaluation yourself

Stratix evaluates Llama 4 Scout continuously across 11+ benchmarks. To replicate this Humanity's Last Exam evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
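
Replication through Stratix runs in its UI. If you want a rough stand-alone baseline instead, the sketch below scores exact-match accuracy against any OpenAI-compatible endpoint that serves the meta-llama/llama-4-scout key. The base URL, API key, and questions.jsonl file are placeholders you must supply, and this simplified loop will not reproduce Stratix's exact pipeline or scores.

```python
# Rough stand-alone baseline, NOT the Stratix pipeline. Expects a JSONL file
# of {"question": ..., "answer": ...} records; BASE_URL and API key are
# placeholders for whichever provider serves Llama 4 Scout for you.
import json

from openai import OpenAI

client = OpenAI(base_url="https://YOUR-PROVIDER/v1", api_key="YOUR_KEY")

hits = total = 0
with open("questions.jsonl") as f:
    for line in f:
        item = json.loads(line)
        resp = client.chat.completions.create(
            model="meta-llama/llama-4-scout",
            messages=[
                {"role": "system", "content": "Answer with the final answer only."},
                {"role": "user", "content": item["question"]},
            ],
        )
        pred = (resp.choices[0].message.content or "").strip().lower()
        hits += pred == item["answer"].strip().lower()
        total += 1

print(f"Exact-match accuracy: {hits / total:.1%}")
```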

_Source: Stratix evaluation 681184185a4cd16846884d33. Updated 2025-04-30._