Llama 4 Maverick on Humanity's Last Exam: 6.2% accuracy

Author:

The LayerLens Team


The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.

Summary

Llama 4 Maverick from Meta scored 6.2% on Humanity's Last Exam, placing it near the bottom of the 97 models ranked on this benchmark. This puts the model in the weak band for Humanity's Last Exam: below the threshold for production reliance on this benchmark family. Consider it only for narrow, fully tested tasks.

Model details

  • Provider: Meta

  • Model key: meta-llama/llama-4-maverick

  • Context length: 131,072 tokens

  • License: Llama 4

  • Open weights: yes

Benchmark methodology

Benchmark goal: To assess the reasoning and knowledge capabilities of large language models, specifically Meta's Llama 4 Maverick, in the Biology/Medicine domain, focusing on complex problem-solving skills and adherence to safety guidelines.

Scoring metrics:

  • Score: The average score across all prompts, indicating the model's accuracy in answering benchmark questions.

  • Toxicity Score: The average toxicity score across all prompts, reflecting the presence of harmful or inappropriate content in the model's responses. Lower scores indicate better safety.

  • Readability Score: The average readability score across all prompts, measuring the text difficulty of the model's responses.

  • Duration: The average number of conversational turns taken by the LLM per prompt.
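As a rough illustration, each summary metric above is an average of per-prompt values from a benchmark run. The sketch below shows that aggregation; the record fields and the `summarize` helper are illustrative assumptions, not the actual Stratix schema.

```python
from statistics import mean

def summarize(results):
    """Average per-prompt metrics across a benchmark run.

    `results` is a list of per-prompt records; the field names here
    ("score", "toxicity", "readability") are hypothetical stand-ins
    for whatever the evaluation harness actually records.
    """
    return {
        "score": mean(r["score"] for r in results),
        "toxicity": mean(r["toxicity"] for r in results),
        "readability": mean(r["readability"] for r in results),
    }

# Two made-up prompt results, averaged into run-level metrics.
run = [
    {"score": 0.0, "toxicity": 0.001, "readability": 50.0},
    {"score": 1.0, "toxicity": 0.003, "readability": 46.6},
]
print(summarize(run))
```

Each reported number (e.g. the 0.002 toxicity score) is therefore a mean over every prompt in the run, so a handful of bad responses can move it only slightly.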

Analysis

Key takeaways:

  • The model struggles to reason through multi-step biology problems.

  • The model struggles to adhere to formatting requests.

  • The model shows improved toxicity and safety scores compared to other models.

Failure modes observed

Common failure modes:

  • Incorrect answers caused by breakdowns in multi-step logical reasoning.

  • Hallucinations in output, generating information not present in the context materials.

  • Failure to adhere to requested output formatting.

  • Low scores across many prompt types, despite improved safety metrics.

Example: While the model attempts to follow the prompt formatting (explanation, exact answer, confidence), it often fails to arrive at the correct answer, especially on questions requiring multi-step reasoning.
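The three-field response format described above (explanation, exact answer, confidence) can be checked mechanically, which is also how formatting failures surface as a distinct failure mode. The parser below is a minimal sketch: the field labels mirror the format described here, but the exact parsing rules are an illustrative assumption, not the benchmark's grader.

```python
import re

# Match the three labeled fields in order; DOTALL lets the
# explanation span multiple lines.
PATTERN = re.compile(
    r"Explanation:\s*(?P<explanation>.*?)\s*"
    r"Exact Answer:\s*(?P<answer>.*?)\s*"
    r"Confidence:\s*(?P<confidence>\d+)",
    re.DOTALL | re.IGNORECASE,
)

def parse_response(text):
    """Return the three fields, or None if the model broke the format."""
    m = PATTERN.search(text)
    if m is None:
        # A formatting failure, one of the failure modes listed above.
        return None
    return {
        "explanation": m.group("explanation"),
        "answer": m.group("answer"),
        "confidence": int(m.group("confidence")),
    }
```

Under this sketch, a well-formed but wrong response still parses cleanly, which matches the observation that the model often follows the format yet misses the answer.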

Secondary metrics

  • Readability score: 48.3

  • Toxicity score: 0.002

  • Ethics score: 0.000

Run this evaluation yourself

Stratix evaluates Llama 4 Maverick continuously across 11+ benchmarks. To replicate this Humanity's Last Exam evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.

_Source: Stratix evaluation 67f9aad93e080510298d503b. Updated 2025-04-21._