
Llama 4 Maverick on Humanity's Last Exam: 6.2% accuracy
Author: The LayerLens Team
Last updated: 2025-04-21
The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.
Summary
Llama 4 Maverick from Meta scored 6.2% on Humanity's Last Exam, placing it at rank 98 of 97 on this benchmark and in the weak band for this evaluation. The result is below the threshold for production reliance on this benchmark family; consider the model only for narrow, fully tested tasks.
Model details
Provider: Meta
Model key: meta-llama/llama-4-maverick
Context length: 131,072 tokens
License: Llama 4
Open weights: yes
Benchmark methodology
Benchmark goal: To assess the reasoning and knowledge capabilities of large language models, specifically Meta's Llama 4 Maverick, in the Biology/Medicine domain, focusing on complex problem-solving skills and adherence to safety guidelines.
Scoring metrics:
Score: The average score across all prompts, indicating the model's accuracy in answering benchmark questions.
Toxicity Score: The average toxicity score across all prompts, reflecting the presence of harmful or inappropriate content in the model's responses. Lower scores indicate better safety.
Readability Score: The average readability of responses, reflecting how difficult the text is to read.
Duration: The number of conversational turns taken by the LLM. (A sketch of how these per-prompt metrics aggregate into run-level scores follows this list.)
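For concreteness, here is a minimal sketch of how per-prompt results could be averaged into the run-level metrics above. The record field names ("score", "toxicity", "readability", "turns") are illustrative assumptions, not Stratix's actual schema.

```python
# Minimal sketch: averaging per-prompt records into run-level metrics.
# Field names are assumptions for illustration, not the Stratix schema.
from statistics import mean

def aggregate(records: list[dict]) -> dict:
    """Average per-prompt metrics across an evaluation run."""
    return {
        "score": mean(r["score"] for r in records),              # accuracy per prompt
        "toxicity_score": mean(r["toxicity"] for r in records),  # lower is safer
        "readability_score": mean(r["readability"] for r in records),
        "duration": mean(r["turns"] for r in records),           # turns per prompt
    }

if __name__ == "__main__":
    demo = [
        {"score": 0.0, "toxicity": 0.001, "readability": 47.9, "turns": 1},
        {"score": 1.0, "toxicity": 0.003, "readability": 48.7, "turns": 1},
    ]
    print(aggregate(demo))
```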
Analysis
Key takeaways:
The model frequently failed to reason through multi-step biology problems.
The model struggles to adhere to requested output formats.
The model shows improved toxicity and safety results compared with other models.
Failure modes observed
Common failure modes:
Incorrect answers caused by breakdowns in multi-step logical reasoning.
Hallucinated output: generated information not present in the context materials.
Failure to adhere to requested output formatting.
Scores remain low across many prompt types, although safety metrics did improve.
Example: While the model attempts to follow the prompt's required structure (explanation, exact answer, confidence), it often fails to arrive at the correct answer, especially on questions requiring multi-step reasoning. A sketch for checking that structure appears below.
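Adherence to the required structure can be checked mechanically. Below is a minimal sketch of such a check; the exact section labels are assumptions based on the format described above, not the actual Humanity's Last Exam grading harness.

```python
# Minimal sketch: verify a response contains the expected sections
# (explanation, exact answer, confidence) in order. Labels are
# assumed for illustration, not the official HLE grader.
import re

REQUIRED_SECTIONS = [
    r"Explanation\s*:",
    r"Exact Answer\s*:",
    r"Confidence\s*:\s*\d{1,3}\s*%?",
]

def follows_format(response: str) -> bool:
    """True if every required section label appears, in order."""
    pos = 0
    for pattern in REQUIRED_SECTIONS:
        match = re.search(pattern, response[pos:], flags=re.IGNORECASE)
        if match is None:
            return False
        pos += match.end()
    return True

good = "Explanation: enzyme X limits flux. Exact Answer: 42 Confidence: 80%"
bad = "The answer is probably 42."
print(follows_format(good), follows_format(bad))  # True False
```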
Secondary metrics
Readability score: 48.3 (see the readability sketch after this list)
Toxicity score: 0.002
Ethics score: 0.000
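The report does not state which readability metric is used. As one illustration, Flesch reading ease (computed here with the `textstat` package) produces scores on a comparable 0-100 scale, where a value near 48 corresponds to fairly difficult, college-level text.

```python
# Minimal sketch: one common readability metric. This is an assumed
# stand-in; the report does not name the metric Stratix uses.
import textstat

response = (
    "The enzyme catalyzes the rate-limiting step of glycolysis, "
    "so allosteric inhibition lowers flux through the pathway."
)
print(textstat.flesch_reading_ease(response))  # higher = easier to read
```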
Run this evaluation yourself
Stratix evaluates Llama 4 Maverick continuously across 11+ benchmarks. To replicate this Humanity's Last Exam evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
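If you prefer to probe the model directly, a minimal sketch along these lines sends one HLE-style question to Llama 4 Maverick through any OpenAI-compatible endpoint. The base URL, API key, and question are placeholders, and this is not Stratix's harness; only the model key comes from this report.

```python
# Minimal sketch: query Llama 4 Maverick via an OpenAI-compatible
# endpoint. base_url and api_key are hypothetical placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://example-inference-host/v1",  # hypothetical endpoint
    api_key="YOUR_KEY",
)

question = "Which enzyme catalyzes the rate-limiting step of glycolysis?"
resp = client.chat.completions.create(
    model="meta-llama/llama-4-maverick",
    messages=[
        {"role": "system",
         "content": "Answer with an explanation, an exact answer, and a confidence score."},
        {"role": "user", "content": question},
    ],
)
print(resp.choices[0].message.content)
```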
_Source: Stratix evaluation 67f9aad93e080510298d503b. Updated 2025-04-21._