
Llama 4 Maverick on AIME 2025: 20.0% accuracy
Author:
The LayerLens Team
Last updated:
2025-05-20
The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.
Summary
Llama 4 Maverick from Meta scored 20.0% on AIME 2025, ranking 102nd of 140 models on this benchmark. That places the model in the weak band for AIME 2025: below the threshold for production reliance on this benchmark family, so consider it only for narrow, fully tested tasks.
Model details
Provider: Meta
Model key:
meta-llama/llama-4-maverick
Context length: 131,072 tokens
License: Llama 4
Open weights: yes
Benchmark methodology
Benchmark goal: AIME 2025 tests mathematical reasoning, problem-solving skill, and the ability to adhere to strict formatting constraints when answering competition mathematics problems.
Scoring metrics:
Score (Binary: 0 or 1): Indicates whether the model's final answer is correct and adheres to all specified formatting rules (a scoring sketch follows this list).
Duration: The time taken by the model to generate a response.
Toxicity Score: A score between 0 and 1 indicating the likelihood that the generated text is toxic.
Readability Score: A score evaluating the readability of the model's output.
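To make the binary Score metric concrete, here is a minimal Python sketch of how a harness might extract and grade an AIME answer. The \boxed{} convention, the integer fallback rule, and the function names are assumptions for illustration, not Stratix's actual parser.

```python
import re

def extract_answer(response: str) -> str | None:
    """Pull a final answer out of the model's response text.

    Hypothetical extraction rule: prefer the last \\boxed{...} value,
    then fall back to the last standalone 1-3 digit integer.
    """
    boxed = re.findall(r"\\boxed\{(\d{1,3})\}", response)
    if boxed:
        return boxed[-1]
    integers = re.findall(r"\b\d{1,3}\b", response)
    return integers[-1] if integers else None

def score(response: str, gold: int) -> int:
    """Binary score: 1 only if an answer was extracted, it is a valid
    AIME answer (an integer from 0 to 999), and it matches the key."""
    answer = extract_answer(response)
    if answer is None or not 0 <= int(answer) <= 999:
        return 0
    return int(int(answer) == gold)

print(score(r"Similar triangles give a ratio of 4/9, so \boxed{588}.", 588))  # 1
```

Under a rule like this, a correct derivation that ends without a parseable final answer scores 0, which is why format adherence matters as much as the mathematics.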
Analysis
Key takeaways:
The Llama 4 Maverick model demonstrates proficiency in certain mathematical problem-solving tasks, particularly those requiring algebraic manipulation and logical deduction.
The model struggles with complex geometry problems and those requiring deep understanding of number theory concepts.
Adherence to the specified output format is generally good, but occasional formatting errors forfeit points on otherwise correct answers.
Performance varies between the AIME 2025 Part I and Part II subsets; a per-subset aggregation sketch follows this list.
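The sketch below shows one way to surface that per-subset variability by grouping binary scores by subset label. The field names and the 4-correct/2-correct split between Part I and Part II are hypothetical; only the 20.0% overall accuracy (6 of 30 problems) comes from this report.

```python
from collections import defaultdict

def subset_accuracy(results: list[dict]) -> dict[str, float]:
    """Aggregate binary scores by benchmark subset. Each result dict is
    assumed to carry a 'subset' label and a 0/1 'score'; the field
    names are illustrative, not Stratix's schema."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for r in results:
        grouped[r["subset"]].append(r["score"])
    return {name: sum(s) / len(s) for name, s in grouped.items()}

# Each AIME exam has 15 problems, 30 total across Parts I and II.
results = (
    [{"subset": "AIME 2025 I", "score": 1}] * 4
    + [{"subset": "AIME 2025 I", "score": 0}] * 11
    + [{"subset": "AIME 2025 II", "score": 1}] * 2
    + [{"subset": "AIME 2025 II", "score": 0}] * 13
)
print(subset_accuracy(results))  # ≈ {'AIME 2025 I': 0.267, 'AIME 2025 II': 0.133}
```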
Failure modes observed
Common failure modes:
Incorrect application of formulas or theorems.
Algebraic or arithmetic errors in complex calculations.
Misinterpretation of problem statements or geometric diagrams.
Failure to adhere strictly to the specified output format, leading to scoring penalties.
Premature abandonment of otherwise correct solution paths after a calculation or logical error.
Example: In one geometry problem, the model correctly identified similar triangles but made an error in calculating the area of the heptagon due to misapplication of the area ratios and incorrect numerical computations.
Secondary metrics
Readability score: 63.1
Toxicity score: 0.003
Ethics score: 0.000
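For readers who want metrics of the same shape as these secondary scores, here is one plausible stand-in using the textstat and detoxify packages. Stratix's actual readability, toxicity, and ethics formulas are not documented here, so treat this as an approximation (the ethics metric is omitted).

```python
# pip install textstat detoxify
import textstat
from detoxify import Detoxify

def secondary_metrics(text: str) -> dict[str, float]:
    """Compute stand-in readability and toxicity scores for a model
    response. These are illustrative substitutes, not the formulas
    behind the report's numbers."""
    return {
        # Flesch reading ease: higher is easier; ~63 reads as "standard".
        "readability": textstat.flesch_reading_ease(text),
        # Detoxify returns probabilities in [0, 1]; near 0 means benign.
        "toxicity": float(Detoxify("original").predict(text)["toxicity"]),
    }

print(secondary_metrics("Let x = 3. Then x^2 + 1 = 10, so the answer is 010."))
```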
Run this evaluation yourself
Stratix evaluates Llama 4 Maverick continuously across 11+ benchmarks. To replicate this AIME 2025 evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
Source: Stratix evaluation 682be1dae457661947de836c. Updated 2025-05-20.