Llama 4 Scout on AIME 2025: 6.7% accuracy

Author: The LayerLens Team

Last updated: 2025-05-20

The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.

Summary

Llama 4 Scout from Meta scored 6.7% on AIME 2025, ranking 124th of 140 models on this benchmark and placing it in the weak band. That is below the threshold for production reliance on this benchmark family; consider the model only for narrow, fully tested tasks.

Model details

  • Provider: Meta

  • Model key: meta-llama/llama-4-scout (used in the query sketch after this list)

  • Context length: 172,000 tokens

  • License: Llama 4 Community License

  • Open weights: yes
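
If you want to poke at the model directly before trusting any aggregate number, you can query it through an OpenAI-compatible gateway that serves the model key above. The base URL and API key below are placeholders, and whether your gateway exposes this key under this name is an assumption:

```python
# pip install openai
from openai import OpenAI

# Placeholder endpoint and key: point these at whichever
# OpenAI-compatible gateway serves this model for you.
client = OpenAI(base_url="https://your-gateway.example/v1", api_key="YOUR_KEY")

resp = client.chat.completions.create(
    model="meta-llama/llama-4-scout",
    messages=[{"role": "user", "content": "What is 17 + 25? Answer with just the number."}],
)
print(resp.choices[0].message.content)
```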

Benchmark methodology

Benchmark goal: The benchmark assesses the model's ability to solve competition mathematics problems from the American Invitational Mathematics Examination (AIME), whose answers are integers from 0 to 999.

Scoring metrics:

  • Score: Binary score (0 or 1) indicating whether the model's answer matches the ground truth, taking LaTeX formatting into account (a minimal matching sketch follows this list).

  • Duration: Time taken to generate the answer, measured in seconds.

  • Toxicity Score: Toxicity score on a scale from 0 to 1, indicating the presence of inappropriate content.

  • Readability Score: Flesch Reading Ease, which rates text on a 100-point scale; the higher the score, the easier the text is to read (the formula and a computation sketch follow this list).
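
Stratix does not publish its matcher, so the following is a minimal sketch of what LaTeX-aware exact matching can look like. The normalization rules here (unwrapping \boxed{}, stripping math delimiters, canonicalizing leading zeros) are illustrative assumptions, not the production logic:

```python
import re

def normalize_answer(ans: str) -> str:
    """Illustrative normalization: unwrap \\boxed{}, strip math
    delimiters and whitespace, and canonicalize leading zeros so
    '\\boxed{070}' and '70' compare equal."""
    ans = ans.strip()
    boxed = re.search(r"\\boxed\{([^{}]*)\}", ans)
    if boxed:
        ans = boxed.group(1)
    ans = re.sub(r"\\\(|\\\)|\\\[|\\\]|[\$\s]", "", ans)
    if ans.isdigit():
        ans = str(int(ans))  # AIME answers are integers 0-999
    return ans

def score(model_answer: str, ground_truth: str) -> int:
    """Binary score: 1 on an exact normalized match, else 0."""
    return int(normalize_answer(model_answer) == normalize_answer(ground_truth))

assert score(r"\boxed{070}", "70") == 1
```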
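
Flesch Reading Ease is a standard readability formula, not something Stratix-specific:

    FRE = 206.835 - 1.015 * (total words / total sentences) - 84.6 * (total syllables / total words)

Scores of 60-70 correspond to plain English, so the 59.3 reported under Secondary metrics below sits just under that band. To reproduce the number, the open-source textstat package implements the same formula (whether Stratix uses it internally is an assumption we cannot confirm):

```python
# pip install textstat
import textstat

explanation = "We apply the shoelace formula, then simplify the resulting sum."
print(textstat.flesch_reading_ease(explanation))  # higher = easier to read
```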

Analysis

Key takeaways:

  • The model handles some problems that yield to direct, step-by-step computation.

  • It frequently struggles with problems requiring complex geometric insight or advanced combinatorial arguments.

  • Answer formatting requirements significantly affect scored performance: under binary exact matching, a correct value in the wrong format scores 0.

Failure modes observed

Common failure modes:

  • Incorrect application of formulas (e.g., area of a trapezoid).

  • Errors in algebraic manipulation and simplification.

  • Failure to account for all constraints in combinatorial problems.

  • Misinterpretation of geometric properties or symmetry.

  • Incorrect evaluation of sums and products, especially telescoping products (a worked example follows this list).

  • Failure to adhere to the specified answer format, despite otherwise correct reasoning.
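
To make the telescoping failure mode concrete, consider an illustrative example (not drawn from the AIME 2025 set). In the product

    (1/2) * (2/3) * (3/4) * ... * ((n-1)/n) = 1/n

each numerator cancels the denominator of the factor before it, so only the first numerator and the last denominator survive. Dropping or double-counting one of those boundary factors yields 1/(n-1) or 1/(n+1) instead, which is exactly the kind of boundary error this failure mode describes.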

Example: In the heptagon area problem, the model correctly works through many of the initial steps and shows a solid grasp of the setup, but misinterprets the area relationships and ultimately derives a final answer of 720 when the correct answer is 588.

Secondary metrics

  • Readability score: 59.3

  • Toxicity score: 0.003

  • Ethics score: 0.000

Run this evaluation yourself

Stratix evaluates Llama 4 Scout continuously across 11+ benchmarks. To replicate this AIME 2025 evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
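
For readers who want the shape of such a run without the platform, here is a minimal, self-contained sketch of an AIME-style harness. Here `query_model` is a hypothetical stand-in for your inference client, and `normalize` is a compact version of the matching logic sketched in the methodology section; none of this is Stratix's actual code:

```python
import re
import time
from dataclasses import dataclass
from statistics import mean

@dataclass
class Problem:
    prompt: str
    answer: str  # ground truth, an integer 0-999 as a string

def normalize(ans: str) -> str:
    """Compact version of the illustrative matcher above."""
    boxed = re.search(r"\\boxed\{([^{}]*)\}", ans)
    if boxed:
        ans = boxed.group(1)
    ans = re.sub(r"[\$\s]", "", ans)
    return str(int(ans)) if ans.isdigit() else ans

def query_model(prompt: str) -> str:
    """Hypothetical stand-in: replace with your inference client."""
    raise NotImplementedError

def evaluate(problems: list[Problem]) -> dict:
    scores, durations = [], []
    for p in problems:
        start = time.perf_counter()
        raw = query_model(p.prompt + "\n\nGive your final answer as \\boxed{N}.")
        durations.append(time.perf_counter() - start)
        scores.append(int(normalize(raw) == normalize(p.answer)))
    return {"accuracy": mean(scores), "mean_duration_s": mean(durations)}
```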

_Source: Stratix evaluation 682be1e213dbeaa3dad2e6cd. Updated 2025-05-20._