Llama 4 Scout on SWE-bench Lite (SWE-agent): 4.0% accuracy

Author:

The LayerLens Team


The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.

Summary

Llama 4 Scout from Meta scored 4.0% on SWE-bench Lite (SWE-agent), ranking 37th of 45 models on this benchmark. This places the model in the weak band for SWE-bench Lite (SWE-agent): below the threshold for production reliance on this benchmark family, it should be considered only for narrow, fully tested tasks.

Model details

  • Provider: Meta

  • Model key: meta-llama/llama-4-scout

  • Context length: 172,000 tokens

  • License: Llama 4

  • Open weights: yes

Benchmark methodology

Benchmark goal: SWE-bench Lite is a 300-instance subset of SWE-bench that tests whether a model can resolve real GitHub issues drawn from popular open-source Python repositories. In this configuration the model operates through the SWE-agent scaffold, which lets it navigate the repository, edit files, and run commands before submitting a patch.

Scoring metrics:

  • Resolve rate: the number of task instances whose submitted patch passes the issue's previously failing tests without breaking the existing test suite, divided by the total number of task instances.
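The resolve-rate metric reduces to a simple ratio. The sketch below is a minimal illustration with an arbitrary example count, not the official SWE-bench harness or its result schema:

```python
# Minimal sketch of the resolve-rate metric; the "resolved" field name
# is illustrative, not the official SWE-bench harness schema.

def resolve_rate(results):
    """Fraction of task instances whose patch resolved the issue."""
    if not results:
        return 0.0
    return sum(1 for r in results if r["resolved"]) / len(results)

# Arbitrary example: 2 resolved instances out of 50 attempted.
runs = [{"resolved": True}] * 2 + [{"resolved": False}] * 48
print(f"{resolve_rate(runs):.1%}")  # 4.0%
```

A patch counts as resolved only if the evaluation tests pass after it is applied; partial fixes score zero, which is why agent benchmarks produce much lower numbers than single-function code benchmarks.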

Analysis

Key takeaways:

  • The model exhibits a low success rate on the benchmark, with only 2 out of 60 attempts resulting in a successful resolution.

  • While capable of making isolated, correct code modifications for specific, well-defined problems, the model severely struggles with tasks requiring broader system understanding, dependency management, and debugging.

  • A significant portion of failures stem from basic Python syntax errors, incorrect API usage, or environment misconfigurations.

  • The model lacks effective self-correction and verification mechanisms.

  • Improvements are needed in understanding error messages, debugging complex library interactions, and generating syntactically and functionally correct code.

Failure modes observed

Common failure modes:

  • NameError and ModuleNotFoundError: Frequent occurrences when attempting to import modules or access variables.

  • TypeError and AttributeError: Pervasive across various libraries, indicating struggles with correct API usage.

  • SyntaxError: Multiple instances suggest fundamental issues in generating syntactically correct Python code.

  • IndentationError: Encountered in Python code modifications.

  • Inability to Self-Correct/Verify: The model often failed to re-run modified tests successfully.

  • Misinterpretation of context/requirements: Proposed solutions did not align with expected behavior.
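For readers less familiar with the Python exceptions listed above, the snippet below reproduces each failure class in isolation. These are deliberately trivial triggers, not excerpts from the model's actual trajectories:

```python
# Reproduce the common failure classes listed above, in isolation.
# Purely illustrative; not taken from the evaluation traces.

def trigger(exc_type, fn):
    """Call fn, catch the expected exception, and report it."""
    try:
        fn()
    except exc_type as e:
        return f"{exc_type.__name__}: {e}"

print(trigger(NameError, lambda: undefined_variable))                   # name not defined
print(trigger(ModuleNotFoundError, lambda: __import__("no_such_pkg")))  # missing module
print(trigger(TypeError, lambda: len(42)))                              # wrong API usage
print(trigger(AttributeError, lambda: "text".push("x")))                # nonexistent method
print(trigger(SyntaxError, lambda: compile("def f(:", "<s>", "exec")))
print(trigger(IndentationError, lambda: compile("def f():\nreturn 1", "<s>", "exec")))
```

Each of these is caught immediately by Python's interpreter or compiler, which is why a capable agent should detect and repair them by reading the traceback from its own test runs.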

Example: In the BoundWidget.id_for_label task, the model initially attempts a fix that causes a TypeError due to incorrect handling of keyword arguments and then fails to resolve environmental dependencies.
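The keyword-argument error class behind that TypeError is easy to reproduce in isolation. The sketch below is a hypothetical minimal reconstruction; the method name mimics the task, but this is neither Django's code nor the model's actual patch:

```python
# Hypothetical minimal reproduction of the keyword-argument TypeError
# pattern; NOT the model's actual edit to Django's BoundWidget.

class Widget:
    def id_for_label(self, index=0):
        return f"id_field_{index}"

w = Widget()
print(w.id_for_label(index=1))  # accepted keyword works: id_field_1

try:
    # Passing a keyword the signature does not accept raises the same
    # class of TypeError described in the trajectory above.
    w.id_for_label(idx=1)
except TypeError as e:
    print(f"TypeError: {e}")
```

Changing a method's signature without updating its call sites (or vice versa) produces exactly this failure, which is one reason the broader-system-understanding gap noted in the analysis is so costly on agentic tasks.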

Secondary metrics

  • Readability score: -19.5

  • Toxicity score: 0.003

  • Ethics score: 0.000

Run this evaluation yourself

Stratix evaluates Llama 4 Scout continuously across 11+ benchmarks. To replicate this SWE-bench Lite (SWE-agent) evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.

_Source: Stratix evaluation 691511a271b302ff237b24d5. Updated 2025-11-13._