
Llama 4 Scout on SWE-bench Lite (SWE-agent): 4.0% accuracy
Author:
The LayerLens Team
The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.
Summary
Llama 4 Scout from Meta scored 4.0% on SWE-bench Lite (SWE-agent), ranking 37th of 45 models on this benchmark. That places the model in the weak band for SWE-bench Lite (SWE-agent), below the threshold for production reliance on this benchmark family; consider it only for narrow, fully tested tasks.
Model details
Provider: Meta
Model key:
meta-llama/llama-4-scout
Context length: 172,000 tokens
License: Llama 4
Open weights: yes
Benchmark methodology
Benchmark goal: The benchmark aims to evaluate how well Large Language Models (LLMs) can resolve real-world software engineering tasks. Each instance is a GitHub issue from a popular open-source Python repository, and the model must produce a patch that fixes it. In the SWE-agent configuration, the model works through an agent scaffold that lets it browse the repository, edit files, and run commands before submitting its patch.
Scoring metrics:
Resolution rate: Calculated as the number of instances whose submitted patch passes the repository's test suite, divided by the total number of evaluated instances.
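For illustration, the sketch below computes such a resolution rate from per-instance outcomes. The result record format is an assumption for this example, not the Stratix trace schema.

```python
# Minimal sketch: compute a SWE-bench-style resolution rate from
# per-instance outcomes. The InstanceResult format is an assumption,
# not the actual Stratix trace schema.
from dataclasses import dataclass


@dataclass
class InstanceResult:
    instance_id: str
    resolved: bool  # True if the generated patch passed the repo's tests


def resolution_rate(results: list[InstanceResult]) -> float:
    """Fraction of evaluated issues whose patch passed the test suite."""
    if not results:
        return 0.0
    return sum(r.resolved for r in results) / len(results)


# Toy example: 1 resolved instance out of 4 evaluated -> 25.0%
demo = [
    InstanceResult("repo__repo-101", resolved=True),
    InstanceResult("repo__repo-102", resolved=False),
    InstanceResult("repo__repo-103", resolved=False),
    InstanceResult("repo__repo-104", resolved=False),
]
print(f"{resolution_rate(demo):.1%}")
```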
Analysis
Key takeaways:
The model exhibits a low success rate on the benchmark, with only 2 out of 60 attempts resulting in a successful resolution.
While capable of making isolated, correct code modifications for specific, well-defined problems, the model severely struggles with tasks requiring broader system understanding, dependency management, and debugging.
A significant portion of failures stem from basic Python syntax errors, incorrect API usage, or environment misconfigurations.
The model lacks effective self-correction and verification mechanisms (see the sketch after this list).
Improvements are needed in understanding error messages, debugging complex library interactions, and generating syntactically and functionally correct code.
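To make the self-correction point concrete, here is a minimal, hypothetical sketch of a verify-then-retry loop, assuming candidate fixes are unified diffs applied with git apply; it is not SWE-agent's actual mechanism.

```python
# Hypothetical sketch of a self-verification loop: apply a candidate patch,
# re-run the tests, and only keep the patch if they pass. Assumes the current
# directory is a git checkout and patches are unified diff files.
import subprocess


def apply_patch(patch_file: str) -> bool:
    """Apply a unified diff with git; return True on success."""
    return subprocess.run(["git", "apply", patch_file]).returncode == 0


def revert_patch(patch_file: str) -> None:
    """Undo a previously applied diff."""
    subprocess.run(["git", "apply", "--reverse", patch_file])


def run_tests(test_cmd: list[str]) -> bool:
    """Return True if the test command exits cleanly."""
    return subprocess.run(test_cmd, capture_output=True).returncode == 0


def verify_and_retry(candidate_patches: list[str], test_cmd: list[str]) -> str | None:
    """Keep the first candidate patch that both applies and passes the tests."""
    for patch in candidate_patches:
        if not apply_patch(patch):
            continue          # patch did not even apply; try the next candidate
        if run_tests(test_cmd):
            return patch      # verified fix
        revert_patch(patch)   # undo the failed edit before trying the next one
    return None               # no candidate survived verification
```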
Failure modes observed
Common failure modes:
NameError and ModuleNotFoundError: Frequent occurrences when attempting to import modules or access variables.
TypeError and AttributeError: Pervasive across various libraries, indicating struggles with correct API usage.
SyntaxError: Multiple instances suggest fundamental issues in generating syntactically correct Python code.
IndentationError: Encountered in Python code modifications.
Inability to self-correct/verify: The model often failed to re-run modified tests successfully.
Misinterpretation of context/requirements: Proposed solutions did not align with expected behavior.
Example: In the BoundWidget.id_for_label task, the model initially attempts a fix that causes a TypeError due to incorrect handling of keyword arguments and then fails to resolve environmental dependencies.
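As a generic illustration of that failure class (not the model's actual patch from the trace), calling a function with a keyword argument it does not accept raises exactly this kind of TypeError:

```python
# Illustrative only: the class of keyword-argument mistake that raises a
# TypeError; this is not the actual BoundWidget.id_for_label patch.
def id_for_label(id_: str) -> str:
    """Simplified stand-in for a widget label-ID helper."""
    return f"id_{id_}"


print(id_for_label("choice_0"))            # OK: 'id_choice_0'
print(id_for_label("choice_0", attrs={}))  # TypeError: unexpected keyword argument 'attrs'
```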
Secondary metrics
Readability score: -19.5
Toxicity score: 0.003
Ethics score: 0.000
Run this evaluation yourself
Stratix evaluates Llama 4 Scout continuously across 11+ benchmarks. To replicate this SWE-bench Lite (SWE-agent) evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
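Outside of Stratix, predictions on SWE-bench Lite are commonly scored with the open-source swebench evaluation harness. The sketch below assumes that harness and its standard predictions format (instance_id, model_name_or_path, model_patch); the instance ID, paths, and exact flag names are placeholders, so check the swebench documentation for your version.

```python
# Hedged sketch: write predictions in the standard SWE-bench format, then
# invoke the open-source evaluation harness. Instance ID, patch contents,
# and harness flags below are assumptions, not outputs from this evaluation.
import json

predictions = [
    {
        "instance_id": "repo__repo-12345",                # placeholder instance ID
        "model_name_or_path": "meta-llama/llama-4-scout",
        "model_patch": "diff --git a/...",                 # the model's proposed diff
    },
]

with open("preds.json", "w") as f:
    json.dump(predictions, f)

# Then run the harness (assumed invocation; flags may differ by version):
#   python -m swebench.harness.run_evaluation \
#       --dataset_name princeton-nlp/SWE-bench_Lite \
#       --predictions_path preds.json \
#       --run_id llama4-scout-lite
```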
Source: Stratix evaluation 691511a271b302ff237b24d5. Updated 2025-11-13.