Llama 4 Scout on Terminal-Bench (Terminus-1): 8.8% accuracy

Author: The LayerLens Team

The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.

Summary

Llama 4 Scout from Meta scored 8.8% on Terminal-Bench (Terminus-1), ranking 64th of 77 models on this benchmark. That score falls in the weak band for Terminal-Bench (Terminus-1), below the threshold for production reliance on this benchmark family; consider the model only for narrow, fully tested tasks.
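For readers who want to reproduce the banding logic locally, the sketch below shows one way to map a score and leaderboard rank to a qualitative band. The cut-off values and the `classify_band` helper are illustrative assumptions, not Stratix's published thresholds.

```python
# Illustrative banding sketch -- thresholds are assumptions, not Stratix's published cut-offs.

def classify_band(score_pct: float,
                  weak_max: float = 30.0,
                  moderate_max: float = 60.0) -> str:
    """Map a benchmark accuracy (0-100) to a qualitative band."""
    if score_pct < weak_max:
        return "weak"
    if score_pct < moderate_max:
        return "moderate"
    return "strong"

def percentile_rank(rank: int, total: int) -> float:
    """Percentage of evaluated models that this model outranks (rank 1 = best)."""
    return (total - rank) / total * 100

if __name__ == "__main__":
    score = 8.8           # Terminal-Bench (Terminus-1) accuracy, in percent
    rank, total = 64, 77  # leaderboard position reported above
    print(f"band={classify_band(score)}, "
          f"outranks {percentile_rank(rank, total):.1f}% of models")
```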

Model details

  • Provider: Meta

  • Model key: meta-llama/llama-4-scout

  • Context length: 172,000 tokens

  • License: Llama 4 Community License

  • Open weights: yes

Benchmark methodology

Secondary metrics

  • Readability score: 0.0

  • Toxicity score: 0.000

  • Ethics score: 0.000

Run this evaluation yourself

Stratix evaluates Llama 4 Scout continuously across more than 11 benchmarks. To replicate this Terminal-Bench (Terminus-1) evaluation against your own model, your own traces, or a different benchmark configuration, open the model in Stratix.
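If you export per-task results from a Terminal-Bench run, a short script can recompute the headline accuracy yourself. The result-file layout below (one JSON file per task with a boolean `passed` field) is a hypothetical example for illustration, not the actual Terminal-Bench or Stratix output schema.

```python
# Hypothetical aggregation script: recompute benchmark accuracy from per-task results.
# The results directory layout and the "passed" field are illustrative assumptions,
# not the real Terminal-Bench or Stratix output format.
import json
from pathlib import Path

def accuracy(results_dir: str) -> float:
    """Return the percentage of tasks whose result file marks them as passed."""
    files = sorted(Path(results_dir).glob("*.json"))
    if not files:
        raise ValueError(f"no result files found in {results_dir}")
    passed = 0
    for path in files:
        with path.open() as fh:
            record = json.load(fh)
        passed += bool(record.get("passed", False))
    return 100.0 * passed / len(files)

if __name__ == "__main__":
    # Roughly 9% accuracy corresponds to about 1 passed task in every 11.
    print(f"{accuracy('results/terminus-1/llama-4-scout'):.1f}% of tasks passed")
```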

_Source: Stratix evaluation 6900b652fb9a09026f1efd4f. Updated 2025-10-28._