
Llama 4 Maverick on Terminal-Bench (Terminus-1): 8.8% accuracy
Author: The LayerLens Team
The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.
Summary
Llama 4 Maverick from Meta scored 8.8% on Terminal-Bench (Terminus-1), ranking 63rd of 77 models on this benchmark. That places it in the weak band for Terminal-Bench (Terminus-1), below the threshold for production reliance on this benchmark family. Consider it only for narrow, fully tested tasks.
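To put the rank of 63 out of 77 in perspective, the short sketch below converts a leaderboard rank into the share of ranked models that sit below it. It is only an illustration: the helper name `fraction_outranked` is hypothetical, and it assumes rank 1 is the best score and that there are no ties, neither of which is stated in the report.

```python
def fraction_outranked(rank: int, total: int) -> float:
    """Share of ranked models placed below `rank`.

    Assumption (not from the report): rank 1 is the best score and ranks are unique.
    """
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    return (total - rank) / total


# Rank and leaderboard size as quoted in the summary above.
share = fraction_outranked(63, 77)
print(f"{share:.1%}")  # prints "18.2%"
```

Under those assumptions, 14 of the 77 ranked models score lower, which is consistent with the "weak band" placement described in the summary.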
Model details
Provider: Meta
Model key: meta-llama/llama-4-maverick
Context length: 131,072 tokens
License: Llama 4
Open weights: yes
Benchmark methodology
Secondary metrics
Readability score: 31.6
Toxicity score: 0.002
Ethics score: 0.000
Run this evaluation yourself
Stratix evaluates Llama 4 Maverick continuously across 11+ benchmarks. To replicate this Terminal-Bench (Terminus-1) evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.
_Source: Stratix evaluation 68f59133da58e05b72452e5c. Updated 2025-10-20._