Claude Opus 4.6 on Terminal-Bench (Terminus-2): 58.8% accuracy

Author: The LayerLens Team

Last updated: 2026-02-06

The LayerLens Team covers AI model evaluations, benchmark analysis, and the evolving landscape of AI performance. For the latest independent evaluation data, explore Stratix.

Summary

Claude Opus 4.6 from Anthropic scored 58.8% on Terminal-Bench (Terminus-2), placing it first of the 21 models evaluated on this benchmark. This puts the model in the frontier band for Terminal-Bench (Terminus-2) and makes it a reasonable default for terminal-centric agent workloads, though the failure modes documented below still warrant task-level verification in high-stakes routing.

Model details

  • Provider: Anthropic

  • Model key: anthropic/claude-opus-4.6

  • Context length: 200,000 tokens

  • License: Proprietary

  • Open weights: no

Benchmark methodology

Benchmark goal: Evaluate the ability of LLM-driven agents to complete realistic end-to-end tasks in a sandboxed terminal environment, spanning tool use, system configuration, software builds, debugging, and data processing. Each task pairs a natural-language instruction with a containerized environment and an automated verification script; the model interacts with the terminal through the Terminus-2 agent harness.

Scoring metrics:

  • Task outcome: Each task is scored pass/fail by its verification script; partial or incorrectly formatted solutions count as failures.

  • Accuracy (Overall): The percentage of tasks in the suite whose verification checks pass.
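As a concrete illustration, the sketch below shows how this per-task pass/fail scoring aggregates into the headline accuracy figure. It assumes a simple result record per task; the field names are illustrative, not Stratix's actual trace schema.

    from dataclasses import dataclass

    @dataclass
    class TaskResult:
        task_id: str
        passed: bool  # True iff the task's verification script succeeded

    def pass_rate(results: list[TaskResult]) -> float:
        """Percentage of tasks whose automated checks passed."""
        if not results:
            return 0.0
        solved = sum(r.passed for r in results)
        return 100.0 * solved / len(results)

    # Toy run: 2 of 3 tasks solved -> 66.7%
    demo = [
        TaskResult("build-kernel", False),
        TaskResult("configure-nginx", True),
        TaskResult("parse-logs", True),
    ]
    print(f"{pass_rate(demo):.1f}%")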

Analysis

Key takeaways:

  • Claude Opus 4.6 posts a solid overall pass rate of 65.38% on this benchmark, indicating a strong foundation in understanding and executing a variety of practical terminal tasks.

  • The model performs well on tasks requiring straightforward application of tools, system configuration, and data management.

  • Areas for improvement lie in complex algorithmic implementations, nuanced debugging of pre-existing code, precise text/data extraction, and scenarios requiring deep understanding of system internals or obscure contexts (e.g., QDP file parsing, custom kernel building).

Failure modes observed

Common failure modes:

  • Misinterpretation of task requirements leading to incorrect output formats or incomplete solutions.

  • Lack of detailed algorithmic implementation for complex scenarios, resulting in non-working code or partial solutions.

  • Difficulty in debugging subtle software integration issues or unexpected behavior from libraries/systems.

  • Inability to generate or verify specific outputs when external system interaction is implied.

Example: on the task 'Train a roberta-base model on the RTE dataset using the UPET method', the model scored 0. It likely failed to configure the training environment or to run training under the specified constraints (1 epoch, 5 examples per label, specific hyperparameters), and so never produced the expected accuracy report. A baseline-style sketch of the task setup follows.
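Purely for scale, the sketch below shows the kind of setup this task demands, using the standard Hugging Face Trainer on the public glue/rte dataset. The UPET-specific training logic and the task's exact hyperparameters are not reproduced here; all values shown are illustrative assumptions.

    from datasets import load_dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    model = AutoModelForSequenceClassification.from_pretrained(
        "roberta-base", num_labels=2)

    rte = load_dataset("glue", "rte")

    def tokenize(batch):
        # RTE is a sentence-pair entailment task.
        return tokenizer(batch["sentence1"], batch["sentence2"],
                         truncation=True, padding="max_length", max_length=128)

    rte = rte.map(tokenize, batched=True)

    # Few-shot regime from the task description: 5 examples per label.
    train = rte["train"].shuffle(seed=42)
    few_shot = []
    for label in (0, 1):
        few_shot.extend(
            [i for i, l in enumerate(train["label"]) if l == label][:5])
    train_small = train.select(few_shot)

    # Illustrative hyperparameters, not the task's specified values.
    args = TrainingArguments(output_dir="out", num_train_epochs=1,
                             per_device_train_batch_size=4, learning_rate=2e-5)
    Trainer(model=model, args=args, train_dataset=train_small,
            eval_dataset=rte["validation"]).train()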

Secondary metrics

  • Readability score: 0.0

  • Toxicity score: 0.000

  • Ethics score: 0.000

Run this evaluation yourself

Stratix evaluates Claude Opus 4.6 continuously across 11+ benchmarks. To replicate this Terminal-Bench (Terminus-2) evaluation on your own model, traces, or a different benchmark configuration, open the model in Stratix.

_Source: Stratix evaluation 69863670d5968947e86ac1d3. Updated 2026-02-06._