Introducing TerminalWorld

Today we’re sharing TerminalWorld, a benchmark for evaluating AI agents on terminal tasks, the kind of work developers actually do every day.

Why we built this

Existing terminal benchmarks are mostly hand-crafted by researchers. That’s fine for controlled evaluation, but it means the tasks reflect what researchers think developers do, not what they actually do. We wanted to find out: how well do AI agents perform on the real thing?

So instead of writing tasks from scratch, we built a data engine that learns from developers themselves.

From 80,870 recordings to 1,530 tasks

The pipeline has four stages:

Collect: start with 80,870 raw asciinema recordings, filter out noise (PII, TUI interactions, low-signal sessions), and keep 9,492 high-quality recordings
Synthesize: an LLM processes all 9,492 recordings, distilling each noisy transcript into an outcome-oriented instruction and a clean reference solution
Reproduce: an LLM agent reverse-engineers the Docker environment each task needs, using runtime failures as feedback, successfully reproducing 5,035 environments
Validate: the pipeline generates test assertions from pre/post execution states and runs three rounds of validation, confirming 1,530 tasks
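
To make the Validate stage more concrete, here is a minimal sketch of how assertions could be derived from pre/post execution states. It is an illustration, not the pipeline's actual code: the snapshot format (a mapping from file path to content hash), the `Assertion` type, and the functions `derive_assertions` and `passes` are all hypothetical names chosen for this example.

```python
from typing import NamedTuple, Optional


class Assertion(NamedTuple):
    kind: str                      # "hash_equals" or "absent" (hypothetical kinds)
    path: str
    expected: Optional[str] = None


def derive_assertions(pre: dict, post: dict) -> list:
    """Turn the difference between two {path: content_hash} snapshots into checks."""
    assertions = []
    # Files that were created or modified must end up with the post-run hash.
    for path in sorted(post):
        if pre.get(path) != post[path]:
            assertions.append(Assertion("hash_equals", path, post[path]))
    # Files that existed before but not after must be gone.
    for path in sorted(pre.keys() - post.keys()):
        assertions.append(Assertion("absent", path))
    return assertions


def passes(state: dict, assertions: list) -> bool:
    """Check a final state against the assertions, ignoring how it was reached."""
    for a in assertions:
        if a.kind == "hash_equals" and state.get(a.path) != a.expected:
            return False
        if a.kind == "absent" and a.path in state:
            return False
    return True


# Toy snapshots with made-up paths and hashes, taken before and after the reference solution.
pre_state = {"/app/config.yaml": "a1", "/tmp/build.log": "b2"}
post_state = {"/app/config.yaml": "c3", "/app/dist/app.tar.gz": "d4"}

checks = derive_assertions(pre_state, post_state)
print(passes(post_state, checks))  # the reference end state satisfies its own checks
print(passes(pre_state, checks))   # the untouched starting state does not
```

Unchanged files produce no assertion in this sketch, so an agent is only held to the parts of the state that the reference solution actually changed.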

After all that: 1,530 validated tasks across 18 categories, covering everything from containers and CI/CD to ML training and performance optimization. We also manually reviewed 200 of these tasks, cross-verified by four authors, to create a human-verified subset.
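
The stage counts above imply how much each step filters. The short calculation below uses only the four numbers quoted in this post; the stage labels and the `retention` helper are names made up for the example.

```python
# Counts quoted in this post, in pipeline order.
stages = [
    ("raw recordings", 80_870),
    ("high-quality recordings", 9_492),
    ("reproduced environments", 5_035),
    ("validated tasks", 1_530),
]


def retention(stages):
    """Return (stage name, fraction kept relative to the previous stage)."""
    return [
        (name, count / prev_count)
        for (_, prev_count), (name, count) in zip(stages, stages[1:])
    ]


for name, rate in retention(stages):
    print(f"{name}: {rate:.1%} of previous stage")
# high-quality recordings: 11.7% of previous stage
# reproduced environments: 53.0% of previous stage
# validated tasks: 30.4% of previous stage

overall = stages[-1][1] / stages[0][1]
print(f"overall: {overall:.1%}")  # overall: 1.9%
```

Roughly one raw recording in fifty ends up as a validated task, with the initial quality filter removing the largest share.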

What we found

We evaluated eight frontier models using the Terminus-2 agent framework. The best result: Claude Opus 4.7 at 62.5% pass rate. The average across all models was 54.8%.

A few things surprised us:

Real-world tasks are genuinely hard. Models that score over 80% on Terminal-Bench drop to around 53% on TerminalWorld. The correlation between the two benchmarks is only 0.20. They’re measuring different things.

Performance varies a lot by domain. Models handle environment setup well (87.5%) but struggle with performance optimization (28.1%) and scripting (39.1%).
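
A per-domain breakdown like this can be computed from per-task outcomes. The sketch below assumes a hypothetical result format (one record per task with `category` and `passed` fields); the records are toy placeholders, not TerminalWorld results, and the real output format of the evaluation harness may differ.

```python
from collections import defaultdict


def pass_rate_by_category(results):
    """Aggregate per-task outcomes into a pass rate per category."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for record in results:
        totals[record["category"]] += 1
        passed[record["category"]] += int(record["passed"])
    return {category: passed[category] / totals[category] for category in totals}


# Toy records for illustration only; field names and values are assumptions.
toy_results = [
    {"task_id": "t1", "category": "environment setup", "passed": True},
    {"task_id": "t2", "category": "environment setup", "passed": True},
    {"task_id": "t3", "category": "scripting", "passed": True},
    {"task_id": "t4", "category": "scripting", "passed": False},
]

for category, rate in sorted(pass_rate_by_category(toy_results).items()):
    print(f"{category}: {rate:.1%}")
```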

Agents find their own paths. When agents do succeed, they often use completely different commands from the reference solution, with only 21.4% command overlap on average. The test suites check final state, not the path taken.
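
This post does not spell out how command overlap is measured. One plausible reading, shown here purely as an assumption, is the Jaccard similarity between the sets of command names used by the agent and by the reference solution. The helper names and both example scripts are made up for illustration.

```python
import shlex


def command_names(script: str) -> set:
    """Collect the first token of each non-empty, non-comment line.

    Simplification: pipelines and `&&` chains count only their first command.
    """
    names = set()
    for line in script.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        tokens = shlex.split(line)
        if tokens:
            names.add(tokens[0])
    return names


def command_overlap(reference: str, attempt: str) -> float:
    """Jaccard similarity of command-name sets (an assumed definition)."""
    ref, att = command_names(reference), command_names(attempt)
    union = ref | att
    return len(ref & att) / len(union) if union else 0.0


# Two hypothetical scripts that produce the same archive by different routes.
reference_solution = """\
mkdir -p build
cp src/app.py build/
tar -czf app.tar.gz build
"""
agent_solution = """\
install -D src/app.py build/app.py
tar -czf app.tar.gz build
"""

print(command_overlap(reference_solution, agent_solution))  # 0.25
```

Both scripts would satisfy a test that checks only for the final archive, even though they share just one of four distinct commands.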

More tokens do not mean better results. Failed attempts consume 3.3x more tokens than successful ones.

How to use it

The dataset is on Hugging Face. Evaluation runs through Harbor, the same harness used by Terminal-Bench. See the Quickstart for step-by-step instructions.

If you run your model on TerminalWorld, we’d love to add your results to the leaderboard. Open a pull request on GitHub with your results.

What’s next

TerminalWorld is designed to grow. The pipeline can re-run as asciinema accumulates new recordings, keeping the benchmark aligned with how developers actually work. We’re planning to expand the dataset and continue growing the leaderboard as more models are evaluated.

The full paper is on arXiv.