NEW · arXiv 2605.22535

TerminalWorld

Real-World Tasks. Real Impact.

Benchmark AI agents on the terminal workflows developers run every day, kept live as practices evolve.


Full Benchmark: 1,530 tasks

Human Verified: 200 tasks

Task categories: 18

Unique commands: 1,280

Leaderboard

Updated May 21, 2026

1. Claude Opus 4.7 (Anthropic): 62.5%
2. Kimi K2.6 (Moonshot AI): 57.5%
3. GLM 5.1 (Z.ai): 57.0%
4. Gemini 3.1 Pro (Google): 55.0%
5. Qwen3.6-Max-Preview (Alibaba): 54.0%
6. GPT-5.5 (OpenAI): 53.5%
7. DeepSeek-V4-Pro (DeepSeek): 50.0%
8. MiniMax M2.7 (MiniMax): 49.0%

Terminus-2 agent · 200 verified tasks
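
Since the leaderboard is reported over the 200 verified tasks, each percentage maps to a whole number of tasks. The following is a minimal sketch of that conversion, assuming each score is the share of the 200 tasks solved (the page does not spell out the exact metric). The helper name `solved_tasks` is hypothetical and used only for illustration.

```python
# Assumption: each leaderboard score is the share of the 200 verified tasks solved.
VERIFIED_TASKS = 200

# Scores copied from the leaderboard above (percent).
LEADERBOARD = {
    "Claude Opus 4.7": 62.5,
    "Kimi K2.6": 57.5,
    "GLM 5.1": 57.0,
    "Gemini 3.1 Pro": 55.0,
    "Qwen3.6-Max-Preview": 54.0,
    "GPT-5.5": 53.5,
    "DeepSeek-V4-Pro": 50.0,
    "MiniMax M2.7": 49.0,
}


def solved_tasks(score_percent: float, total: int = VERIFIED_TASKS) -> int:
    """Convert a percentage score to the implied task count (illustrative helper)."""
    return round(score_percent * total / 100)


for model, score in LEADERBOARD.items():
    print(f"{model:<22} {score:>5.1f}%  ~{solved_tasks(score)} / {VERIFIED_TASKS}")
```

Under this assumption, a 0.5-point difference between two models corresponds to a single task, which is worth keeping in mind when comparing adjacent ranks.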

Task Coverage & Examples

18 categories spanning the full breadth of real terminal work.


Scripting & Automation

350
Example · Very Hard · Verified
asciinema ↗

#739272

Demonstrate that multiple programming languages are installed and functional on the system by running a "Hello World" program in each language and collecting…

apt-get · python3 · hola_cpp · javac · java

Software Build & Test

212
Example · Very Hard
asciinema ↗

#359207

Compile and install the Janus WebRTC gateway along with all required dependencies from source. The build environment is a Debian 10 system with the necessary…

git · cmake · make · meson · ninja

System Administration

191
Example · Medium · Verified
asciinema ↗

#473888

Install and configure an OpenSSH server so that it listens on port 3000 instead of the default port 22. SELinux must be disabled (set to `disabled` in…

sudo · grep

Containers & Orchestration

184
Example · Hard · Verified
asciinema ↗

#139853

Deploy an event stream analytics pipeline in an OpenShift cluster. When complete, the following resources must exist: 1. A `ConfigMap` named…

oc

Security

126
Example · Medium · Verified
asciinema ↗

#448247

Verify the authenticity and integrity of the Asus KGPE-D16 Dasharo Release v0.1.0 firmware image. This involves obtaining the appropriate GPG keys from the…

gpg · wget · sha256sum

Environment Setup

100
Example · Hard
asciinema ↗

#366394

Set up the `aafm` (Automated Analysis of Feature Models) framework from the `diverso-lab` project. Clone the `core`, `fm_metamodel`, and `pysat_metamodel`…

git · python3 · pip · cp

Version Control

70
Example · Hard · Verified
asciinema ↗

#694892

The repository at `/app` has `git nomad` installed and configured to simulate two hosts: `desktop` and `laptop`. Each invocation of `git nomad sync`…

git

Database Operations

52
Example · Medium
asciinema ↗

#542219

Set up a CockroachDB role hierarchy and produce a verification file at `/app/result.txt`. A CockroachDB cluster is accessible at…

cockroach

Data Analysis

39
Example · Hard · Verified
asciinema ↗

#241711

Download James Joyce's *Ulysses* from Project Gutenberg (`http://www.gutenberg.org/files/4300/4300-0.txt`) and perform an n-gram frequency analysis on its…

wget · rm · tr · ngram · sort

Debugging & Testing

36
Example · Hard
asciinema ↗

#224933

Investigate why a Docker Swarm service fails to deploy and document your findings. A service named `alertmanager` has been created using the…

docker · alertmanager · service · head · inspect

Networking

34
Example · Medium
asciinema ↗

#583258

Configure a network namespace named `runc` and connect it to a host bridge using a virtual ethernet (veth) pair. The host must have an active bridge named…

ip · brctl

Scientific Computing

30
Example · Hard · Verified
asciinema ↗

#347571

Use TACT (Taxonomic Addition for Complete Trees) via its Docker image to process phylogenetic data for the Carangaria fish group. Example input files…

curl · docker · create · tact_build_taxonomic_tree · tact_add_taxa

File & Storage

29
Example · Hard · Verified
asciinema ↗

#299387

Using Kopia, back up the directory `~/Projects/Kopia` to a Google Cloud Storage bucket named `kopia-demo-1`. Create a Kopia repository in that bucket,…

gsutil · kopia · uuidgen · rm

Cloud & Infrastructure

26
Example · Medium
asciinema ↗

#417298

Provision the Google Cloud infrastructure required to bootstrap a Kubernetes cluster from scratch. This involves creating a custom VPC network named…

gcloud

ML Training & Experiments

18
Example · Hard
asciinema ↗

#668460

Clone `https://github.com/saforem2/ezpz` into `/app/ezpz` and `https://github.com/saforem2/wordplay` into `/app/wordplay`. Set up a Python virtual environment…

git · python3 · pip · mpirun

Deployment & CI/CD

14
Example · Hard
asciinema ↗

#169462

The environment at `/app` is configured with a YourBase (`yb`) project for the `hamilton` service. Use the `yb` CLI to capture service logs and performance…

yb · curl · git

Performance Optimization

14
Example · Medium
asciinema ↗

#692136

Measure end-to-end Kafka message latency in two configurations and save the results to output files in `/app/`. The environment in `/app/` includes a Docker…

docker · kafka-topics · kafka-run-class · physical-kafka · via-gateway

Media Processing

5
Example · Medium
asciinema ↗

#104869

Convert a set of stereo WAV files to mono using FFmpeg. The audio files are located in an archive at `/tmp/original_wav.tar.gz`. Extract the archive and…

tar · find · ffmpeg · mv · sort
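
The per-category counts listed above can be tabulated to see how the full benchmark is distributed. This is a small sketch that copies the counts from this page and computes each category's share. The dictionary name `CATEGORY_COUNTS` is only an illustrative choice.

```python
# Per-category task counts, copied from the category cards above.
CATEGORY_COUNTS = {
    "Scripting & Automation": 350,
    "Software Build & Test": 212,
    "System Administration": 191,
    "Containers & Orchestration": 184,
    "Security": 126,
    "Environment Setup": 100,
    "Version Control": 70,
    "Database Operations": 52,
    "Data Analysis": 39,
    "Debugging & Testing": 36,
    "Networking": 34,
    "Scientific Computing": 30,
    "File & Storage": 29,
    "Cloud & Infrastructure": 26,
    "ML Training & Experiments": 18,
    "Deployment & CI/CD": 14,
    "Performance Optimization": 14,
    "Media Processing": 5,
}

total = sum(CATEGORY_COUNTS.values())

# Print categories from largest to smallest with their share of the total.
for name, count in sorted(CATEGORY_COUNTS.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<28} {count:>4}  {count / total:6.1%}")
print(f"{'Total':<28} {total:>4}")
```

The 18 counts sum to 1,530, matching the size of the full benchmark, so each task appears to be assigned to exactly one category.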

How We Built It

A reverse-engineering pipeline from 80,870 raw recordings to 1,530 automatically validated tasks and 200 human-verified tasks.

01 Recording Collection: 80,870 asciinema recordings, downloaded via public asciinema links.

02 Recording Filtering: 9,492 filtered recordings, after excluding PII, inaccessible resources, TUI/GUI sessions, and low-quality recordings.

03 Task Synthesis: 9,492 tasks with synthesized instructions and reference solutions, LLM-distilled from recording transcripts.

04 Environment Reproduction: 5,035 tasks with reproduced environments, validated by replaying recordings.

05 Test Generation: 1,530 tasks with test suites, validated via three execution trials. These form the Full Benchmark.

06 Human Verification: 200 human-verified tasks, manually executed and cross-reviewed by the authors. These form the Verified Subset.
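
The pipeline counts above can be read as a funnel. The sketch below computes, for each stage, the fraction retained from the previous stage and from the original 80,870 recordings. The function name `retention` is hypothetical, and the page does not say whether the 200 verified tasks were filtered or sampled from the 1,530, so the last ratio is best read as a size comparison only.

```python
# Stage names and counts copied from the pipeline description above.
STAGES = [
    ("Recording Collection", 80_870),
    ("Recording Filtering", 9_492),
    ("Task Synthesis", 9_492),
    ("Environment Reproduction", 5_035),
    ("Test Generation", 1_530),
    ("Human Verification", 200),  # may be a sampled subset, not a filter
]


def retention(stages):
    """Return (name, count, share of previous stage, share of first stage)."""
    first = stages[0][1]
    prev = first
    rows = []
    for name, count in stages:
        rows.append((name, count, count / prev, count / first))
        prev = count
    return rows


rows = retention(STAGES)
for name, count, vs_prev, vs_first in rows:
    print(f"{name:<26} {count:>6,}  {vs_prev:7.1%} of previous  {vs_first:7.2%} of raw")
```

Task Synthesis retains every filtered recording (9,492 of 9,492), so the drops happen at filtering, environment reproduction, and test generation.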

TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks

arXiv 2605.22535 · 2026

Questions or collaboration? [email protected]