ECHO: Terminal Agents Learn World Models for Free

Co-written by Dimitris Papailiopoulos and Vaishnavi Shrivastava.

<aside> 🤖

We taught CLI agents to predict terminal responses during RL, alongside the usual GRPO loss on actions. The change is tiny: same rollout, same forward pass, but stop masking out terminal-output tokens. The effect is huge: all evals improve, and the resulting models measurably learn how the terminal behaves.

CLI agents can learn a terminal model for free — and use it to act better!

This is ECHO, a hybrid objective that trains on both sides of the interaction: what the agent writes and what the terminal writes back.

</aside>

If you’re too busy to read this whole post, here’s what we found:

Standard agent RL throws away the environment’s response. GRPO trains on action tokens and masks out terminal responses, even though they are already in context, already pass through the model, and are a ground-truth signal of how the agent’s actions affected the environment.
ECHO fixes this by training on both sides of the interaction. It keeps the usual GRPO loss on action tokens and adds a simple environment cross-entropy loss on terminal-output tokens. It’s a few lines of code on top of any GRPO trainer: same rollout and forward pass, just a different mask over the logits.
ECHO works, and it’s free! ECHO improves Qwen3-8B, OpenThinker-Agent-v1-SFT, and Qwen3-14B across every benchmark we tested. ECHO also trains up to 2.3× faster to the same performance. TerminalBench-2.0 pass@1 nearly doubles at both 8B (2.7 → 5.2) and 14B (5.2 → 10.8).
ECHO teaches terminal dynamics! On held-out trajectories, environment-token cross-entropy drops sharply with ECHO and barely moves with plain GRPO. This is direct evidence that ECHO teaches the model how the terminal actually responds. The same checkpoints that better predict terminal outputs also solve more tasks.
ECHO can substitute for an expert teacher. From a base Qwen3-8B with no expert demonstrations, ECHO nearly matches what GRPO after SFT on expert demonstrations achieves.
ECHO lets agents self-improve without verifier rewards! With the GRPO term dropped and no verifier rewards at all, the ECHO loss alone lets the agent keep improving just by acting in the environment and predicting what happens.
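
To make the "different mask" idea concrete, here is a minimal sketch of a hybrid loss in plain Python. It is an illustration under stated assumptions, not our training code: the function names, the `env_weight` coefficient, and the way the two terms are averaged and summed are all hypothetical, and the policy term is a simplified advantage-weighted log-likelihood that leaves out GRPO's ratio clipping and any KL penalty.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: normalize each rollout's reward
    by the mean and standard deviation of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def echo_loss(logprobs, is_action, is_env, advantages, env_weight=1.0):
    """Hybrid loss sketch (hypothetical helper, not the authors' implementation).

    logprobs[i][t]  : log-prob the model assigns to token t of rollout i
    is_action[i][t] : 1 if the agent wrote token t, else 0
    is_env[i][t]    : 1 if the terminal wrote token t, else 0
    advantages[i]   : one scalar advantage per rollout
    env_weight      : assumed mixing coefficient for the environment term
    """
    pg_sum, pg_count = 0.0, 0
    ce_sum, ce_count = 0.0, 0
    for lp_row, act_row, env_row, adv in zip(
        logprobs, is_action, is_env, advantages
    ):
        for lp, a, e in zip(lp_row, act_row, env_row):
            if a:
                # Simplified policy-gradient term on action tokens
                # (no clipping or KL, unlike full GRPO).
                pg_sum += -adv * lp
                pg_count += 1
            elif e:
                # Cross-entropy on terminal-output tokens: the tokens a
                # standard trainer would mask out of the loss.
                ce_sum += -lp
                ce_count += 1
    policy_loss = pg_sum / max(pg_count, 1)
    env_loss = ce_sum / max(ce_count, 1)
    return policy_loss + env_weight * env_loss, policy_loss, env_loss


# Toy example: two rollouts, each with one action token and one terminal token.
logprobs = [[-1.0, -2.0], [-0.5, -3.0]]
is_action = [[1, 0], [1, 0]]
is_env = [[0, 1], [0, 1]]
advantages = group_advantages([1.0, 0.0])
total, policy_loss, env_loss = echo_loss(logprobs, is_action, is_env, advantages)
```

In this sketch, setting `env_weight=0.0` recovers the usual masked setup, where terminal-output tokens contribute nothing to the loss. Tokens with both masks at 0 (for example, the task prompt) are ignored by both terms.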

This started as a simple question: if every command produces a terminal response, why does RL train only on the command?

Vaish did all the work to figure it out. I contributed a silly maze experiment*, a strong opinion on the title, and saying "holy shit" when she showed me the first result. Thanks to Ahmed Awadallah for giving us space — and GPUs — to chase ideas like this, even when they start as just a research itch.

Worth noting: the first cluster run for ECHO launched March 29 😊

This work was done at AI Frontiers, a boutique research lab inside Microsoft Research.

How do you continually learn?

This idea first came up in response to a simple question about self-improvement and continual learning: how does an agent get better just from acting in the world?

Vaish and I had been talking since last fall about self-improvement in CLI agents, i.e., what it would mean to get better from interacting with the environment (aka the terminal), especially without a verifier.

Verifier-free RL is a problem people have worked on for years, and most of the attempts at it face the same issue: where does supervision come from, if there’s no reward?