
Co-written by Dimitris Papailiopoulos and Vaishnavi Shrivastava.

<aside> 🤖

We taught CLI agents to predict terminal responses during RL, alongside the usual GRPO loss on actions. The change is tiny: same rollout, same forward pass, but stop masking out terminal-output tokens. The effect is huge: all evals improve, and the resulting models measurably learn how the terminal behaves.

CLI agents can learn a terminal model for free — and use it to act better!

This is ECHO, a hybrid objective that trains on both sides of the interaction: what the agent writes and what the terminal writes back.

</aside>
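To make "stop masking out terminal-output tokens" concrete, here is a minimal sketch of the bookkeeping. It is not our training code. The names (`Token`, `hybrid_loss`, `terminal_weight`) are made up for illustration, the policy term is a plain advantage-weighted log-prob surrogate rather than the full clipped GRPO objective, and the way the two terms are weighted is an assumption.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical role labels for tokens in one rollout.
PROMPT = "prompt"      # task / system prompt tokens (not trained on in this sketch)
ACTION = "action"      # tokens the agent wrote (the commands)
TERMINAL = "terminal"  # tokens the terminal wrote back


@dataclass
class Token:
    logprob: float  # log-probability the model assigned to this token
    role: str       # one of PROMPT, ACTION, TERMINAL


def hybrid_loss(
    tokens: List[Token],
    advantage: float,
    terminal_weight: float = 1.0,   # assumed knob, not a value from the post
    train_on_terminal: bool = True,
) -> float:
    """Scalar loss for one rollout, computed from a single forward pass."""
    action_lp = [t.logprob for t in tokens if t.role == ACTION]
    terminal_lp = [t.logprob for t in tokens if t.role == TERMINAL]

    # RL term: advantage-weighted mean log-prob of the agent's own tokens
    # (simplified stand-in for the GRPO loss on actions).
    pg = -advantage * sum(action_lp) / len(action_lp) if action_lp else 0.0

    # Prediction term: mean negative log-likelihood of terminal-output tokens.
    # Usual RL recipes mask these out, i.e. train_on_terminal=False.
    use_terminal = train_on_terminal and bool(terminal_lp)
    nll = -sum(terminal_lp) / len(terminal_lp) if use_terminal else 0.0

    return pg + terminal_weight * nll


rollout = [
    Token(-1.0, PROMPT),
    Token(-0.5, ACTION), Token(-1.5, ACTION),
    Token(-2.0, TERMINAL), Token(-4.0, TERMINAL),
]
masked = hybrid_loss(rollout, advantage=2.0, train_on_terminal=False)
hybrid = hybrid_loss(rollout, advantage=2.0, terminal_weight=0.5)
```

The only difference between `masked` and `hybrid` is whether the terminal tokens contribute. The rollout and the log-probs are the same in both cases.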


If you’re too busy to read this whole post, here’s what we found:

This started as a simple question: if every command produces a terminal response, why does RL train only on the command?

Vaish did all the work to figure it out. I contributed a silly maze experiment*, a strong opinion on the title, and a "holy shit" when she showed me the first result. Thanks to Ahmed Awadallah for giving us the space (and the GPUs) to chase ideas like this, even when they start as just a research itch.

Worth noting: the first cluster run for ECHO launched March 29 😊

This work was done at AI Frontiers, a boutique research lab inside Microsoft Research.

How do you continually learn?

This idea first came up while we were thinking about a simple question in self-improvement and continual learning: how does an agent get better just from acting in the world?

Vaish and I had been talking since last fall about self-improvement in CLI agents, i.e., what it would mean for an agent to get better just from interacting with its environment (aka the terminal), especially without a verifier.

Verifier-free RL is a problem people have worked on for years, and most attempts run into the same question: where does supervision come from if there’s no reward?