
Co-written by Dimitris Papailiopoulos and Vaishnavi Shrivastava.
<aside>
We taught CLI agents to predict terminal responses during RL, alongside the usual GRPO loss on actions. The change is tiny: same rollout, same forward pass, but stop masking out terminal-output tokens. The effect is huge: all evals improve, and the resulting models measurably learn how the terminal behaves.
CLI agents can learn a terminal model for free, and use it to act better!
This is ECHO: a hybrid objective that trains on both sides of the interaction, what the agent writes and what the terminal writes back.
</aside>
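
To make "stop masking out terminal-output tokens" concrete, here is a minimal sketch in plain Python. It is not the actual ECHO training code: the segment labels, the function names, the `terminal_weight` coefficient, and the choice to add a token-averaged prediction term to the action loss are all assumptions made for illustration. The per-token policy losses and negative log-likelihoods are assumed to come from the same forward pass over the rollout.

```python
from typing import List, Tuple

# Hypothetical role labels for the segments of one rollout (not names from the post).
PROMPT, ACTION, TERMINAL = "prompt", "action", "terminal"


def build_masks(segments: List[Tuple[str, int]]) -> Tuple[List[int], List[int]]:
    """Return (action_mask, terminal_mask) with one 0/1 entry per token.

    `segments` lists (role, num_tokens) pairs in rollout order.
    """
    action_mask: List[int] = []
    terminal_mask: List[int] = []
    for role, n in segments:
        action_mask.extend([1 if role == ACTION else 0] * n)
        terminal_mask.extend([1 if role == TERMINAL else 0] * n)
    return action_mask, terminal_mask


def masked_mean(values: List[float], mask: List[int]) -> float:
    """Average `values` over the positions where `mask` is 1 (0.0 if none)."""
    count = sum(mask)
    if count == 0:
        return 0.0
    return sum(v * m for v, m in zip(values, mask)) / count


def hybrid_loss(
    policy_loss_per_token: List[float],
    nll_per_token: List[float],
    segments: List[Tuple[str, int]],
    terminal_weight: float = 1.0,  # assumed knob; 0.0 recovers action-only training
) -> float:
    """Action loss on agent tokens plus a prediction loss on terminal tokens."""
    action_mask, terminal_mask = build_masks(segments)
    rl_term = masked_mean(policy_loss_per_token, action_mask)
    prediction_term = masked_mean(nll_per_token, terminal_mask)
    return rl_term + terminal_weight * prediction_term


# Toy rollout: 2 prompt tokens, 2 action tokens, 3 terminal-output tokens.
segments = [(PROMPT, 2), (ACTION, 2), (TERMINAL, 3)]
policy_losses = [0.0, 0.0, 1.0, 3.0, 0.0, 0.0, 0.0]
nlls = [9.0, 9.0, 9.0, 9.0, 0.5, 1.0, 1.5]

action_only = hybrid_loss(policy_losses, nlls, segments, terminal_weight=0.0)
with_terminal = hybrid_loss(policy_losses, nlls, segments, terminal_weight=0.5)
```

With `terminal_weight=0.0` the terminal tokens are ignored, which corresponds to the usual setup where they are masked out. Any positive weight lets the same rollout also supervise what the terminal writes back, without an extra forward pass.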

If you're too busy to read this whole post, here's what we found:
This started as a simple question: if every command produces a terminal response, why does RL train only on the command?
Vaish did all the work to figure it out. I contributed a silly maze experiment*, a strong opinion on the title, and saying "holy shit" when she showed me the first result. Thanks to Ahmed Awadallah for giving us space (and GPUs) to chase ideas like this, even when they start as just a research itch.
Worth noting: the first cluster run for ECHO launched March 29.
This work was done at AI Frontiers, a boutique research lab inside Microsoft Research.
This idea first came up through a simple question about self-improvement and continual learning: how does an agent get better just from acting in the world?
Vaish and I had been talking since last fall about self-improvement in CLI agents, i.e., what it would mean for an agent to get better from interacting with its environment (aka the terminal), especially without a verifier.
Verifier-free RL is a problem people have worked on for years, and most of the attempts at it face the same issue: where does supervision come from, if there's no reward?