fix: keep terminal mux persistent across navigation #325
Codex terminal/session staleness RCA findings
This is the consolidated read from multiple read-only subagents plus local code inspection and docs. Current conclusion: this is not one Codex flag problem. It is a layered terminal lifecycle issue, and Codex exposes it harder than Claude.
Most likely actual cause
AO is using That is fragile because:
This matches the visible behavior: after navigation or resize churn, zellij can have stale/ghost client state, bad size state, or a new attach that is not actually ready. Then xterm shows a distorted terminal, and typing appears dead.
Why input stops
The backend sends
So the UI can say “attached,” while input is not guaranteed to be wired yet.
Why AO thinks stale sessions are alive
AO liveness currently means “zellij session exists,” not “Codex process is alive”:
So if Codex exits or hangs, zellij can still be alive. AO sees “alive,” the UI keeps showing the terminal, but the actual Codex agent may be gone or wedged.
Why Codex breaks more than Claude
Claude has better lifecycle semantics:
Codex is more fragile here because:
So Claude can survive terminal churn better. Codex gets stuck because it is more sensitive to terminal client state and AO cannot accurately distinguish “Codex alive” from “zellij pane alive.”
Why previous fixes did not fully solve it
Confirmed related bugs/contributors found during investigation
External docs alignment
Recommended fix direction
Stop treating
Concrete short-term fixes:
Longer-term fix: use one stable backend terminal attachment per session and let frontend clients subscribe to it, instead of spawning new
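The longer-term direction above can be sketched as a small fan-out hub: one backend attachment per session, opened lazily, with any number of frontend subscribers. This is a minimal sketch under stated assumptions, not AO's implementation; `SessionHub`, `Backend`, `openBackend`, and the chunk-replay behavior are all hypothetical names and choices introduced for illustration.

```typescript
// Hypothetical sketch: SessionHub, Backend and openBackend are illustrative
// names, not part of AO's codebase.
type Listener = (chunk: string) => void;

interface Backend {
  write(data: string): void;
  close(): void;
}

class SessionHub {
  private readonly listeners = new Set<Listener>();
  private readonly scrollback: string[] = [];
  private readonly openBackend: (onOutput: Listener) => Backend;
  private readonly maxScrollback: number;
  private backend: Backend | null = null;

  constructor(openBackend: (onOutput: Listener) => Backend, maxScrollback = 1000) {
    this.openBackend = openBackend;
    this.maxScrollback = maxScrollback;
  }

  // Register a frontend client. The backend attachment is opened once, on
  // the first subscription, and reused for every later subscriber.
  subscribe(listener: Listener): () => void {
    if (this.backend === null) {
      this.backend = this.openBackend((chunk) => this.broadcast(chunk));
    }
    // Replay buffered output so a late subscriber does not start blank.
    for (const chunk of this.scrollback) listener(chunk);
    this.listeners.add(listener);
    // Unsubscribing removes the client but leaves the backend attached.
    return () => {
      this.listeners.delete(listener);
    };
  }

  // Forward client input to the single backend attachment, if it is open.
  write(data: string): void {
    this.backend?.write(data);
  }

  // Tear down the attachment when the session itself ends.
  dispose(): void {
    this.backend?.close();
    this.backend = null;
    this.listeners.clear();
  }

  private broadcast(chunk: string): void {
    this.scrollback.push(chunk);
    if (this.scrollback.length > this.maxScrollback) this.scrollback.shift();
    for (const listener of this.listeners) listener(chunk);
  }
}
```

The design choice illustrated here is that navigation only adds or removes listeners, so route switches never create or destroy a backend attach. Replaying raw chunks is a simplification; a real terminal would likely need a proper screen snapshot instead.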
Terminal attachment lifecycle getting out of sync with route/session switching
The current evidence points to the terminal viewer attachment going stale, not the underlying Codex/zellij session dying. The actual Codex/zellij sessions were still alive when checked through AO's zellij socket:
ZELLIJ_SOCKET_DIR=/tmp/ao-zellij-501 zellij list-sessions --no-formatting
Both sessions were active there. The misleading
The stale path is the viewer attachment chain:
The important mismatch is:
That matches the symptom: typing does not appear in the query panel. xterm does not locally echo input; typed characters only show when the backend attach accepts input and the terminal app redraws. If
Runtime evidence from the app showed short-lived mux sockets around navigation/session switching:
Process state also showed only one active frontend viewer attach at the time:
There was no matching active
So the current RCA is:
The daemon can correctly report the session as alive because it is alive. But the frontend terminal can still be detached because its
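Because xterm does not echo locally, keystrokes sent before the backend attach is actually ready are silently lost. One way to make that state explicit is an input gate that holds keystrokes until the attach is confirmed ready. This is a minimal sketch under assumptions: `InputGate`, `markReady`, and `markDetached` are hypothetical names, and buffering (rather than dropping or showing a "reconnecting" state) is only one possible policy.

```typescript
// Hypothetical sketch: InputGate is an illustrative name, not an AO class.
class InputGate {
  private ready = false;
  private pending: string[] = [];
  private readonly send: (data: string) => void;
  private readonly maxPending: number;

  constructor(send: (data: string) => void, maxPending = 256) {
    this.send = send;
    this.maxPending = maxPending;
  }

  // Called for every keystroke chunk from the terminal widget.
  push(data: string): void {
    if (this.ready) {
      this.send(data);
      return;
    }
    // Not wired yet: hold the input (bounded) instead of losing it silently.
    if (this.pending.length < this.maxPending) this.pending.push(data);
  }

  // Called once the backend confirms the attach can accept input.
  markReady(): void {
    this.ready = true;
    const queued = this.pending;
    this.pending = [];
    for (const data of queued) this.send(data);
  }

  // Called when the attachment closes, e.g. on a route or session switch.
  markDetached(): void {
    this.ready = false;
  }
}
```

With a gate like this, "session alive" and "viewer attached and ready for input" become two separate, observable states rather than one.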
Current RCA from terminal-flow logs
I checked the fresh diagnostic log after both Codex orchestrator sessions became stale-looking. The evidence points away from daemon/zellij process death and toward terminal-control bytes being forwarded as user input into Codex.
What the logs/process state show:
The strongest signal is the pane dump. Both Codex prompts had garbage text inserted:
That does not look like human input. It looks like leaked terminal-control / terminal-query response residue. The diagnostic log also shows a large input burst immediately after attach:
Most of those frames are 27/28-byte chunks, not typed user text. The likely flow is:
Why navigation/two sessions makes it worse: route switches close/reopen
Current conclusion: this is not primarily a dead daemon/session problem. The root issue appears to be that terminal-generated control-response bytes from the web terminal are being forwarded as interactive input through
Follow-up needed: classify the outbound
Update after the logging-led RCA and fix:
We added end-to-end terminal flow logging across the renderer, mux client, backend
Finding from the logs: the zellij sessions and daemon were not dying. The bad signature was terminal-generated/control bytes being sent back through the raw xterm input path as if they were user keystrokes. In the failing runs we saw attach-time input storms with repeated non-human-sized chunks, and those bytes leaked into the Codex prompt. That explained why the UI looked stale/corrupt even though the runtime sessions were still alive.
Decision taken: do not forward raw xterm
Current validation: after a fresh run with multiple projects/sessions/workers, the diagnostic log shows only real keyboard input (
Committed and pushed in
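To illustrate the distinction the logs surfaced, terminal-generated replies can be told apart from keystrokes by shape: they are complete escape sequences such as cursor position reports, device-attribute replies, and OSC/DCS query responses. The sketch below is an assumption-laden illustration, not the committed fix: `classifyFrame` and `FrameKind` are hypothetical names, and the pattern list is a partial selection of standard xterm-style reply formats, not the set the PR actually uses.

```typescript
// Hypothetical sketch: classifyFrame and the pattern list are illustrative,
// not the filter that was committed in this PR.
type FrameKind = "terminal-reply" | "user-input";

// Standard reply shapes a terminal emits in response to application queries.
const REPLY_SEQUENCES: RegExp[] = [
  /\x1b\[\d+;\d+R/,                   // cursor position report (CPR)
  /\x1b\[[?>][\d;]*c/,                // primary/secondary device attributes
  /\x1b\[\d+n/,                       // device status report
  /\x1b\[\?\d+;\d+\$y/,               // DEC private mode report (DECRPM)
  /\x1b\][^\x07\x1b]*(?:\x07|\x1b\\)/, // OSC reply, e.g. a color query answer
  /\x1bP[^\x1b]*\x1b\\/,              // DCS reply
];

// A frame counts as a reply only if it consists entirely of reply sequences.
const WHOLE_FRAME = new RegExp(
  `^(?:${REPLY_SEQUENCES.map((r) => r.source).join("|")})+$`,
);

function classifyFrame(data: string): FrameKind {
  return data.length > 0 && WHOLE_FRAME.test(data) ? "terminal-reply" : "user-input";
}

// Example: an arrow key is user input, a cursor position report is not.
console.log(classifyFrame("\x1b[A"));      // user-input
console.log(classifyFrame("\x1b[24;80R")); // terminal-reply
```

A classifier like this is mainly useful for logging and triage. Shape alone is ambiguous in places (some modified function keys encode like a cursor position report), so it should not be the only guard on the input path.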
Summary
Fixes #324 terminal transport track.
Tests
Notes