fix: keep terminal mux persistent across navigation by whoisasx · Pull Request #325 · aoagents/ReverbCode

whoisasx · 2026-06-18T21:14:00Z

Summary

add a shell-owned terminal mux transport that reconnects across daemon/API base changes while preserving listeners
refactor visible terminal attachment to switch handles with explicit close/open frames instead of owning the websocket
harden backend mux data forwarding with an attachment identity guard and cover same-socket switching/cleanup behavior

Fixes #324 terminal transport track.

Tests

cd frontend && npm test -- terminal-mux useTerminalSession
cd frontend && npm run typecheck
cd backend && go test ./internal/terminal ./internal/httpd ./internal/adapters/runtime/zellij

Notes

Left the separate agent-selection PR track from #324 ("Initial findings and follow-on iteration to find actual RC: Electron terminal distortion with Codex/zellij") untouched.

Copilot

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

whoisasx · 2026-06-20T10:28:18Z

Codex terminal/session staleness RCA findings

This is the consolidated read from multiple read-only subagents plus local code inspection and docs. Current conclusion: this is not a single Codex flag problem. It is a layered terminal lifecycle issue, and Codex exposes it more readily than Claude does.

Most likely actual cause

AO is using zellij attach as the browser terminal transport. Every UI attach creates a real zellij client process. Every navigation/session switch can close that client and create another one.

That is fragile because:

Backend creates one zellij attach PTY per /mux connection/open: backend/internal/terminal/manager.go (openTerminal).
Frontend tears down /mux on terminal unmount/detach: frontend/src/renderer/hooks/useTerminalSession.ts (teardownMux).
Backend closes the attach by canceling context before graceful pty.Close(): backend/internal/terminal/attachment.go (close).
The PTY is started with Go exec.CommandContext: backend/internal/terminal/pty_unix.go. Context cancellation can kill the child process and bypass the intended graceful zellij detach path.

This matches the visible behavior: after navigation or resize churn, zellij can have stale/ghost client state, bad size state, or a new attach that is not actually ready. Then xterm shows a distorted terminal, and typing appears dead.

Why input stops

The backend sends opened before the PTY is actually attached:

backend/internal/terminal/manager.go sends opened before a.run(...) starts.
Frontend marks the terminal attached from that frame in frontend/src/renderer/hooks/useTerminalSession.ts.
If the user types before the PTY exists, the backend drops the input: attachment.write returns the error "terminal: no active pty", and the manager currently ignores that write error.

So the UI can say “attached,” while input is not guaranteed to be wired yet.
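One way to close that window is to queue input until the PTY exists rather than dropping it. The sketch below is a minimal TypeScript model of that idea, not AO's backend code (which is Go). `PendingInputBuffer`, `markReady`, and the size cap are hypothetical names and values chosen for illustration.

```typescript
// Hypothetical model of "buffer until the PTY is ready"; not AO's actual API.
type PtyWrite = (data: string) => void;

class PendingInputBuffer {
  private queue: string[] = [];
  private queuedChars = 0;
  private sink: PtyWrite | null = null;
  private readonly maxChars: number;

  constructor(maxChars: number = 64 * 1024) {
    this.maxChars = maxChars;
  }

  // Accepts one input frame. Returns false only when the frame had to be
  // rejected, so the caller can report the drop instead of ignoring it.
  write(data: string): boolean {
    if (this.sink !== null) {
      this.sink(data);
      return true;
    }
    if (this.queuedChars + data.length > this.maxChars) {
      return false;
    }
    this.queue.push(data);
    this.queuedChars += data.length;
    return true;
  }

  // Call once the PTY is spawned: flushes queued frames in arrival order,
  // then routes all later writes straight to the PTY.
  markReady(sink: PtyWrite): void {
    for (const frame of this.queue) {
      sink(frame);
    }
    this.queue = [];
    this.queuedChars = 0;
    this.sink = sink;
  }
}
```

The cap keeps a never-ready attach from growing the queue without bound, and the boolean return gives the caller something to log when a frame is refused.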

Why AO thinks stale sessions are alive

AO liveness currently means “zellij session exists,” not “Codex process is alive”:

backend/internal/adapters/runtime/zellij/zellij.go uses zellij list-sessions in IsAlive.
The zellij launch wrapper runs the agent command and then falls back to an interactive shell: ...; exec <shell> -i in backend/internal/adapters/runtime/zellij/commands.go.

So if Codex exits/hangs, zellij can still be alive. AO sees “alive,” UI keeps showing the terminal, but the actual Codex agent may be gone or wedged.
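The distinction between "zellij session exists" and "agent is still running" can be modeled as a three-way classification. This is a hedged sketch only: `PaneProbe`, `foregroundCommand`, and `classifyLiveness` are hypothetical names, and how the backend would actually obtain the pane's foreground process is not specified in this thread.

```typescript
// Hypothetical liveness model; field names are illustrative, not AO's schema.
type AgentLiveness = "agent_running" | "shell_fallback" | "session_gone";

interface PaneProbe {
  // True when the zellij session is listed (what IsAlive checks today).
  sessionExists: boolean;
  // Assumed input: name of the pane's foreground process, or null if unknown.
  foregroundCommand: string | null;
}

function classifyLiveness(probe: PaneProbe, agentBinary: string): AgentLiveness {
  if (!probe.sessionExists) {
    return "session_gone";
  }
  if (probe.foregroundCommand === agentBinary) {
    return "agent_running";
  }
  // The session is up but something other than the agent owns the pane,
  // for example the interactive shell the launch wrapper falls back to.
  return "shell_fallback";
}
```

With this shape, a session whose agent has exited into the fallback shell reports `shell_fallback` instead of looking identical to a healthy one.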

Why Codex breaks more than Claude

Claude has better lifecycle semantics:

deterministic native session id,
workspace config hooks,
SessionEnd-style exit signaling,
simpler launch args.

Codex is more fragile here because:

Codex hooks/trust are injected through many CLI -c launch flags.
Codex restore depends on persisted agentSessionId, which is not reliably available today.
Codex lacks a strong SessionEnd signal, so AO falls back to zellij liveness.
Codex TUI probes terminal state/color/events. We observed process samples blocked inside crossterm terminal reads such as terminal palette/event reads.

So Claude can survive terminal churn better. Codex gets stuck because it is more sensitive to terminal client state and AO cannot accurately distinguish “Codex alive” from “zellij pane alive.”

Why previous fixes did not fully solve it

Persistent /mux reduced reconnect churn, but did not fix the underlying zellij attach client lifecycle.
Codex color flags reduced one terminal probe path, but not all Codex TUI terminal reads.
Worktree stale-registration fix solved a real bug, but it was only one failure mode.
--no-alt-screen, NO_COLOR=1, and blank COLORTERM are mitigations, not a fix for the attach/detach architecture.

Confirmed related bugs/contributors found during investigation

Stale Git worktree registration bug
- Git could keep a prunable worktree registration after AO worktree dirs were removed.
- AO trusted git worktree list and marked sessions spawned with paths that did not exist.
- Zellij opened, Codex failed immediately, and the wrapper fell back to shell.
- Fix direction already identified: verify registered worktree paths exist, prune stale Git metadata, recreate worktree.

opened is not input-ready
- Backend sends opened before the attach PTY is actually spawned.
- Input during that window is dropped.

Close ordering can create ghost zellij clients
- attachment.close cancels context before pty.Close().
- CommandContext can kill the zellij attach process before it gracefully deregisters.
- Code comments in pty_unix.go already note that dead-but-registered clients can pin session size.

Overlapping attach clients can distort size
- Backend dedupes attaches only within one WebSocket connection, not across remounts/connections.
- Rapid navigation can create overlap between old and new zellij attach clients.

Zellij liveness hides dead Codex
- IsAlive checks zellij session, not foreground agent process.
- Shell fallback after agent exit makes the zellij session look live.

Codex has weaker exit metadata
- Claude has SessionEnd-style behavior.
- Codex activity mapping only covers prompt/permission/stop and relies heavily on reaper/zellij liveness.

waiting_input can be sticky
- A Codex permission hook can put a session in waiting_input.
- Sticky/recent activity can prevent reaper-driven termination from surfacing clearly.

UI can mask no_signal
- Backend can derive no_signal, but frontend status mapping can treat unknown live statuses as working.

delete-session --force errors are too broadly swallowed
- Zellij Destroy treats any exec.ExitError as success, which can hide real deletion failures and leave resurrection metadata behind.

External docs alignment

Zellij documents session resurrection and attach behavior; exited sessions can be resurrected by attach unless serialization is disabled/deleted: https://zellij.dev/documentation/session-resurrection.html
Zellij options include session serialization controls: https://zellij.dev/documentation/options.html
Go os/exec CommandContext ties process lifetime to context cancellation; cancellation can kill the process rather than letting app-specific graceful cleanup run: https://pkg.go.dev/os/exec
xterm/fitted browser terminals require resize propagation back to the PTY; AO does this, but ghost/overlapping zellij clients can still pin bad sizes: https://xtermjs.org/docs/guides/using-addons/

Recommended fix direction

Stop treating zellij attach as a disposable browser transport.

Concrete short-term fixes:

Make terminal attach close graceful: close PTY first, then cancel context.
Send opened only after the PTY is actually spawned and ready.
Buffer input until PTY exists instead of dropping it.
Track whether the foreground agent process is still Codex, not just whether zellij exists.
Remove/change ; exec shell -i for agent sessions, or explicitly detect “agent exited, shell fallback only.”
Tighten zellij Destroy error handling so only known “already gone” cases are swallowed.
Surface no_signal in frontend instead of mapping it to working.
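For the last item in that list, a minimal sketch of a frontend status mapping that keeps no_signal visible and sends unrecognized values to an explicit unknown bucket instead of working. The `UiStatus` union and `toUiStatus` are hypothetical; only the status names working, waiting_input, idle, and no_signal come from this thread.

```typescript
// Hypothetical status mapping; AO's real frontend types may differ.
type UiStatus = "working" | "waiting_input" | "idle" | "no_signal" | "unknown";

function toUiStatus(backendStatus: string): UiStatus {
  switch (backendStatus) {
    case "working":
      return "working";
    case "waiting_input":
      return "waiting_input";
    case "idle":
      return "idle";
    case "no_signal":
      // Surfaced as its own state rather than folded into "working".
      return "no_signal";
    default:
      // Anything unrecognized is shown as unknown, never as working.
      return "unknown";
  }
}
```

The design choice is that the default branch must never return a healthy-looking status, so a new backend status cannot silently render as working.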

Longer-term fix:

Use one stable backend terminal attachment per session and let frontend clients subscribe to it, instead of spawning new zellij attach clients on every UI mount/navigation. Alternatively, use zellij’s web/client capability or a PTY architecture where the browser socket is not the owner of the underlying terminal client lifecycle.
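A rough sketch of the "one attachment per session, many subscribers" shape described above, written in TypeScript purely as a model. `SharedAttachment`, `AttachmentRegistry`, and the `spawned` counter are hypothetical; in the real backend the counter would correspond to spawning a zellij attach process, which this sketch does not do.

```typescript
// Hypothetical model: viewers subscribe to a per-session attachment
// instead of each owning their own attach client.
type ChunkListener = (chunk: string) => void;

class SharedAttachment {
  private listeners: ChunkListener[] = [];

  // Registers a viewer and returns a function that removes it again.
  subscribe(listener: ChunkListener): () => void {
    this.listeners.push(listener);
    return () => {
      this.listeners = this.listeners.filter((l) => l !== listener);
    };
  }

  // Fans one chunk of terminal output out to every current viewer.
  publish(chunk: string): void {
    for (const listener of this.listeners) {
      listener(chunk);
    }
  }
}

class AttachmentRegistry {
  private bySession = new Map<string, SharedAttachment>();
  // Counts how many attachments were created (stand-in for attach spawns).
  spawned = 0;

  forSession(sessionId: string): SharedAttachment {
    let attachment = this.bySession.get(sessionId);
    if (attachment === undefined) {
      attachment = new SharedAttachment();
      this.bySession.set(sessionId, attachment);
      this.spawned += 1;
    }
    return attachment;
  }
}
```

In this model a viewer unmounting only removes a listener. The attachment itself outlives navigation, so remounting does not create a second client for the same session.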

whoisasx · 2026-06-20T12:03:25Z

Terminal attachment lifecycle getting out of sync with route/session switching

The current evidence points to the terminal viewer attachment going stale, not the underlying Codex/zellij session dying.

The actual Codex/zellij sessions were still alive when checked through AO's zellij socket:

ZELLIJ_SOCKET_DIR=/tmp/ao-zellij-501 zellij list-sessions --no-formatting

Both sessions were active there. The misleading EXITED output came from running plain zellij ls without AO's socket dir.

The stale path is the viewer attachment chain:

React route/session view
  -> TerminalPane
  -> useTerminalSession()
  -> /mux WebSocket
  -> backend terminal manager
  -> zellij attach <session>
  -> zellij pane running Codex

The important mismatch is:

TerminalPane keeps/reuses the xterm instance across route/session switching.
When the route changes or the terminal component unmounts, useTerminalSession cleanup calls teardownMux().
teardownMux() closes the /mux WebSocket.
Backend cleanup then closes that specific zellij attach viewer process.
The real orchestrator session keeps running, but the UI terminal can now be only an old xterm buffer with no live attach behind it.
When the user navigates back, the frontend must create a brand-new /mux connection and a brand-new zellij attach.
If that reattach is missed, delayed, races with cleanup, or attaches under a stale handle generation, the terminal looks alive but is not actually connected.

That matches the symptom: typing does not appear in the query panel. xterm does not locally echo input; typed characters only show when the backend attach accepts input and the terminal app redraws. If /mux -> zellij attach is gone, typing appears to do nothing.
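The stale-generation race in the list above can be guarded with a simple counter: every new attach bumps the generation, and callbacks captured for an older generation become no-ops. This is a minimal sketch with hypothetical names (`AttachGenerations`, `begin`, `guard`), not the hook's actual implementation.

```typescript
// Hypothetical generation guard for attach callbacks.
class AttachGenerations {
  private current = 0;

  // Starts a new attach and invalidates every earlier generation.
  begin(): number {
    this.current += 1;
    return this.current;
  }

  isCurrent(generation: number): boolean {
    return generation === this.current;
  }

  // Wraps a handler so it only runs while its generation is still current.
  guard<T>(generation: number, handler: (value: T) => void): (value: T) => void {
    return (value: T) => {
      if (this.isCurrent(generation)) {
        handler(value);
      }
    };
  }
}
```

A late frame from a socket that was torn down during navigation then has no effect on the terminal that is currently visible.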

Runtime evidence from the app showed short-lived mux sockets around navigation/session switching:

/mux status=101 duration=16s
/mux status=101 duration=15s
/mux status=101 duration=9s
/mux status=101 duration=14s

Process state also showed only one active frontend viewer attach at the time:

zellij attach reverbcode-1 options --pane-frames false

There was no matching active zellij attach for test-reverb-1, even though the underlying test-reverb-1 Codex/zellij session was still alive.

So the current RCA is:

session lifetime != terminal viewer lifetime

The daemon can correctly report the session as alive because it is alive. But the frontend terminal can still be detached because its /mux attachment was destroyed during navigation/session switching and not reliably recreated for the currently visible session.

whoisasx · 2026-06-20T20:38:17Z

Current RCA from terminal-flow logs

I checked the fresh diagnostic log after both Codex orchestrator sessions became stale-looking. The evidence points away from daemon/zellij process death and toward terminal-control bytes being forwarded as user input into Codex.

What the logs/process state show:

Both sessions are alive in the daemon: idle, isTerminated:false.
Both zellij panes are alive: exited:false.
Both Codex processes are still running.
test-reverb-1 currently has no zellij client attached.
interview-memvid-1 has one zellij client attached from the app.
User input reached backend and was written into the PTY, but Codex stopped responding normally after that.

The strongest signal is the pane dump. Both Codex prompts had garbage text inserted:

› 1c1/4a4a

That does not look like human input. It looks like leaked terminal-control / terminal-query response residue. The diagnostic log also shows a large input burst immediately after attach:

test-reverb-1:        646 mux.client.input frames
interview-memvid-1:   654 mux.client.input frames

Most of those frames are 27/28-byte chunks, not typed user text. The likely flow is:

xterm receives terminal probe/control sequences
→ xterm emits response bytes through onData
→ frontend treats every onData as user input
→ /mux forwards them to backend
→ backend writes them into zellij attach PTY
→ those bytes leak into Codex prompt
→ Codex TUI/query panel becomes corrupted/stale-looking

Why navigation/two sessions makes it worse: route switches close/reopen /mux, so each reconnect repeats the zellij/Codex attach handshake and repeats this input storm. That leaves the Codex prompt polluted. test-reverb-1 was detached at the time of inspection; interview-memvid-1 was still attached but already polluted.

Current conclusion: this is not primarily a dead daemon/session problem. The root issue appears to be that terminal-generated control-response bytes from the web terminal are being forwarded as interactive input through zellij attach, and Codex is more sensitive to that than Claude. The stale UI is a symptom of Codex TUI corruption after those bytes are injected.

Follow-up needed: classify the outbound mux.client.input payloads without logging sensitive raw user text, then filter or suppress terminal-generated control/mouse/report responses during attach/reconnect while preserving real keyboard input.
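A possible starting point for that classification is sketched below: it returns only a category and a length, so the raw payload never needs to be logged. The patterns cover a few well-known xterm-style reply formats (cursor position report, device attributes reply, OSC color reply, focus report, SGR mouse report). Which of these actually appeared in the AO logs is not established here, and `classifyTerminalInput` and its category names are hypothetical.

```typescript
// Hypothetical classifier for outbound input frames; logs kind + length only.
type InputKind =
  | "printable"
  | "control_key"
  | "terminal_report"
  | "mouse_report"
  | "other_escape";

interface InputSummary {
  kind: InputKind;
  length: number;
}

const ESC = "\u001b";

// Replies a terminal emulator emits by itself in response to queries.
const TERMINAL_REPORTS: RegExp[] = [
  /^\u001b\[\d+;\d+R$/, // cursor position report
  /^\u001b\[[?>][\d;]*c$/, // device attributes reply
  /^\u001b\](?:4;\d+|1[0-2]);rgb:[0-9a-fA-F/]+(?:\u0007|\u001b\\)$/, // OSC color reply
  /^\u001b\[[IO]$/, // focus in / focus out report
];

const SGR_MOUSE_REPORT = /^\u001b\[<\d+;\d+;\d+[Mm]$/;

function classifyTerminalInput(data: string): InputSummary {
  const length = data.length;
  if (SGR_MOUSE_REPORT.test(data)) {
    return { kind: "mouse_report", length };
  }
  if (TERMINAL_REPORTS.some((pattern) => pattern.test(data))) {
    return { kind: "terminal_report", length };
  }
  if (data.startsWith(ESC)) {
    // Includes real keys such as arrows, so this bucket must not be
    // filtered blindly.
    return { kind: "other_escape", length };
  }
  if (/^[\u0000-\u001f\u007f]+$/.test(data)) {
    return { kind: "control_key", length };
  }
  return { kind: "printable", length };
}
```

Because arrow and function keys are also escape sequences, a filter built on this should suppress only the terminal_report and mouse_report buckets and leave other_escape alone.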

whoisasx · 2026-06-20T22:23:49Z

Update after the logging-led RCA and fix:

We added end-to-end terminal flow logging across the renderer, mux client, backend /mux, terminal attachment, PTY lifecycle, and zellij attach path. That let us stop guessing and compare the actual stale/corrupt runs against a fresh run.

Finding from the logs: the zellij sessions and daemon were not dying. The bad signature was terminal-generated/control bytes being sent back through the raw xterm input path as if they were user keystrokes. In the failing runs we saw attach-time input storms with repeated non-human-sized chunks, and those bytes leaked into the Codex prompt. That explained why the UI looked stale/corrupt even though the runtime sessions were still alive.

Decision taken: do not forward raw xterm onData as user input. The terminal now forwards only explicit user-originated input sources: keyboard, paste, and composition. We also gate input until the backend confirms the attach PTY is actually open, guard stale mux callbacks with attachment generations, buffer early backend input until PTY readiness, and make zellij attach quieter for the embedded client.
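The allow-list plus readiness gate described above can be sketched roughly as follows. This is an illustration under assumptions, not the committed code: `UserInputForwarder`, `TerminalInput`, and the "terminal" source tag are hypothetical names, while the keyboard, paste, and composition sources come from the decision above.

```typescript
// Hypothetical forwarder: only user-originated input passes, and only
// after the backend has confirmed the attach PTY is open.
type InputSource = "keyboard" | "paste" | "composition" | "terminal";

interface TerminalInput {
  source: InputSource;
  data: string;
}

class UserInputForwarder {
  private ready = false;
  private readonly send: (data: string) => void;

  constructor(send: (data: string) => void) {
    this.send = send;
  }

  // Flipped to true when the backend confirms the PTY is open, and back
  // to false on detach or socket close.
  setReady(ready: boolean): void {
    this.ready = ready;
  }

  // Returns true when the input was forwarded to the mux.
  forward(input: TerminalInput): boolean {
    if (!this.ready) {
      return false;
    }
    if (input.source === "terminal") {
      // Terminal-generated bytes (query replies, reports) are never
      // treated as keystrokes.
      return false;
    }
    this.send(input.data);
    return true;
  }
}
```

Tagging input at its origin avoids having to guess later, from the bytes alone, whether a sequence was typed or generated by the terminal.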

Current validation: after a fresh run with multiple projects/sessions/workers, the diagnostic log shows only real keyboard input (source=keyboard, bytes=1) being forwarded. The old 27/28-byte attach-time input storm is absent. The daemon API reports sessions alive, AO's isolated zellij namespace reports the sessions alive, and the app is now behaving correctly in the tested flow.

Committed and pushed in b25e8d8 (fix: harden terminal mux attachment flow).

whoisasx and others added 2 commits June 19, 2026 02:43

fix: keep terminal mux persistent across navigation

4cd8ea8

chore: format with prettier [skip ci]

7b66258

whoisasx requested a review from Copilot June 18, 2026 21:15

Copilot AI reviewed Jun 18, 2026

whoisasx requested review from AgentWrapper and harshitsinghbhandari June 18, 2026 21:19

whoisasx and others added 2 commits June 19, 2026 03:17

fix: require configured agent defaults

018a339

chore: format with prettier [skip ci]

623fe61

whoisasx marked this pull request as draft June 18, 2026 21:52

whoisasx added 5 commits June 19, 2026 03:33

fix: recreate stale orchestrator worktrees

5ea22d3

fix: stop terminal reattach on socket close

8e6f9ad

fix: stop terminal attach loop on close

efa8fdb

fix: keep agent default selects controlled

e36d8d5

revert: undo persistent mux branch changes

e760507

fix: harden terminal mux attachment flow

b25e8d8

github-actions Bot and others added 4 commits June 20, 2026 22:23

chore: format with prettier [skip ci]

0ac88bd

chore: remove terminal flow diagnostics

c551b45

chore: format with prettier [skip ci]

16e2395

Merge origin/main into fix/persistent-terminal-mux

5fa0a2c

whoisasx marked this pull request as ready for review June 20, 2026 23:14

fix: satisfy zellij command lint

31a0047

whoisasx merged commit 5e8c8de into main Jun 21, 2026
10 checks passed

areycruzer mentioned this pull request Jun 21, 2026

Frontend Playwright e2e suite is stale and can run against the wrong renderer server #357

Open

i-trytoohard mentioned this pull request Jun 21, 2026

fix(terminal): enable zellij mouse mode so wheel scroll works #363

Open

whoisasx merged 15 commits into main from fix/persistent-terminal-mux

2 participants
