
fix: keep terminal mux persistent across navigation #325

Merged
whoisasx merged 15 commits into main from fix/persistent-terminal-mux on Jun 21, 2026

@whoisasx

Collaborator

Summary

  • add a shell-owned terminal mux transport that reconnects across daemon/API base changes while preserving listeners
  • refactor visible terminal attachment to switch handles with explicit close/open frames instead of owning the websocket
  • harden backend mux data forwarding with an attachment identity guard and cover same-socket switching/cleanup behavior
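The second bullet can be pictured with a small sketch: the visible terminal no longer owns the socket, it only sends `close`/`open` frames for handles over a transport that something else keeps alive. The frame shape and the names `MuxFrame` and `VisibleTerminalSwitcher` are assumptions for illustration, not the PR's actual types.

```typescript
// Hypothetical frame shape; the real /mux protocol may differ.
type MuxFrame =
  | { type: "open"; handle: string }
  | { type: "close"; handle: string };

// Switches the visible terminal between handles over a shared transport.
// It never opens or closes the transport itself.
class VisibleTerminalSwitcher {
  private active: string | null = null;
  private send: (frame: MuxFrame) => void;

  constructor(send: (frame: MuxFrame) => void) {
    this.send = send;
  }

  show(handle: string | null): void {
    if (handle === this.active) return; // already visible: nothing to send
    if (this.active !== null) {
      this.send({ type: "close", handle: this.active }); // detach the old handle first
    }
    if (handle !== null) {
      this.send({ type: "open", handle }); // then attach the new one
    }
    this.active = handle;
  }
}

// Usage: two session switches and a final hide on one shared transport.
const sentFrames: MuxFrame[] = [];
const switcher = new VisibleTerminalSwitcher((f) => sentFrames.push(f));
switcher.show("reverbcode-1");
switcher.show("test-reverb-1");
switcher.show("test-reverb-1"); // repeated call is a no-op
switcher.show(null);
```

The point of the design is that navigation only changes which handle is open; the socket lifetime is decided elsewhere.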

Fixes #324 terminal transport track.

Tests

  • cd frontend && npm test -- terminal-mux useTerminalSession
  • cd frontend && npm run typecheck
  • cd backend && go test ./internal/terminal ./internal/httpd ./internal/adapters/runtime/zellij

Notes

whoisasx requested a review from Copilot on June 18, 2026 21:15

Copilot AI left a comment


Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

whoisasx marked this pull request as draft on June 18, 2026 21:52
@whoisasx

Collaborator Author

Codex terminal/session staleness RCA findings

This is the consolidated read from multiple read-only subagents, local code inspection, and docs. Current conclusion: this is not a single Codex flag problem. It is a layered terminal lifecycle issue, and Codex exposes it more severely than Claude does.

Most likely actual cause

AO is using zellij attach as the browser terminal transport. Every UI attach creates a real zellij client process. Every navigation/session switch can close that client and create another one.

That is fragile because:

  1. Backend creates one zellij attach PTY per /mux connection/open: backend/internal/terminal/manager.go (openTerminal).
  2. Frontend tears down /mux on terminal unmount/detach: frontend/src/renderer/hooks/useTerminalSession.ts (teardownMux).
  3. Backend closes the attach by canceling context before graceful pty.Close(): backend/internal/terminal/attachment.go (close).
  4. The PTY is started with Go's exec.CommandContext: backend/internal/terminal/pty_unix.go. Canceling the context kills the child process by default (os.Process.Kill), which can bypass the intended graceful zellij detach path.

This matches the visible behavior: after navigation or resize churn, zellij can have stale/ghost client state, bad size state, or a new attach that is not actually ready. Then xterm shows a distorted terminal, and typing appears dead.

Why input stops

The backend sends opened before the PTY is actually attached:

  • backend/internal/terminal/manager.go sends opened before a.run(...) starts.
  • Frontend marks the terminal attached from that frame in frontend/src/renderer/hooks/useTerminalSession.ts.
  • If the user types before the PTY exists, backend drops it because attachment.write returns terminal: no active pty, and the manager currently ignores that write error.

So the UI can say “attached,” while input is not guaranteed to be wired yet.
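A minimal model of the ordering that would close this window: queue input until the PTY exists, flush it in arrival order, and only then announce `opened`. This is a TypeScript sketch of the idea, not the Go backend code; `BufferedAttachment`, `markPtyReady`, and the frame shape are hypothetical.

```typescript
// Hypothetical frame; the real backend frame may carry more fields.
type OpenedFrame = { type: "opened"; id: string };

class BufferedAttachment {
  private pending: string[] = [];
  private ptyWrite: ((data: string) => void) | null = null;
  private id: string;
  private send: (frame: OpenedFrame) => void;

  constructor(id: string, send: (frame: OpenedFrame) => void) {
    this.id = id;
    this.send = send;
  }

  // Called for every input frame from the client.
  write(data: string): void {
    if (this.ptyWrite === null) {
      this.pending.push(data); // PTY not spawned yet: queue instead of dropping
      return;
    }
    this.ptyWrite(data);
  }

  // Called once the PTY has actually been spawned.
  markPtyReady(ptyWrite: (data: string) => void): void {
    this.ptyWrite = ptyWrite;
    for (const chunk of this.pending) ptyWrite(chunk); // flush in arrival order
    this.pending = [];
    this.send({ type: "opened", id: this.id }); // announce readiness last
  }
}

// Usage: two early writes are held, then flushed before "opened" is sent.
const ptyWrites: string[] = [];
const openedFrames: OpenedFrame[] = [];
const attachment = new BufferedAttachment("t1", (f) => openedFrames.push(f));
attachment.write("ls");
attachment.write("\r");
attachment.markPtyReady((d) => ptyWrites.push(d));
attachment.write("pwd");
```

A real buffer would likely need a size cap so a PTY that never spawns cannot grow the queue without bound.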

Why AO thinks stale sessions are alive

AO liveness currently means “zellij session exists,” not “Codex process is alive”:

  • backend/internal/adapters/runtime/zellij/zellij.go uses zellij list-sessions in IsAlive.
  • The zellij launch wrapper runs the agent command and then falls back to an interactive shell: ...; exec <shell> -i in backend/internal/adapters/runtime/zellij/commands.go.

So if Codex exits/hangs, zellij can still be alive. AO sees “alive,” UI keeps showing the terminal, but the actual Codex agent may be gone or wedged.
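To make the distinction concrete, here is a sketch of a liveness check that separates "zellij session exists" from "agent is still the foreground process". How the foreground command would be obtained is left open; `RuntimeProbe`, `classifyLiveness`, and the result names are assumptions, not AO's API.

```typescript
// Hypothetical probe result; how foregroundCommand is collected is not specified here.
interface RuntimeProbe {
  zellijSessionExists: boolean;
  foregroundCommand: string | null;
}

type AgentLiveness = "agent_alive" | "shell_fallback" | "session_gone";

function classifyLiveness(probe: RuntimeProbe, agentBinary: string): AgentLiveness {
  if (!probe.zellijSessionExists) return "session_gone";
  // Compare basenames so "/usr/local/bin/codex" matches "codex".
  const foreground = probe.foregroundCommand?.split("/").pop() ?? "";
  // Session exists but the agent is not in front: only the fallback shell is left.
  return foreground === agentBinary ? "agent_alive" : "shell_fallback";
}

// Usage
const aliveResult = classifyLiveness(
  { zellijSessionExists: true, foregroundCommand: "/usr/local/bin/codex" },
  "codex",
);
const fallbackResult = classifyLiveness(
  { zellijSessionExists: true, foregroundCommand: "zsh" },
  "codex",
);
const goneResult = classifyLiveness(
  { zellijSessionExists: false, foregroundCommand: null },
  "codex",
);
```

With only the first field available, every case except `session_gone` collapses into "alive", which is the behaviour described above.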

Why Codex breaks more than Claude

Claude has better lifecycle semantics:

  • deterministic native session id,
  • workspace config hooks,
  • SessionEnd-style exit signaling,
  • simpler launch args.

Codex is more fragile here because:

  • Codex hooks/trust are injected through many CLI -c launch flags.
  • Codex restore depends on persisted agentSessionId, which is not reliably available today.
  • Codex lacks a strong SessionEnd signal, so AO falls back to zellij liveness.
  • Codex TUI probes terminal state/color/events. We observed process samples blocked inside crossterm terminal reads such as terminal palette/event reads.

So Claude can survive terminal churn better. Codex gets stuck because it is more sensitive to terminal client state and AO cannot accurately distinguish “Codex alive” from “zellij pane alive.”

Why previous fixes did not fully solve it

  • Persistent /mux reduced reconnect churn, but did not fix the underlying zellij attach client lifecycle.
  • Codex color flags reduced one terminal probe path, but not all Codex TUI terminal reads.
  • Worktree stale-registration fix solved a real bug, but it was only one failure mode.
  • --no-alt-screen, NO_COLOR=1, and blank COLORTERM are mitigations, not a fix for the attach/detach architecture.

Confirmed related bugs/contributors found during investigation

  1. Stale Git worktree registration bug

    • Git could keep a prunable worktree registration after AO worktree dirs were removed.
    • AO trusted git worktree list and marked sessions as spawned even though their worktree paths did not exist.
    • Zellij opened, Codex failed immediately, and the wrapper fell back to shell.
    • Fix direction already identified: verify registered worktree paths exist, prune stale Git metadata, recreate worktree.
  2. opened is not input-ready

    • Backend sends opened before the attach PTY is actually spawned.
    • Input during that window is dropped.
  3. Close ordering can create ghost zellij clients

    • attachment.close cancels context before pty.Close().
    • CommandContext can kill the zellij attach process before it gracefully deregisters.
    • Code comments in pty_unix.go already note that dead-but-registered clients can pin session size.
  4. Overlapping attach clients can distort size

    • Backend dedupes attaches only within one WebSocket connection, not across remounts/connections.
    • Rapid navigation can create overlap between old and new zellij attach clients.
  5. Zellij liveness hides dead Codex

    • IsAlive checks zellij session, not foreground agent process.
    • Shell fallback after agent exit makes the zellij session look live.
  6. Codex has weaker exit metadata

    • Claude has SessionEnd-style behavior.
    • Codex activity mapping only covers prompt/permission/stop and relies heavily on reaper/zellij liveness.
  7. waiting_input can be sticky

    • A Codex permission hook can put a session in waiting_input.
    • Sticky/recent activity can prevent reaper-driven termination from surfacing clearly.
  8. UI can mask no_signal

    • Backend can derive no_signal, but frontend status mapping can treat unknown live statuses as working.
  9. delete-session --force errors are too broadly swallowed

    • Zellij Destroy treats any exec.ExitError as success, which can hide real deletion failures and leave resurrection metadata behind.
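Item 8 is the easiest one to show in code. A sketch of a frontend status mapping that passes `no_signal` through and sends unrecognized statuses to an explicit `unknown` bucket instead of `working`. The status names come from this thread except `unknown`, and `toUiStatus` is a hypothetical helper.

```typescript
// "unknown" is an assumed extra UI state; the others appear in this thread.
type UiStatus = "working" | "idle" | "waiting_input" | "no_signal" | "unknown";

const UI_STATUS = new Map<string, UiStatus>([
  ["working", "working"],
  ["idle", "idle"],
  ["waiting_input", "waiting_input"],
  ["no_signal", "no_signal"], // surfaced as-is, not folded into "working"
]);

function toUiStatus(backendStatus: string): UiStatus {
  // Anything unrecognized is shown as unknown rather than assumed healthy.
  return UI_STATUS.get(backendStatus) ?? "unknown";
}

// Usage
const noSignalUi = toUiStatus("no_signal");
const unexpectedUi = toUiStatus("some_future_status");
```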

External docs alignment

Recommended fix direction

Stop treating zellij attach as a disposable browser transport.

Concrete short-term fixes:

  1. Make terminal attach close graceful: close PTY first, then cancel context.
  2. Send opened only after the PTY is actually spawned and ready.
  3. Buffer input until PTY exists instead of dropping it.
  4. Track whether the foreground agent process is still Codex, not just whether zellij exists.
  5. Remove/change ; exec shell -i for agent sessions, or explicitly detect “agent exited, shell fallback only.”
  6. Tighten zellij Destroy error handling so only known “already gone” cases are swallowed.
  7. Surface no_signal in frontend instead of mapping it to working.
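Fix 1 is purely an ordering change. The sketch below models it with stand-ins for the PTY handle and the context cancel function; it is not the Go implementation in attachment.go, and `PtyHandle` and `closeAttachment` are hypothetical names.

```typescript
// Hypothetical stand-in for the backend's PTY handle.
interface PtyHandle {
  close(): void;
}

function closeAttachment(pty: PtyHandle | null, cancel: () => void): void {
  try {
    // Close the PTY first so the attach client can go through its own detach path.
    pty?.close();
  } finally {
    // Cancel last, as a backstop rather than the primary shutdown mechanism.
    cancel();
  }
}

// Usage: record the order in which the two steps run.
const closeOrder: string[] = [];
closeAttachment(
  { close: () => closeOrder.push("pty.close") },
  () => closeOrder.push("cancel"),
);
```

The `finally` keeps the cancel step from being skipped if closing the PTY throws.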

Longer-term fix:

Use one stable backend terminal attachment per session and let frontend clients subscribe to it, instead of spawning new zellij attach clients on every UI mount/navigation. Alternatively, use zellij’s web/client capability or a PTY architecture where the browser socket is not the owner of the underlying terminal client lifecycle.
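A rough shape for the "one attachment, many subscribers" idea, assuming the attach client is spawned on first subscribe and deliberately left running when viewers unsubscribe. `SessionAttachmentHub` and its methods are hypothetical, and teardown policy (when the attachment is finally closed) is left out.

```typescript
type Listener = (data: string) => void;

class SessionAttachmentHub {
  private listeners = new Map<string, Set<Listener>>();
  private spawnAttach: (sessionId: string) => void;

  constructor(spawnAttach: (sessionId: string) => void) {
    this.spawnAttach = spawnAttach;
  }

  // Returns an unsubscribe function. Unsubscribing never tears down the attachment.
  subscribe(sessionId: string, listener: Listener): () => void {
    let set = this.listeners.get(sessionId);
    if (set === undefined) {
      set = new Set<Listener>();
      this.listeners.set(sessionId, set);
      this.spawnAttach(sessionId); // first subscriber: spawn the single attach client
    }
    set.add(listener);
    const subscribers = set;
    return () => {
      subscribers.delete(listener);
    };
  }

  // Fan terminal output out to every current subscriber of the session.
  publish(sessionId: string, data: string): void {
    this.listeners.get(sessionId)?.forEach((listener) => listener(data));
  }
}

// Usage: two viewers share one attachment; one leaves, the other keeps receiving.
const spawned: string[] = [];
const hub = new SessionAttachmentHub((id) => spawned.push(id));
const seenA: string[] = [];
const seenB: string[] = [];
const unsubscribeA = hub.subscribe("reverbcode-1", (d) => seenA.push(d));
hub.subscribe("reverbcode-1", (d) => seenB.push(d));
hub.publish("reverbcode-1", "x");
unsubscribeA();
hub.publish("reverbcode-1", "y");
```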

@whoisasx

Collaborator Author

Terminal attachment lifecycle getting out of sync with route/session switching

The current evidence points to the terminal viewer attachment going stale, not the underlying Codex/zellij session dying.

The actual Codex/zellij sessions were still alive when checked through AO's zellij socket:

ZELLIJ_SOCKET_DIR=/tmp/ao-zellij-501 zellij list-sessions --no-formatting

Both sessions were active there. The misleading EXITED output came from running plain zellij ls without AO's socket dir.

The stale path is the viewer attachment chain:

React route/session view
  -> TerminalPane
  -> useTerminalSession()
  -> /mux WebSocket
  -> backend terminal manager
  -> zellij attach <session>
  -> zellij pane running Codex

The important mismatch is:

  1. TerminalPane keeps/reuses the xterm instance across route/session switching.
  2. When the route changes or the terminal component unmounts, useTerminalSession cleanup calls teardownMux().
  3. teardownMux() closes the /mux WebSocket.
  4. Backend cleanup then closes that specific zellij attach viewer process.
  5. The real orchestrator session keeps running, but the UI terminal can now be only an old xterm buffer with no live attach behind it.
  6. When the user navigates back, the frontend must create a brand-new /mux connection and a brand-new zellij attach.
  7. If that reattach is missed, delayed, races with cleanup, or attaches under a stale handle generation, the terminal looks alive but is not actually connected.

That matches the symptom: typing does not appear in the query panel. xterm does not locally echo input; typed characters only show when the backend attach accepts input and the terminal app redraws. If /mux -> zellij attach is gone, typing appears to do nothing.

Runtime evidence from the app showed short-lived mux sockets around navigation/session switching:

/mux status=101 duration=16s
/mux status=101 duration=15s
/mux status=101 duration=9s
/mux status=101 duration=14s

Process state also showed only one active frontend viewer attach at the time:

zellij attach reverbcode-1 options --pane-frames false

There was no matching active zellij attach for test-reverb-1, even though the underlying test-reverb-1 Codex/zellij session was still alive.

So the current RCA is:

session lifetime != terminal viewer lifetime

The daemon can correctly report the session as alive because it is alive. But the frontend terminal can still be detached because its /mux attachment was destroyed during navigation/session switching and not reliably recreated for the currently visible session.
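One way to keep a late callback from an old attach from touching the currently visible terminal is a generation counter: each attach takes a new generation, and callbacks are ignored once their generation is no longer current. A minimal sketch, with `AttachmentGenerations` as a hypothetical helper:

```typescript
class AttachmentGenerations {
  private current = 0;

  // Start a new attach; every earlier generation becomes stale.
  next(): number {
    this.current += 1;
    return this.current;
  }

  // Wrap a callback so it only runs while its generation is still current.
  guard<T>(generation: number, callback: (arg: T) => void): (arg: T) => void {
    return (arg: T) => {
      if (generation === this.current) callback(arg);
    };
  }
}

// Usage: output from the first attach arrives after a re-attach and is ignored.
const generations = new AttachmentGenerations();
const delivered: string[] = [];
const firstGeneration = generations.next();
const staleCallback = generations.guard(firstGeneration, (d: string) => delivered.push(d));
const secondGeneration = generations.next();
const liveCallback = generations.guard(secondGeneration, (d: string) => delivered.push(d));
staleCallback("old output");
liveCallback("new output");
```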

@whoisasx

Collaborator Author

Current RCA from terminal-flow logs

I checked the fresh diagnostic log after both Codex orchestrator sessions became stale-looking. The evidence points away from daemon/zellij process death and toward terminal-control bytes being forwarded as user input into Codex.

What the logs/process state show:

  • Both sessions are alive in the daemon: idle, isTerminated:false.
  • Both zellij panes are alive: exited:false.
  • Both Codex processes are still running.
  • test-reverb-1 currently has no zellij client attached.
  • interview-memvid-1 has one zellij client attached from the app.
  • User input reached backend and was written into the PTY, but Codex stopped responding normally after that.

The strongest signal is the pane dump. Both Codex prompts had garbage text inserted:

› 1c1/4a4a

That does not look like human input. It looks like leaked terminal-control / terminal-query response residue. The diagnostic log also shows a large input burst immediately after attach:

test-reverb-1:        646 mux.client.input frames
interview-memvid-1:   654 mux.client.input frames

Most of those frames are 27/28-byte chunks, not typed user text. The likely flow is:

xterm receives terminal probe/control sequences
→ xterm emits response bytes through onData
→ frontend treats every onData as user input
→ /mux forwards them to backend
→ backend writes them into zellij attach PTY
→ those bytes leak into Codex prompt
→ Codex TUI/query panel becomes corrupted/stale-looking

Why navigation/two sessions makes it worse: route switches close/reopen /mux, so each reconnect repeats the zellij/Codex attach handshake and repeats this input storm. That leaves the Codex prompt polluted. test-reverb was detached at the time of inspection; interview-memvid was still attached but already polluted.

Current conclusion: this is not primarily a dead daemon/session problem. The root issue appears to be that terminal-generated control-response bytes from the web terminal are being forwarded as interactive input through zellij attach, and Codex is more sensitive to that than Claude. The stale UI is a symptom of Codex TUI corruption after those bytes are injected.

Follow-up needed: classify the outbound mux.client.input payloads without logging sensitive raw user text, then filter or suppress terminal-generated control/mouse/report responses during attach/reconnect while preserving real keyboard input.
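A possible starting point for that classification: log only a coarse kind and a length per payload, never the payload itself. The categories and `summarizeInput` are assumptions. Note that an escape-sequence payload is not automatically terminal-generated, since arrow and function keys also produce them, so this only helps summarize logs and is not a filter on its own.

```typescript
type InputKind = "empty" | "printable" | "control" | "escape_sequence";

interface InputSummary {
  kind: InputKind;
  length: number; // UTF-16 code units, not bytes
}

// Summarize a payload for logging without recording its content.
function summarizeInput(data: string): InputSummary {
  const length = data.length;
  if (length === 0) return { kind: "empty", length };
  // ESC followed by more characters: CSI/OSC-style sequence (reports, arrows, mouse).
  if (data.charCodeAt(0) === 0x1b && length > 1) return { kind: "escape_sequence", length };
  for (let i = 0; i < length; i++) {
    const code = data.charCodeAt(i);
    if (code < 0x20 || code === 0x7f) return { kind: "control", length };
  }
  return { kind: "printable", length };
}

// Usage
const typedKey = summarizeInput("a");
const deviceReport = summarizeInput("\x1b[?1;2c");
const enterKey = summarizeInput("\r");
```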

@whoisasx

Collaborator Author

Update after the logging-led RCA and fix:

We added end-to-end terminal flow logging across the renderer, mux client, backend /mux, terminal attachment, PTY lifecycle, and zellij attach path. That let us stop guessing and compare the actual stale/corrupt runs against a fresh run.

Finding from the logs: the zellij sessions and daemon were not dying. The bad signature was terminal-generated/control bytes being sent back through the raw xterm input path as if they were user keystrokes. In the failing runs we saw attach-time input storms with repeated non-human-sized chunks, and those bytes leaked into the Codex prompt. That explained why the UI looked stale/corrupt even though the runtime sessions were still alive.

Decision taken: do not forward raw xterm onData as user input. The terminal now forwards only explicit user-originated input sources: keyboard, paste, and composition. We also gate input until the backend confirms the attach PTY is actually open, guard stale mux callbacks with attachment generations, buffer early backend input until PTY readiness, and make zellij attach quieter for the embedded client.
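The input decision can be sketched as a gate that tags every chunk with its source and forwards it only when the source is user-originated and the PTY has been confirmed open. This is a simplified model, not the committed code; `InputGate`, the `terminal_response` label, and dropping (rather than queueing) pre-open input on the frontend are assumptions.

```typescript
// "terminal_response" is a hypothetical label for bytes xterm generates by itself.
type InputSource = "keyboard" | "paste" | "composition" | "terminal_response";

const USER_SOURCES: ReadonlySet<InputSource> = new Set<InputSource>([
  "keyboard",
  "paste",
  "composition",
]);

class InputGate {
  private ptyOpen = false;
  private forward: (data: string) => void;

  constructor(forward: (data: string) => void) {
    this.forward = forward;
  }

  // Flip when the backend confirms the attach PTY is open, or when it closes.
  setPtyOpen(open: boolean): void {
    this.ptyOpen = open;
  }

  // Returns true if the chunk was forwarded to the mux.
  submit(source: InputSource, data: string): boolean {
    if (!this.ptyOpen) return false; // not attached yet
    if (!USER_SOURCES.has(source)) return false; // terminal-generated bytes stay local
    this.forward(data);
    return true;
  }
}

// Usage
const forwarded: string[] = [];
const gate = new InputGate((d) => forwarded.push(d));
const beforeOpen = gate.submit("keyboard", "a");
gate.setPtyOpen(true);
const reportResult = gate.submit("terminal_response", "\x1b[?1;2c");
const keyResult = gate.submit("keyboard", "a");
const pasteResult = gate.submit("paste", "ls -la");
```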

Current validation: after a fresh run with multiple projects/sessions/workers, the diagnostic log shows only real keyboard input (source=keyboard, bytes=1) being forwarded. The old 27/28-byte attach-time input storm is absent. The daemon API reports sessions alive, AO's isolated zellij namespace reports the sessions alive, and the app is now behaving correctly in the tested flow.

Committed and pushed in b25e8d8 (fix: harden terminal mux attachment flow).

whoisasx marked this pull request as ready for review on June 20, 2026 23:14
whoisasx merged commit 5e8c8de into main on Jun 21, 2026
10 checks passed

Development

Successfully merging this pull request may close these issues.

Initial findings and follow-on iteration to find actual RC: Electron terminal distortion with Codex/zellij

2 participants