AssemblyAI · alexkroman · Jun 22, 2026
diff --git a/.claude/settings.json b/.claude/settings.json
@@ -73,6 +73,19 @@
       "Read(**/*.p12)"
     ]
   },
+  "sandbox": {
+    "network": {
+      "allowLocalBinding": true,
+      "allowMachLookup": ["com.apple.SystemConfiguration.configd"]
+    },
+    "filesystem": {
+      "allowWrite": [
+        "~/.cache/uv",
+        "~/.local/share/uv",
+        "~/.local/state/uv"
+      ]
+    }
+  },
   "hooks": {
     "SessionStart": [
       {

diff --git a/.importlinter b/.importlinter
@@ -13,7 +13,7 @@ type = layers
 ; assembles the command layer — main, command_registry, help_panels, options —
 ; stays at the package root, above `commands`, and is intentionally unlisted
 ; (it legitimately imports the command modules to discover/register them).
-; Feature slices (agent, tts, streaming, code_agent, code_gen, init, auth, onboard) are
+; Feature slices (agent, tts, streaming, agent_cascade, code_gen, init, auth, onboard) are
 ; likewise unlisted vertical slices governed by contract 2.
 layers =
     commands
@@ -34,7 +34,6 @@ source_modules =
     aai_cli.agent
     aai_cli.agent_cascade
     aai_cli.auth
-    aai_cli.code_agent
     aai_cli.code_gen
     aai_cli.init
     aai_cli.onboard

diff --git a/README.md b/README.md
@@ -51,7 +51,6 @@ That's it. Run `assembly onboard` for a guided tour, or see [Installation](#-ins
 | `assembly live` | Talk live to a tool-using voice agent, wired client-side from Streaming STT + a deepagents brain on the LLM Gateway + streaming TTS — it can web-search, fetch URLs, and read the docs mid-conversation, like the `agent-cascade` starter (sandbox-only) |
 | `assembly speak` | Synthesize text to speech over the streaming-TTS WebSocket (sandbox-only) |
 | `assembly llm` | Prompt the LLM Gateway over a transcript, files, stdin, or a live stream |
-| `assembly code` | Terminal coding agent (deepagents SDK) backed only by the LLM Gateway — reads/writes/edits files, runs shell, searches the docs MCP, and can invoke the `assembly` CLI itself; mutating actions ask for approval. Defaults to voice in a terminal (speak your request, replies read back via streaming TTS in the sandbox); pass `--no-voice` for the keyboard TUI |
 | `assembly clip` | Cut audio/video with ffmpeg by diarized speaker, text match, LLM pick, or time range (`--video` keeps the picture for URL sources) — clip boundaries snap into nearby silence |
 | `assembly dub` | Re-voice an audio/video file or URL in another language: transcription, LLM translation, per-speaker TTS, ffmpeg track-swap (sandbox-only) |
 | `assembly caption` | Burn always-visible captions into a video: transcribe (or reuse a transcript), fetch SRT, ffmpeg burns it in — audio untouched |

diff --git a/REFERENCE.md b/REFERENCE.md
@@ -159,3 +159,18 @@ Each server is launched independently and best-effort: one that won't start (a
 missing `npx`/`uvx`, an offline host) drops only its own tools, so a single broken
 tool never sinks the session. MCP tools are a live-run feature and are not
 reflected in `--show-code` output.
+
+`--files` lets the agent read, write, and run code in the directory you launch
+it from (off by default). Reads run immediately; a write, edit, or command run pauses
+the turn for confirmation in the voice TUI — press `y`/`n` (`a` approves the rest of the
+session) or just say it ("approve" / "run it" / "go ahead"; anything unclear is treated as
+a no). Destructive commands (e.g. `rm -rf`, `sudo`) ignore the spoken answer and require a
+keypress. Commands run OS-sandboxed in that directory — confined to it, with no network
+access — on macOS (`sandbox-exec`) and Linux (`bwrap`); on any other platform, or if the
+sandbox tool is missing, running code is refused rather than run unconfined. Access is
+rooted at the launch directory — the agent can't escape it. It can also delegate a
+focused subtask to a helper (a sandboxed general-purpose subagent), whose own writes and
+runs need the same confirmation. The agent also keeps a per-project memory file
+(`./.deepagents/AGENTS.md`) so it resumes knowing what it was working on. A non-interactive
+run (a file/URL source, `--json`, `-o text`, or a non-TTY) has no way to confirm a write or
+run, so those are declined there while reads still work.
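
Review note: the fail-safe spoken-approval rule described in this hunk can be sketched as below. This is a minimal illustration, not the CLI's implementation: `AFFIRMATIVES`, `DESTRUCTIVE_MARKERS`, `is_destructive`, and `spoken_verdict` are hypothetical names, and the real phrase list and risk rules may differ.

```python
from __future__ import annotations

# Hypothetical phrase list and risk markers, for illustration only.
AFFIRMATIVES = {"approve", "run it", "go ahead"}
DESTRUCTIVE_MARKERS = ("rm -rf", "sudo")


def is_destructive(command: str) -> bool:
    """Flag commands that must be confirmed with a keypress."""
    return any(marker in command for marker in DESTRUCTIVE_MARKERS)


def spoken_verdict(command: str, utterance: str) -> bool | None:
    """Map a spoken answer to approve (True), reject (False), or ignore (None).

    None means the utterance is disregarded and the prompt keeps waiting
    for a keypress, which is the rule for destructive commands.
    """
    if is_destructive(command):
        return None
    normalized = utterance.strip().lower().rstrip(".!?")
    # Anything that is not an unambiguous affirmative counts as a "no".
    return normalized in AFFIRMATIVES
```

Returning a third state for destructive commands keeps the caller from confusing "rejected by voice" with "voice not accepted here".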
diff --git a/aai_cli/AGENTS.md b/aai_cli/AGENTS.md
@@ -44,7 +44,7 @@ contract:
   `help_panels`, `options`. They assemble/define the command layer (and
   `command_registry` imports the command modules to discover them), so they live
   *above* `commands` and stay at the root.
-- **Feature slices** — `agent/`, `tts/`, `streaming/`, `code_agent/`, `code_gen/`,
+- **Feature slices** — `agent/`, `tts/`, `streaming/`, `code_gen/`,
   `init/`, `auth/`, `onboard/`. These are cohesive vertical slices that internally mix
   protocol + rendering, so they aren't a single horizontal layer; contract 2
   forbids them from importing `commands`.
@@ -151,9 +151,8 @@ heavily-reworked commands with long bodies; small commands keep the inline
 - **`streaming/`** + `client.stream_audio` — v3 realtime API. Event callbacks run on the SDK reader thread and guard against `BrokenPipeError` (`stdio.silence_stdout()`) so a closed pipe never dumps a thread traceback.
 - **`core/sync_stt.py`** + **`core/signals.py`** + `commands/dictate/` — `assembly dictate`: headless dictation over the **Sync STT API** (`Environment.sync_base`, one POST `/transcribe` per utterance with the required `X-AAI-Model: u3-sync-pro` header; 80 ms–120 s of PCM/WAV). It needs no terminal: recording starts immediately and `dictate_exec._record` polls `signals.stop_on_terminate` between ~100 ms mic chunks for a SIGTERM, which finishes the utterance (clean exit 0) — so a hotkey tool like Hammerspoon can launch it as a background task and `kill -TERM`/`task:terminate()` to transcribe. SIGINT (Ctrl-C) still cancels (exit 130). Both boundaries (the stop latch, mic, HTTP) are injectable, so the suite never needs a real signal or microphone (`tests/test_dictate_exec.py` scripts the SIGTERM latch). Contrast `signals.terminate_as_interrupt` (used by `stream`/`agent`/`speak`), which routes SIGTERM into the *cancel* path instead.
 - **`agent/`** — full-duplex voice agent (mic in, TTS out via `voices.py`).
-- **`agent_cascade/`** + `commands/agent_cascade/` — `assembly agent-cascade`: the same live terminal conversation as `assembly agent`, but **client-orchestrated** — `engine.run_cascade` wires Streaming STT → the LLM Gateway → streaming TTS itself instead of talking to the Voice Agent endpoint, mirroring what the `agent-cascade` `assembly init` template does server-side. **Sandbox-only** (streaming TTS has no prod host; guarded via `tts.session.require_available`). Reuses the agent slice's `DuplexAudio`/`AgentRenderer` and `core.client.stream_audio`/`core.llm.complete`/`tts.session.synthesize`; the three network legs are injected through `engine.CascadeDeps` (the `tts/session.py` seam) so the cascade — greeting, per-sentence TTS, barge-in, history window — is unit-tested against fakes with no sockets/mic/speaker. The LLM leg is a deepagents graph (`brain.py`); under `-v` (`debuglog.active()`) `brain._run_graph` *streams* that graph instead of `invoke`-ing it and logs each tool call/result/interim line as it lands (reusing `code_agent.events.message_events`), so a spoken turn that stalls mid-tool is debuggable — plain `invoke` runs the whole loop internally and `-v` would otherwise show only the httpx lines. **Front-end:** an interactive mic session in human mode runs a **voice-only Textual TUI** (`agent_cascade/tui.py`, `LiveAgentApp`) by default — there's no text input (you can't type to it), just a transcript + an animated voice bar tracking listening/thinking/speaking. It shares the `assembly code` TUI's chrome (`code_agent.banner` wordmark, `code_agent.messages` widgets, `code_agent.tui_status.voicebar_markup`/`VOICE_FRAMES`); the blocking `run_cascade` runs on a worker thread and reaches the UI through a `_TuiRenderer` (the `engine.Renderer` protocol) that hops each call onto the UI thread, and a quit calls `DuplexAudio.close` to end the mic iterator and unblock that worker. `_exec._should_use_tui` gates it: file/sample input, `--json`/`-o text`, and a non-TTY all fall back to the plain `AgentRenderer` line output.
+- **`agent_cascade/`** + `commands/agent_cascade/` — `assembly agent-cascade`: the same live terminal conversation as `assembly agent`, but **client-orchestrated** — `engine.run_cascade` wires Streaming STT → the LLM Gateway → streaming TTS itself instead of talking to the Voice Agent endpoint, mirroring what the `agent-cascade` `assembly init` template does server-side. **Sandbox-only** (streaming TTS has no prod host; guarded via `tts.session.require_available`). Reuses the agent slice's `DuplexAudio`/`AgentRenderer` and `core.client.stream_audio`/`core.llm.complete`/`tts.session.synthesize`; the three network legs are injected through `engine.CascadeDeps` (the `tts/session.py` seam) so the cascade — greeting, clause-level streaming TTS, barge-in, history window — is unit-tested against fakes with no sockets/mic/speaker. The LLM leg is a deepagents graph (`brain.py`) streamed token-by-token via `brain.build_streamer` (`graph.stream(stream_mode="messages")`): the engine buffers `SpeechDelta`s, flushes complete clauses with `text.pop_clauses` (soft-separator clauses gated by `engine._MIN_CLAUSE_CHARS`), and synthesizes each clause with **streaming TTS** (`tts.session.synthesize(on_audio=…)`) so audio starts on the first frame instead of after the whole reply. The reply runs on a throwaway producer thread feeding a `queue.Queue` the worker drains under a monotonic deadline (the wall-clock backstop that replaced `_complete_within`), and an abandoned-on-timeout graph leg's langchain `ThreadPoolExecutor` worker is detached (`_detach_executor_threads_since`) so it can't wedge interpreter exit. A `ToolNotice` surfaces the "Searching the web…" affordance and drops any unspoken preamble. Under `-v` (`debuglog.active()`) `brain._stream_graph` logs each accumulated assistant line, tool call, and tool result as it streams. **Front-end:** an interactive mic session in human mode runs a **voice-only Textual TUI** (`agent_cascade/tui.py`, `LiveAgentApp`) by default — there's no text input (you can't type to it), just a transcript + an animated voice bar tracking listening/thinking/speaking. It uses its own `banner` wordmark, `messages` widgets, and `tui_status.voicebar_markup`/`VOICE_FRAMES` — all modules that now live in `agent_cascade/`; the blocking `run_cascade` runs on a worker thread and reaches the UI through a `_TuiRenderer` (the `engine.Renderer` protocol) that hops each call onto the UI thread, and a quit calls `DuplexAudio.close` to end the mic iterator and unblock that worker. `_exec._should_use_tui` gates it: file/sample input, `--json`/`-o text`, and a non-TTY all fall back to the plain `AgentRenderer` line output. **`--files`** (off by default) swaps the brain's in-memory backend for a real-cwd, sandbox-capable `SandboxedShellBackend` (`aai_cli/agent_cascade/sandbox.py`): file ops behave as before (traversal-blocked `virtual_mode`), and because it implements `SandboxBackendProtocol` deepagents binds a *functional* `execute` that runs commands OS-sandboxed in the real cwd — `sandbox-exec` (SBPL) on macOS, `bwrap` on Linux, refused (never an unconfined fallback) on any other platform or with the sandbox binary missing; the OS sandbox blocks the network, confines writes to cwd (+ the temp dir), and read-denies credential stores (`~/.ssh`/`~/.aws`/…, `.env*`, `.claude/`). The policy renderers are pure and the subprocess/capability boundaries injected, so the suite asserts *what we'd run* with no real sandbox. `write_file`/`edit_file`/`execute` are gated via `interrupt_on` + an `InMemorySaver`; `brain._stream_gated` detects the post-stream interrupt (`graph.get_state(config).interrupts`), asks an injected `Approver`, and resumes with `Command(resume=…)`, bracketing the human wait in `ApprovalPause` events so `engine._consume` suspends its reply deadline (`risk.py` surfaces a shell-risk warning on the prompt). The voice TUI supplies the approver via `agent_cascade.modals.ApprovalScreen` (`y`/`a`/`n`), which can *also* be resolved hands-free by voice: while a write awaits approval, `_consume` arms `_awaiting_approval` and `engine.on_turn` routes the next final transcript to `app.submit_voice_approval` → `ApprovalScreen.try_voice`, which applies `spoken_approval.spoken_decision` (an unambiguous affirmative approves, anything else rejects — fail-safe; destructive `risk.py`-flagged commands ignore the spoken answer and require a keypress). Headless runs auto-deny (`_exec._deny_writes`). `--files` also turns on durable per-project memory via deepagents' `MemoryMiddleware` (`memory=["./.deepagents/AGENTS.md"]`), distinct from the in-session `InMemorySaver`, and binds one gateway-bound, sandbox-backed general-purpose subagent (deepagents' `task` tool; spec in `agent_cascade/subagents.py`, omitting `model`/`tools` so it inherits both) for delegating a focused subtask. The subagent's own `interrupt_on` mirrors `_WRITE_TOOLS`, and a delegated `write_file`/`edit_file`/`execute` surfaces at the *parent* `get_state().interrupts` (so `_pending_writes` gates it too — verified by a HITL spike, locked in `tests/test_agent_cascade_subagents.py`). Reads (incl. `grep`) stay ungated.
 - **`tts/`** + `commands/speak.py` — `assembly speak` synthesizes text to speech over the sandbox streaming-TTS WebSocket (`streaming-tts.sandbox000.…`). **Sandbox-only:** `session.is_available()` is false in production (empty `Environment.streaming_tts_host`), so the command exits 2 with a `--sandbox` hint. `session.synthesize` drives a Begin→Generate→Flush→Audio→Terminate protocol with an injectable `connect` for hermetic tests (mirrors `agent/session.py`); `audio.py` plays the PCM (default) or writes a WAV (`--out`). The single-voice default-playback path **streams**: `synthesize`'s `on_audio(chunk, sample_rate)` callback is wired to `audio.PcmPlayer.feed`, so speech starts on the first Audio frame (it opens the device lazily, since the rate is only known at Begin) instead of after the whole text — the win for a long `--url` page. `--out` (needs the full buffer) and the multi-voice dialogue path (`synthesize_dialogue` → `_output_audio` → buffered `play_pcm`) stay buffered; `synthesize` still returns the complete PCM for the summary regardless.
-- **`code_agent/`** + `commands/code/` — `assembly code`: a terminal coding agent (a bespoke port of langchain-ai/deepagents' `code` agent) that talks **only** to the LLM Gateway. `model.py` pins the model to `ChatOpenAI` against `llm_gateway_base`; `agent.py` builds the deepagents graph over a cwd-scoped `LocalShellBackend` (filesystem + shell tools), plus extra tools: the custom `assembly` CLI tool (`cli_tool.py`, runs `python -m aai_cli` with the key via child env, never argv), a URL `fetch_url` tool (`fetch_tool.py`), Firecrawl web search when `FIRECRAWL_API_KEY` is set (`firecrawl_search.py`, shared with the live voice agent), an `ask_user` tool routed through an `AskBridge` to the front-end (`ask_tool.py`), and best-effort docs MCP tools (`docs_mcp.py`). Middleware adds installed skills (`skills.py`) and long-term memory (`memory.py`), each over its own dedicated backend. Sessions persist via a SQLite checkpointer (`store.py`) keyed by `--session`, so conversations resume. Approval gates the mutating tools (write/edit/execute/`assembly`/`fetch_url`); the general-purpose `task` subagent comes from deepagents by default. `session.py` drives the graph turn-by-turn (interrupt/resume = human approval), emitting framework-agnostic `events.py` to either the Textual TUI (`tui.py`, modeled on deepagents-code: transcript + input + approval/ask modals + clipboard copy) or the Rich fallback (`render.py`). The whole orchestration is tested by driving the **real** graph with a fake `BaseChatModel` (`tests/test_code_agent.py`), so no network/TTY is needed. **Voice is the default front-end in an interactive TTY** (`voice.py` + `_exec._run_voice`): `VoiceSession.listen` captures one spoken turn over Streaming STT (gating the mic shut the instant a turn finalizes) and `VoiceSession.speak` reads each assistant reply back over streaming TTS. It runs the **Rich REPL** loop (not the keyboard TUI) with a voice `read_line` + a reply-speaking sink. Readback needs streaming TTS, so it's **sandbox-only** (`tts.session.is_available`); in production the mic input still works and replies stay on screen. A mic-less box degrades to typed input on the first `AUDIO_ERROR_TYPES` `CLIError`; `--no-voice` selects the TUI, and a non-TTY (pipe/CI) the headless loop. Both legs (STT/TTS) are injected like the cascade's, so `tests/test_code_voice.py` drives it with fakes — no mic/speaker/socket.
 - **`code_gen/`** — backs `--show-code` on `transcribe`/`stream`/`agent`: builds a ready-to-run Python SDK script from exactly the flags passed (no API key needed; generated code reads `ASSEMBLYAI_API_KEY`).
 - **`auth/`** — browser-assisted `assembly login` via AMS + **Stytch B2B OAuth discovery** (`discovery.py`, `flow.py`, `loopback.py`, `ams.py`). Not Stytch Connected Apps.
 - **`init/`** — scaffolds a self-contained FastAPI + HTML starter (`audio-transcription`/`live-captions`/`voice-agent` templates), optionally installs deps and opens the browser; writes the key to a git-ignored `.env`.
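
Review note: the "pure policy renderer" idea from the `--files` paragraph above might look roughly like this. It is a sketch under stated assumptions, not the code in `aai_cli/agent_cascade/sandbox.py`: `render_argv` and `SandboxUnavailable` are hypothetical names, the SBPL profile is a simplified illustration, and the temp-dir write allowance and credential-store read denials are omitted.

```python
from __future__ import annotations


class SandboxUnavailable(RuntimeError):
    """Raised instead of ever running a command unconfined."""


def render_argv(command: str, cwd: str, *, platform: str, have_tool: bool) -> list[str]:
    """Build the argv we would run; pure, so tests need no real sandbox."""
    if not have_tool:
        raise SandboxUnavailable("sandbox tool is missing")
    if platform == "darwin":
        # Illustrative profile only: no network, writes limited to cwd.
        # A real renderer would also escape quotes in `cwd`.
        profile = (
            "(version 1)(allow default)(deny network*)"
            "(deny file-write*)"
            f'(allow file-write* (subpath "{cwd}"))'
        )
        return ["sandbox-exec", "-p", profile, "/bin/sh", "-c", command]
    if platform.startswith("linux"):
        return [
            "bwrap",
            "--ro-bind", "/", "/",   # whole filesystem read-only
            "--bind", cwd, cwd,      # except the launch directory
            "--dev", "/dev",
            "--unshare-net",         # no network namespace access
            "--die-with-parent",
            "--chdir", cwd,
            "/bin/sh", "-c", command,
        ]
    raise SandboxUnavailable(f"no sandbox available on {platform}")
```

Because the function only returns a list, a test can assert on what would be executed without `sandbox-exec` or `bwrap` being installed.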

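Review note: the clause-level flushing described for the cascade (buffer streamed deltas, pop complete clauses, gate soft separators by a minimum length) could be sketched as follows. The function below is a hypothetical stand-in for `text.pop_clauses`; its signature, the separator sets, and the default of 24 characters are assumptions made for illustration.

```python
from __future__ import annotations

HARD_STOPS = ".!?"   # always end a clause
SOFT_STOPS = ",;:"   # end a clause only once it is long enough


def pop_clauses(buffer: str, min_chars: int = 24) -> tuple[list[str], str]:
    """Split off complete clauses; return them plus the unspoken remainder."""
    clauses: list[str] = []
    start = 0
    # Stop one short of the end: a separator only counts once the next
    # character has arrived and is whitespace (so "3.14" is not split).
    for i, ch in enumerate(buffer[:-1]):
        if not buffer[i + 1].isspace():
            continue
        piece = buffer[start : i + 1].strip()
        if ch in HARD_STOPS or (ch in SOFT_STOPS and len(piece) >= min_chars):
            clauses.append(piece)
            start = i + 1
    return clauses, buffer[start:].lstrip()
```

In a streaming loop the caller would append each delta to the remainder, call the function again, and hand every returned clause to TTS, so short fragments such as "Sure," are held back until more text arrives.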
diff --git a/aai_cli/agent/audio.py b/aai_cli/agent/audio.py
@@ -102,6 +102,11 @@ def __init__(
         # access goes through `_lock`. `_out_state` (the target->device ratecv state)
         # is touched ONLY by feed(), never the callback, so it needs no lock.
         self._in: queue.Queue[bytes | None] = queue.Queue()
+        # The mic gate: set = listening (real audio), clear = muted (silence to STT). Flipped
+        # from the UI thread (start/stop listening), read on the capture thread, so it's an
+        # Event rather than a bare bool. Starts open — a session listens as soon as it connects.
+        self._listening = threading.Event()
+        self._listening.set()
         # How long capture_frames() waits for a chunk before checking whether the
         # device stream silently died (e.g. unplugged); injectable for fast tests.
         self._poll_timeout = poll_timeout
@@ -179,12 +184,40 @@ def capture_frames(self) -> Iterator[bytes]:
                 continue
             if chunk is None:
                 return
+            if not self._listening.is_set():
+                # Muted: feed silence of the same length so the recognizer keeps receiving
+                # audio (the socket stays alive) but hears nothing, instead of stalling the
+                # stream. Resampling zeros still yields zeros, so gate before the resample.
+                chunk = bytes(len(chunk))
             if self._device_rate != self._target:
                 chunk, state = resample_pcm16(
                     chunk, state, src_rate=self._device_rate, dst_rate=self._target
                 )
             yield chunk
 
+    def set_listening(self, *, on: bool) -> None:
+        """Open or mute the mic in place, without tearing down the stream.
+
+        Muting keeps the full-duplex stream and the live STT/TTS session alive — captured
+        frames are zeroed to silence (see :meth:`capture_frames`) — so toggling back on
+        resumes listening instantly, with no socket reconnect.
+        """
+        if on:
+            self._listening.set()
+        else:
+            self._listening.clear()
+
+    def toggle_listening(self) -> bool:
+        """Flip the mic between listening and muted; return the resulting listening state."""
+        on = not self._listening.is_set()
+        self.set_listening(on=on)
+        return on
+
+    @property
+    def listening(self) -> bool:
+        """Whether the mic is feeding real audio to STT (vs muted silence)."""
+        return self._listening.is_set()
+
     def close(self) -> None:
         self._in.put(None)  # end capture_frames()
         if self._stream is not None:
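
Review note: the mute-with-silence gate added in this hunk can be exercised in isolation. The sketch below mirrors the pattern with a hypothetical `MicGate` class (not part of the patch) and leaves out resampling and the device stream.

```python
import threading
from collections.abc import Iterable, Iterator


class MicGate:
    """Pass audio through while listening; emit same-length silence when muted."""

    def __init__(self) -> None:
        # An Event rather than a bare bool: one thread flips it, another reads it.
        self._listening = threading.Event()
        self._listening.set()  # starts open

    def set_listening(self, *, on: bool) -> None:
        if on:
            self._listening.set()
        else:
            self._listening.clear()

    def frames(self, chunks: Iterable[bytes]) -> Iterator[bytes]:
        for chunk in chunks:
            if self._listening.is_set():
                yield chunk
            else:
                # bytes(n) is n zero bytes, i.e. PCM16 silence of equal length.
                yield bytes(len(chunk))
```

Keeping the chunk length unchanged means the downstream recognizer sees a steady stream at the same pacing, which is what lets the session stay connected while muted.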