Voice-to-text tool that captures speech, transcribes it via Whisper, and formats the output with an LLM. Press a hotkey, speak, release — formatted text lands in your clipboard and is auto-pasted into whatever you were typing in.
┌──────────────────────┐ PCM chunks ┌──────────────────────────────────┐ formatted
│ Wails Desktop App │ ──WebSocket──▶ │ Cloudflare Worker (Durable Obj) │ ──text────▶ Clipboard → Auto-paste
│ (Go + React WebView)│ ◀─────────────│ Whisper STT → LLM Formatter │
└──────────────────────┘ └──────────────────────────────────┘
- Hold Ctrl+Cmd — focus context is captured, recording starts, overlay appears at top-center
- Speak into your microphone (voice level meter shows input)
- Release Ctrl+Cmd — audio streams to the cloud
- Whisper transcribes, LLM formats, result is copied to clipboard and auto-pasted into the originating app
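The hotkey lifecycle above amounts to a small state machine. A minimal sketch of the transitions (state and event names are ours for illustration, not identifiers from the codebase):

```typescript
// Illustrative states for one press-speak-release cycle.
type State = "idle" | "recording" | "processing" | "done";
type Event = "press" | "release" | "result";

// Advance the lifecycle; unexpected events leave the state unchanged.
function next(state: State, event: Event): State {
  switch (state) {
    case "idle":
      return event === "press" ? "recording" : state; // hotkey down → start capture
    case "recording":
      return event === "release" ? "processing" : state; // hotkey up → stream to cloud
    case "processing":
      return event === "result" ? "done" : state; // worker replied → paste
    default:
      return state;
  }
}
```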
voicebox/
├── main.go # Wails entrypoint, app menu
├── app.go # App lifecycle, hotkey handlers, pipeline orchestration
├── window_darwin.go # macOS window management (overlay, settings, dock click)
├── window_other.go # Stub for non-macOS builds
├── internal/
│ ├── audio/ # PCM audio capture (malgo/miniaudio), RMS level
│ ├── pipeline/ # WebSocket client, streams audio + focus context to worker
│ ├── accessibility/ # macOS AX API: focused element context + auto-paste (Cmd+V)
│ ├── config/ # TOML config loading and saving
│ ├── hotkey/ # Global hotkey registration
│ ├── stt/ # STT provider interface (stubs)
│ └── formatter/ # LLM formatting provider interface (stubs)
├── frontend/ # React + Tailwind overlay UI (Vite)
│ └── src/
│ ├── App.tsx # Routes between settings mode and overlay mode
│ ├── components/
│ │ ├── settings-form.tsx # Config editor (react-hook-form + zod)
│ │ └── title-bar.tsx # Frameless title bar with drag region
│ └── hooks/
│ ├── use-voicebox.ts # voicebox:state / voicebox:mode / voicebox:level events
│ └── use-config.ts # GetConfig / SaveConfig / GetConfigPath bindings
├── worker/ # Cloudflare Worker (TypeScript)
│ ├── src/
│ │ ├── index.ts # Router: /ws (WebSocket), /health
│ │ ├── session.ts # Durable Object: audio accumulation + AI pipeline
│ │ ├── prompt.ts # System prompt + user message builder
│ │ ├── wav.ts # PCM-to-WAV wrapper
│ │ └── types.ts # Shared types
│ ├── test/ # Vitest tests
│ └── wrangler.jsonc # Worker configuration
├── go.mod
└── voicebox.toml # User config (gitignored)
- Go 1.24+
- Node.js + pnpm
- Wails v2 CLI
- A Cloudflare account with Workers AI access
- macOS (accessibility permission required for auto-paste)
cd worker
pnpm install
wrangler secret put VOICEBOX_TOKEN # set a shared secret
pnpm deploy
On first launch, VoiceBox opens a settings window. You can also create the config manually at ~/.config/voicebox/voicebox.toml:
[provider]
mode = "cloud"
[cloud]
worker_url = "https://voicebox.<your-subdomain>.workers.dev"
token = "your-shared-secret"
[audio]
sample_rate = 16000
channels = 1
chunk_size = 4096
[hotkey]
record = "ctrl+cmd"
Config is loaded from (in order): ~/.config/voicebox/voicebox.toml, next to the binary, then ./voicebox.toml.
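That lookup order is a first-match search over candidate paths. A hypothetical sketch of the logic (the real loader lives in internal/config and is written in Go; this only illustrates the precedence):

```typescript
// Candidate config locations, highest priority first.
function configCandidates(home: string, binDir: string): string[] {
  return [
    `${home}/.config/voicebox/voicebox.toml`,
    `${binDir}/voicebox.toml`,
    "./voicebox.toml",
  ];
}

// Return the first candidate that exists, or undefined if none do.
function resolveConfig(
  candidates: string[],
  exists: (path: string) => boolean
): string | undefined {
  return candidates.find(exists);
}
```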
Auto-paste requires macOS Accessibility access. On first use, macOS will prompt for permission, or you can grant it manually in System Settings → Privacy & Security → Accessibility.
wails dev # dev mode with hot reload
wails build # production binary
Settings (700×450, centered): Opens on launch, dock click, or via the Recording menu. Edit config here.
Overlay (160×48, top-center, floating): Appears during recording. Shows recording indicator with voice level meter, spinner while processing, checkmark on success.
Client connects to GET /ws?token=<auth-token>.
After receiving {"type":"ready"}, the client sends a configure message with audio and focus context, then streams binary PCM chunks:
Client Server
│── connect /ws?token=... ──────▶│
│◀── {"type":"ready"} ──────────│
│── {"type":"configure", ...} ──▶│
│── [binary PCM chunk] ─────────▶│
│── [binary PCM chunk] ─────────▶│
│── {"type":"audio_end"} ───────▶│
│◀── {"type":"processing",...} ──│
│◀── {"type":"result",...} ──────│
The configure message carries audio params and focused element context (app name, bundle ID, element role, title, placeholder, current value) used by the LLM formatter to tailor output.
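The message flow above can be captured as typed unions. A sketch of the shapes (the real definitions live in worker/src/types.ts; field names beyond "type" are assumptions except where the text lists them):

```typescript
// Messages the server sends over the WebSocket.
type ServerMessage =
  | { type: "ready" }
  | { type: "processing" }
  | { type: "result"; text: string };

// Messages the client sends; binary PCM frames travel alongside these.
type ClientMessage =
  | {
      type: "configure";
      audio: { sampleRate: number; channels: number };
      focus?: {
        appName?: string;
        bundleId?: string;
        role?: string;
        title?: string;
        placeholder?: string;
        value?: string;
      };
    }
  | { type: "audio_end" };

// Parse a text frame into a ServerMessage, rejecting unknown types.
function parseServerMessage(raw: string): ServerMessage {
  const msg = JSON.parse(raw);
  if (!["ready", "processing", "result"].includes(msg.type)) {
    throw new Error(`unknown message type: ${msg.type}`);
  }
  return msg as ServerMessage;
}
```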
- STT: @cf/openai/whisper-large-v3-turbo
- Formatter: @cf/qwen/qwen3-30b-a3b-fp8
- STT: faster-whisper
- Formatter: Ollama
- Provider interfaces exist at internal/stt/ and internal/formatter/
- 16kHz sample rate, mono, PCM signed 16-bit LE
- ~4096 byte chunks (~128ms each)
- Max recording: ~25 MiB (~13 minutes)
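Given those parameters, wrapping the accumulated PCM for Whisper amounts to prepending a 44-byte RIFF header describing the stream. A hedged sketch of what worker/src/wav.ts might do (function name and exact layout are ours; this is the standard canonical WAV header for 16-bit PCM):

```typescript
// Wrap raw s16le PCM in a minimal WAV container.
// At 16 kHz mono s16le, 4096 bytes / (16000 Hz * 2 B) = 128 ms per chunk,
// and 25 MiB / 32000 B/s ≈ 819 s ≈ 13.7 minutes — matching the limits above.
function pcmToWav(pcm: Uint8Array, sampleRate = 16000, channels = 1): Uint8Array {
  const bytesPerSample = 2; // signed 16-bit LE
  const byteRate = sampleRate * channels * bytesPerSample;
  const header = new ArrayBuffer(44);
  const v = new DataView(header);
  const ascii = (off: number, s: string) => {
    for (let i = 0; i < s.length; i++) v.setUint8(off + i, s.charCodeAt(i));
  };
  ascii(0, "RIFF");
  v.setUint32(4, 36 + pcm.length, true); // file size minus the first 8 bytes
  ascii(8, "WAVE");
  ascii(12, "fmt ");
  v.setUint32(16, 16, true); // fmt chunk size
  v.setUint16(20, 1, true); // audio format: PCM
  v.setUint16(22, channels, true);
  v.setUint32(24, sampleRate, true);
  v.setUint32(28, byteRate, true);
  v.setUint16(32, channels * bytesPerSample, true); // block align
  v.setUint16(34, 8 * bytesPerSample, true); // bits per sample
  ascii(36, "data");
  v.setUint32(40, pcm.length, true);
  const out = new Uint8Array(44 + pcm.length);
  out.set(new Uint8Array(header), 0);
  out.set(pcm, 44);
  return out;
}
```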
# Desktop app
wails dev # dev server (Go + Vite hot reload)
wails build # production build
go vet ./... # lint Go
go test ./internal/... # test Go
# Frontend
cd frontend && pnpm install && pnpm build
# Worker
cd worker
pnpm dev # local dev server
pnpm lint # type-check
pnpm format # prettier
pnpm test # vitest
pnpm deploy # deploy to Cloudflare