Voice-to-text tool that captures speech, transcribes it via Whisper, and formats the output with an LLM. Press a hotkey, speak, release — formatted text lands in your clipboard and is auto-pasted into whatever you were typing in.
┌──────────────────────┐ PCM chunks ┌──────────────────────────────────┐ formatted
│ Wails Desktop App │ ──WebSocket──▶ │ Cloudflare Worker (Durable Obj) │ ──text────▶ Clipboard → Auto-paste
│ (Go + React WebView)│ ◀─────────────│ Whisper STT → LLM Formatter │
└──────────────────────┘ └──────────────────────────────────┘
- Hold Ctrl+Cmd — focus context is captured, recording starts, overlay appears at top-center
- Speak into your microphone (voice level meter shows input)
- Release Ctrl+Cmd — audio streams to the cloud
- Whisper transcribes, LLM formats, result is copied to clipboard and auto-pasted into the originating app
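The hotkey lifecycle above amounts to a small state machine. A minimal sketch of the transitions (state and event names are ours for illustration, not identifiers from the codebase):

```typescript
// Illustrative states for one press-speak-release cycle.
type State = "idle" | "recording" | "processing" | "done";
type Event = "press" | "release" | "result";

// Advance the lifecycle; unexpected events leave the state unchanged.
function next(state: State, event: Event): State {
  switch (state) {
    case "idle":
      return event === "press" ? "recording" : state; // hotkey down → start capture
    case "recording":
      return event === "release" ? "processing" : state; // hotkey up → stream to cloud
    case "processing":
      return event === "result" ? "done" : state; // worker replied → paste
    default:
      return state;
  }
}
```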
voicebox/
├── main.go # Wails entrypoint, app menu
├── app.go # App lifecycle, hotkey handlers, pipeline orchestration
├── window_darwin.go # macOS window management (overlay, settings, dock click)
├── window_other.go # Stub for non-macOS builds
├── internal/
│ ├── audio/ # PCM audio capture (malgo/miniaudio), RMS level
│ ├── pipeline/ # WebSocket client, streams audio + focus context to worker
│ ├── accessibility/ # macOS AX API: focused element context + auto-paste (Cmd+V)
│ ├── config/ # TOML config loading and saving
│ ├── hotkey/ # Global hotkey registration
│ ├── stt/ # STT provider interface (stubs)
│ └── formatter/ # LLM formatting provider interface (stubs)
├── frontend/ # React + Tailwind overlay UI (Vite)
│ └── src/
│ ├── App.tsx # Routes between settings mode and overlay mode
│ ├── components/
│ │ ├── settings-form.tsx # Config editor (react-hook-form + zod)
│ │ └── title-bar.tsx # Frameless title bar with drag region
│ └── hooks/
│ ├── use-voicebox.ts # voicebox:state / voicebox:mode / voicebox:level events
│ └── use-config.ts # GetConfig / SaveConfig / GetConfigPath bindings
├── worker/ # Cloudflare Worker (TypeScript)
│ ├── src/
│ │ ├── index.ts # Router: /ws (WebSocket), /health
│ │ ├── session.ts # Durable Object: audio accumulation + AI pipeline
│ │ ├── prompt.ts # System prompt + user message builder
│ │ ├── wav.ts # PCM-to-WAV wrapper
│ │ └── types.ts # Shared types
│ ├── test/ # Vitest tests
│ └── wrangler.jsonc # Worker configuration
├── go.mod
└── voicebox.toml # User config (gitignored)
- Go 1.24+
- Node.js + pnpm
- Wails v2 CLI
- A Cloudflare account with Workers AI access
- macOS (accessibility permission required for auto-paste)
cd worker
pnpm install
wrangler secret put VOICEBOX_TOKEN # set a shared secret
pnpm deploy
On first launch, VoiceBox opens a settings window. You can also create the config manually at ~/.config/voicebox/voicebox.toml:
[provider]
mode = "cloud"
[cloud]
worker_url = "https://voicebox.<your-subdomain>.workers.dev"
token = "your-shared-secret"
[audio]
sample_rate = 16000
channels = 1
chunk_size = 4096
[hotkey]
record = "ctrl+cmd"
Config is loaded from (in order): ~/.config/voicebox/voicebox.toml, next to the binary, then ./voicebox.toml.
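That lookup order is a first-match search over candidate paths. A hypothetical sketch of the logic (the real loader lives in internal/config and is written in Go; this only illustrates the precedence):

```typescript
// Candidate config locations, highest priority first.
function configCandidates(home: string, binDir: string): string[] {
  return [
    `${home}/.config/voicebox/voicebox.toml`,
    `${binDir}/voicebox.toml`,
    "./voicebox.toml",
  ];
}

// Return the first candidate that exists, or undefined if none do.
function resolveConfig(
  candidates: string[],
  exists: (path: string) => boolean
): string | undefined {
  return candidates.find(exists);
}
```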
Auto-paste requires macOS Accessibility access. On first use, macOS will prompt for permission, or you can grant it manually in System Settings → Privacy & Security → Accessibility.
wails dev # dev mode with hot reload
wails build # production binary
Settings (700×450, centered): Opens on launch, dock click, or via the Recording menu. Edit config here.
Overlay (160×48, top-center, floating): Appears during recording. Shows recording indicator with voice level meter, spinner while processing, checkmark on success.
Client connects to GET /ws?token=<auth-token>.
After receiving {"type":"ready"}, the client sends a configure message with audio and focus context, then streams binary PCM chunks:
Client Server
│── connect /ws?token=... ──────▶│
│◀── {"type":"ready"} ──────────│
│── {"type":"configure", ...} ──▶│
│── [binary PCM chunk] ─────────▶│
│── [binary PCM chunk] ─────────▶│
│── {"type":"audio_end"} ───────▶│
│◀── {"type":"processing",...} ──│
│◀── {"type":"result",...} ──────│
The configure message carries audio params and focused element context (app name, bundle ID, element role, title, placeholder, current value) used by the LLM formatter to tailor output.
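The message flow above can be captured as typed unions. A sketch of the shapes (the real definitions live in worker/src/types.ts; field names beyond "type" are assumptions except where the text lists them):

```typescript
// Messages the server sends over the WebSocket.
type ServerMessage =
  | { type: "ready" }
  | { type: "processing" }
  | { type: "result"; text: string };

// Messages the client sends; binary PCM frames travel alongside these.
type ClientMessage =
  | {
      type: "configure";
      audio: { sampleRate: number; channels: number };
      focus?: {
        appName?: string;
        bundleId?: string;
        role?: string;
        title?: string;
        placeholder?: string;
        value?: string;
      };
    }
  | { type: "audio_end" };

// Parse a text frame into a ServerMessage, rejecting unknown types.
function parseServerMessage(raw: string): ServerMessage {
  const msg = JSON.parse(raw);
  if (!["ready", "processing", "result"].includes(msg.type)) {
    throw new Error(`unknown message type: ${msg.type}`);
  }
  return msg as ServerMessage;
}
```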
- STT: @cf/openai/whisper-large-v3-turbo
- Formatter: @cf/qwen/qwen3-30b-a3b-fp8
- STT: faster-whisper
- Formatter: Ollama
- Provider interfaces exist at internal/stt/ and internal/formatter/
- 16kHz sample rate, mono, PCM signed 16-bit LE
- ~4096 byte chunks (~128ms each)
- Max recording: ~25 MiB (~13 minutes)
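Given those parameters, wrapping the accumulated PCM for Whisper amounts to prepending a 44-byte RIFF header describing the stream. A hedged sketch of what worker/src/wav.ts might do (function name and exact layout are ours; this is the standard canonical WAV header for 16-bit PCM):

```typescript
// Wrap raw s16le PCM in a minimal WAV container.
// At 16 kHz mono s16le, 4096 bytes / (16000 Hz * 2 B) = 128 ms per chunk,
// and 25 MiB / 32000 B/s ≈ 819 s ≈ 13.7 minutes — matching the limits above.
function pcmToWav(pcm: Uint8Array, sampleRate = 16000, channels = 1): Uint8Array {
  const bytesPerSample = 2; // signed 16-bit LE
  const byteRate = sampleRate * channels * bytesPerSample;
  const header = new ArrayBuffer(44);
  const v = new DataView(header);
  const ascii = (off: number, s: string) => {
    for (let i = 0; i < s.length; i++) v.setUint8(off + i, s.charCodeAt(i));
  };
  ascii(0, "RIFF");
  v.setUint32(4, 36 + pcm.length, true); // file size minus the first 8 bytes
  ascii(8, "WAVE");
  ascii(12, "fmt ");
  v.setUint32(16, 16, true); // fmt chunk size
  v.setUint16(20, 1, true); // audio format: PCM
  v.setUint16(22, channels, true);
  v.setUint32(24, sampleRate, true);
  v.setUint32(28, byteRate, true);
  v.setUint16(32, channels * bytesPerSample, true); // block align
  v.setUint16(34, 8 * bytesPerSample, true); // bits per sample
  ascii(36, "data");
  v.setUint32(40, pcm.length, true);
  const out = new Uint8Array(44 + pcm.length);
  out.set(new Uint8Array(header), 0);
  out.set(pcm, 44);
  return out;
}
```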
# Desktop app
wails dev # dev server (Go + Vite hot reload)
wails build # production build
go vet ./... # lint Go
go test ./internal/... # test Go
# Frontend
cd frontend && pnpm install && pnpm build
# Worker
cd worker
pnpm dev # local dev server
pnpm lint # type-check
pnpm format # prettier
pnpm test # vitest
pnpm deploy # deploy to Cloudflare