iamfaham/Deafine

Deafine - Real-Time Multi-Speaker Transcription

Deafine is a simple Python accessibility tool for deaf and hard-of-hearing users. It provides real-time speaker diarization and live transcription through the ElevenLabs API.

Demo

Watch the demo

Click the image above to watch the demo on YouTube.

Features

  • 🎀 Real-time microphone capture (Console mode)
  • 🌐 WebSocket API - Stream audio from browser for live transcription
  • πŸ‘₯ Automatic multi-speaker detection (S1, S2, S3...) via ElevenLabs
  • πŸ“ Live transcription with speaker labels (full text, not summarized)
  • πŸ”€ Overlap detection when multiple speakers talk simultaneously
  • πŸ’Ύ Optional recording to save audio and transcripts
  • πŸ“Š Session summaries - Generated when closing (overall + per-speaker)
  • πŸš€ FastAPI REST API - Upload audio files or stream via WebSocket
  • ⚑ Simple and fast - no local ML models, no build tools required
  • πŸ’° Optional VAD - Save bandwidth and API costs (if webrtcvad installed)

Quick Start

Installation

# Create virtual environment
python -m venv .venv

# Activate (Linux/Mac)
source .venv/bin/activate

# Activate (Windows)
.venv\Scripts\activate

# Install dependencies
pip install -e .

# Copy environment template
cp env.template .env
# (or on Windows: copy env.template .env)

Configuration

IMPORTANT: You need an ElevenLabs API key.

  1. Get your API key from https://elevenlabs.io
  2. Edit .env and add your key:
ELEVEN_API_KEY=your_key_here

Running

Basic usage:

audio-access run

With recording enabled:

audio-access run --record

Running as API Server

Start the FastAPI server:

uvicorn audio_access.api:app --reload --host 0.0.0.0 --port 8000

Or with Docker:

docker-compose up --build

API will be available at http://localhost:8000

REST API Endpoints:

  • POST /transcribe - Upload audio file for transcription
  • POST /transcribe/stream - Async transcription for large files
  • GET /health - Health check
  • GET /docs - Interactive Swagger UI

WebSocket API:

  • ws://localhost:8000/ws/transcribe - Real-time audio streaming
  • GET /ws/sessions - List active WebSocket sessions

See WEBSOCKET_API.md for WebSocket integration guide.
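
As a rough illustration of calling the upload endpoint listed above, here is a minimal standard-library sketch. The form field name "file", the audio/wav content type, and the assumption that the server answers with JSON are guesses, not taken from the project; check the Swagger UI at /docs for the actual request and response shape.

```python
import json
import os
import uuid
from urllib import request


def build_multipart(field: str, filename: str, payload: bytes,
                    content_type: str = "audio/wav") -> tuple[bytes, str]:
    """Encode one file as multipart/form-data.

    Returns the request body and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def transcribe_file(path: str, base_url: str = "http://localhost:8000",
                    field: str = "file") -> dict:
    """POST an audio file to /transcribe and return the decoded response.

    Assumption: the field name "file" and a JSON response body.
    """
    with open(path, "rb") as fh:
        body, header = build_multipart(field, os.path.basename(path), fh.read())
    req = request.Request(
        f"{base_url}/transcribe",
        data=body,
        headers={"Content-Type": header},
        method="POST",
    )
    with request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, transcribe_file("meeting.wav") would send the file and return whatever the endpoint responds with.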

How It Works

Console Mode (Local Microphone)

Microphone → [Optional VAD] → ElevenLabs API → Live Transcription with Speakers
  1. Audio Capture: Records from microphone at 16kHz mono
  2. VAD (Optional): Filters silence to save bandwidth (if installed)
  3. ElevenLabs: Transcribes and identifies speakers (audio is sent every 5 seconds; adjustable via DEAFINE_ELEVENLABS_CHUNK_SECS)
  4. Display: Shows live output in console
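
The batching in step 3 can be pictured with the sketch below: small capture chunks are accumulated until five seconds of audio are available, then released as one batch. The class name ChunkBatcher and the 16-bit sample width are assumptions made for illustration; the project's own buffering code may look different.

```python
SAMPLE_RATE = 16_000    # Hz, as stated in the pipeline description
BYTES_PER_SAMPLE = 2    # assumption: 16-bit PCM, as in WebSocket mode


class ChunkBatcher:
    """Accumulate PCM bytes and release fixed-length batches (hypothetical helper)."""

    def __init__(self, batch_secs: float = 5.0) -> None:
        self.batch_bytes = int(SAMPLE_RATE * BYTES_PER_SAMPLE * batch_secs)
        self._buffer = bytearray()

    def push(self, pcm: bytes) -> list[bytes]:
        """Add captured audio and return every complete batch now available."""
        self._buffer.extend(pcm)
        batches = []
        while len(self._buffer) >= self.batch_bytes:
            batches.append(bytes(self._buffer[:self.batch_bytes]))
            del self._buffer[:self.batch_bytes]
        return batches

    def flush(self) -> bytes:
        """Return any leftover audio (for example on shutdown) and reset."""
        rest = bytes(self._buffer)
        self._buffer.clear()
        return rest
```

At 16 kHz and 2 bytes per sample, a 5-second batch is 160,000 bytes, so a 320 ms capture chunk (10,240 bytes) fills a batch on the sixteenth push.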

WebSocket Mode (Browser/Frontend)

Browser Mic → WebSocket → FastAPI → ElevenLabs API → WebSocket → Frontend Display
  1. Frontend: Captures audio from browser microphone (16kHz, 16-bit PCM)
  2. WebSocket: Streams audio chunks to backend in real-time
  3. ElevenLabs: Backend processes audio and identifies speakers
  4. WebSocket: Backend streams transcript segments back to frontend
  5. Frontend: Displays live transcription with speaker labels

See WEBSOCKET_API.md for integration details.
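
To make the audio format in step 1 concrete, the sketch below converts floating-point samples to 16-bit little-endian PCM and cuts the result into fixed-duration frames. It is written in Python for consistency with the rest of the project, although a browser frontend would do the same in JavaScript. The function names and the 320 ms frame size are illustrative assumptions, and the actual message framing expected by the server is defined in WEBSOCKET_API.md, not here.

```python
import struct


def float_to_pcm16(samples: list[float]) -> bytes:
    """Convert floats in [-1.0, 1.0] to little-endian signed 16-bit PCM."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]   # guard against overshoot
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)


def split_frames(pcm: bytes, frame_ms: int = 320,
                 sample_rate: int = 16_000) -> list[bytes]:
    """Split PCM bytes into frames of frame_ms each; the last may be shorter."""
    frame_bytes = sample_rate * 2 * frame_ms // 1000       # 2 bytes per sample
    return [pcm[i:i + frame_bytes] for i in range(0, len(pcm), frame_bytes)]
```

Each frame returned by split_frames is one candidate payload for a WebSocket message.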

Requirements

  • Python 3.9+
  • ElevenLabs API key (required)
  • Microphone access
  • Internet connection

Optional: Voice Activity Detection

To save ~60% bandwidth and API costs, install webrtcvad:

Linux/Mac:

pip install webrtcvad

Windows: Requires Microsoft C++ Build Tools

# Install from: https://visualstudio.microsoft.com/visual-cpp-build-tools/
# Then:
pip install webrtcvad

Docker/Cloud: Works automatically (Linux-based)

The app works fine without VAD - it just sends all audio instead of filtering silence.
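
The optional-dependency behaviour described above could be sketched as follows. webrtcvad.Vad(aggressiveness) and Vad.is_speech(frame, sample_rate) are the library's real calls, and the library accepts only 10, 20, or 30 ms frames of 16-bit mono PCM. The helper names make_speech_detector and filter_speech are hypothetical and not taken from Deafine.

```python
try:
    import webrtcvad            # optional dependency
except ImportError:
    webrtcvad = None            # fall back to sending all audio


def make_speech_detector(aggressiveness: int = 2, sample_rate: int = 16_000):
    """Return a frame -> bool predicate.

    Without webrtcvad installed, every frame counts as speech, so nothing
    is filtered out.
    """
    if webrtcvad is None:
        return lambda frame: True
    vad = webrtcvad.Vad(aggressiveness)
    # Frames must be 10, 20 or 30 ms long (320, 640 or 960 bytes at 16 kHz).
    return lambda frame: vad.is_speech(frame, sample_rate)


def filter_speech(frames, is_speech):
    """Keep only the frames the predicate marks as speech."""
    return [frame for frame in frames if is_speech(frame)]
```

Passing the predicate in as an argument keeps filter_speech usable with any detector, including a simple energy threshold.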

CLI Options

Console Mode:

audio-access run [OPTIONS]

Options:
  --record    Save audio and transcript to disk
  --help      Show help message

API Mode:

# Start API server
uvicorn audio_access.api:app --reload

# Or with Docker
docker-compose up

Display

┌─ 🎤 Deafine - Live Transcription (ElevenLabs) [00:45] ──────┐
│ Speaker  | Live Transcript                                  │
│ ──────────────────────────────────────────────────────────  │
│ S1       | Hello, how are you today?                        │
│ S2       | I'm doing great, thanks!                         │
└─────────────────────────────────────────────────────────────┘

Configuration Options

Edit .env to configure:

# Required
ELEVEN_API_KEY=your_key_here

# Optional: Enable/disable VAD (if webrtcvad installed)
DEAFINE_USE_VAD=true
DEAFINE_VAD_AGGRESSIVENESS=2  # 0-3 (higher = more aggressive)

# How often to send audio to ElevenLabs (seconds)
DEAFINE_ELEVENLABS_CHUNK_SECS=5

# Audio processing
DEAFINE_CHUNK_MS=320
DEAFINE_HOP_MS=160
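
For readers wondering how these variables might be consumed, here is a minimal sketch that reads them from the environment into a typed object. The Settings class and load_settings function are hypothetical, and the fallback values simply mirror the sample .env above; they are not confirmed defaults of the application.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    api_key: str
    use_vad: bool
    vad_aggressiveness: int
    elevenlabs_chunk_secs: float
    chunk_ms: int
    hop_ms: int


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    api_key = env.get("ELEVEN_API_KEY", "").strip()
    if not api_key:
        raise ValueError("ELEVEN_API_KEY is required")
    aggressiveness = int(env.get("DEAFINE_VAD_AGGRESSIVENESS", "2"))
    if not 0 <= aggressiveness <= 3:
        raise ValueError("DEAFINE_VAD_AGGRESSIVENESS must be between 0 and 3")
    return Settings(
        api_key=api_key,
        # Assumption: common truthy spellings are accepted.
        use_vad=env.get("DEAFINE_USE_VAD", "true").strip().lower()
        in {"1", "true", "yes", "on"},
        vad_aggressiveness=aggressiveness,
        elevenlabs_chunk_secs=float(env.get("DEAFINE_ELEVENLABS_CHUNK_SECS", "5")),
        chunk_ms=int(env.get("DEAFINE_CHUNK_MS", "320")),
        hop_ms=int(env.get("DEAFINE_HOP_MS", "160")),
    )
```

Accepting a plain mapping makes the loader easy to exercise without touching the real environment.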

Session Summaries

When you stop the app (Ctrl+C), you'll see:

📊 SESSION SUMMARY
======================================================================

📝 Overall Conversation:
   Discussion about testing the Deafine app and confirming it works.

👥 Speaker Summaries:

   S1:
      Initiated testing of the application and asked for confirmation.
      (45 words, 12.3s speaking time)

   S2:
      Confirmed the application is working correctly.
      (23 words, 6.1s speaking time)

📈 Statistics:
   Total Speakers: 2
   Total Segments: 8
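
The per-speaker word counts and speaking times shown above could be derived from transcript segments roughly as follows. The Segment shape (speaker, text, start, end) is an assumed structure for illustration; the fields Deafine actually stores may differ.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Segment:                 # hypothetical segment shape
    speaker: str
    text: str
    start: float               # seconds from session start
    end: float


def speaker_stats(segments):
    """Aggregate word count, speaking time and segment count per speaker."""
    stats = defaultdict(lambda: {"words": 0, "seconds": 0.0, "segments": 0})
    for seg in segments:
        entry = stats[seg.speaker]
        entry["words"] += len(seg.text.split())
        entry["seconds"] += max(0.0, seg.end - seg.start)
        entry["segments"] += 1
    return dict(stats)
```

The number of keys in the result gives the "Total Speakers" figure, and summing the segment counts gives "Total Segments".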

With AI Summaries (Optional - LangChain + OpenRouter)

Install:

pip install langchain langchain-openai

Get API key at: https://openrouter.ai/keys

Add to .env:

OPENROUTER_API_KEY=your_key
OPENROUTER_MODEL=mistralai/mistral-small-3.2-24b-instruct:free  # FREE! or use any model from OpenRouter

Benefits:

  • ✅ Access to multiple AI providers (OpenAI, Anthropic, Google, Meta, etc.)
  • ✅ Often cheaper than direct OpenAI
  • ✅ Better summaries than extractive method
  • ✅ Choose your preferred model

Popular Models:

  • mistralai/mistral-small-3.2-24b-instruct:free - FREE! Fast and good quality (default)
  • openai/gpt-4o-mini - Fast and cheap ($0.15/1M tokens)
  • anthropic/claude-3-haiku - Great quality, affordable
  • google/gemini-pro - Free tier available
  • meta-llama/llama-3.1-8b-instruct:free - Open source, free

See all models: https://openrouter.ai/models
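
One plausible way to wire LangChain to OpenRouter is sketched below. It points langchain-openai's ChatOpenAI at OpenRouter's OpenAI-compatible endpoint. The prompt wording and both function names are invented for illustration, and this is not necessarily how Deafine builds its summaries.

```python
import os


def build_summary_prompt(lines: list[tuple[str, str]]) -> str:
    """Turn (speaker, text) pairs into a summarization prompt (hypothetical wording)."""
    transcript = "\n".join(f"{speaker}: {text}" for speaker, text in lines)
    return (
        "Summarize this conversation in two sentences, then give one "
        "sentence per speaker.\n\n" + transcript
    )


def summarize(lines: list[tuple[str, str]]):
    """Send the prompt to the model configured in .env and return the reply."""
    # Imported lazily because langchain-openai is an optional dependency.
    from langchain_openai import ChatOpenAI

    llm = ChatOpenAI(
        model=os.environ.get(
            "OPENROUTER_MODEL", "mistralai/mistral-small-3.2-24b-instruct:free"
        ),
        api_key=os.environ["OPENROUTER_API_KEY"],
        base_url="https://openrouter.ai/api/v1",   # OpenRouter's OpenAI-compatible API
    )
    return llm.invoke(build_summary_prompt(lines)).content
```

Keeping prompt construction separate from the network call makes it possible to check the prompt without an API key.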

Recording Output

When using --record, three files are created:

  • session_YYYYMMDD_HHMMSS.wav - Audio recording
  • session_YYYYMMDD_HHMMSS_transcript.jsonl - Transcript with timestamps
  • session_YYYYMMDD_HHMMSS_summary.md - Session summary (generated on close)
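
The naming scheme above maps directly onto a strftime pattern. The helper below is a small sketch of how the three paths could be derived from a session start time; the function name and the directory argument are assumptions.

```python
from datetime import datetime
from pathlib import Path


def session_paths(started: datetime, directory: str = ".") -> dict[str, Path]:
    """Return the audio, transcript and summary paths for one session."""
    stem = "session_" + started.strftime("%Y%m%d_%H%M%S")
    base = Path(directory)
    return {
        "audio": base / f"{stem}.wav",
        "transcript": base / f"{stem}_transcript.jsonl",
        "summary": base / f"{stem}_summary.md",
    }
```

Because all three files share the same timestamped stem, they sort together in a directory listing.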

Troubleshooting

"ELEVEN_API_KEY is required"

  • Make sure you've created a .env file containing your ElevenLabs API key (ELEVEN_API_KEY=your_key_here)

No audio detected

  • Check that the microphone is connected and not muted
  • Try a different microphone if one is available

Slow transcription

  • This is normal behavior: audio is sent for transcription every 5 seconds
  • Lower DEAFINE_ELEVENLABS_CHUNK_SECS in .env for faster updates (this makes more API calls)

Want to reduce API costs?

  • Install webrtcvad: pip install webrtcvad
  • Saves ~60% by filtering silence

Cost Optimization

Mode         | Bandwidth | API Costs | Installation
Without VAD  | 100%      | Higher    | ✅ Easy (no build tools)
With VAD     | ~40%      | Lower     | ⚠️ Needs compiler (Windows)
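
To put rough numbers on the table, the sketch below estimates hourly audio volume from the capture format (16 kHz, with 16-bit samples assumed) and the fraction of audio kept after silence filtering. The ~40% figure is the README's own estimate; actual savings depend on how much silence a conversation contains, and the one-request-per-chunk model is an assumption.

```python
def audio_bytes_per_hour(sample_rate: int = 16_000, bytes_per_sample: int = 2,
                         kept_fraction: float = 1.0) -> int:
    """Raw PCM bytes sent per hour; kept_fraction models silence filtering."""
    return round(sample_rate * bytes_per_sample * 3600 * kept_fraction)


def requests_per_hour(chunk_secs: float = 5.0) -> int:
    """Upper bound on requests per hour, assuming one request per chunk."""
    return round(3600 / chunk_secs)
```

Without VAD this gives about 115 MB of audio per hour; keeping 40% brings that to about 46 MB. A 5-second chunk interval corresponds to at most 720 requests per hour.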

License

MIT License - Built for accessibility
