This repository provides Docker containers for running the Surgical Agent Framework. The framework is split into five main components:
- vLLM Server: Hosts the large language model for agent interactions
- Whisper Server: Provides real-time speech-to-text capabilities
- UI Server: Serves the web interface and coordinates communication
- TTS Server: Provides text-to-speech voice synthesis capabilities
- WebRTC USB Camera: Streams video from USB cameras via WebRTC
Each component runs in its own container for better isolation and scalability. The containers communicate over the host network for simplicity in development.
NOTE: This setup only works when the browser is opened on the host machine itself; it is not accessible through an SSH tunnel or when broadcast over a local network.
Use the automated script to easily build and run all components:
cd docker
./run-surgical-agents.sh

This will automatically:
- Check Docker availability
- Download the surgical LLM model if needed
- Build all Docker images
- Start all containers
- Show status and available endpoints
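The first step above (checking Docker availability) can be sketched as follows. This is a hypothetical illustration of what such a check looks like, not the actual script code:

```shell
# Sketch of a Docker availability check (hypothetical; the real script may differ).
# Fails fast if the Docker CLI or the Docker daemon is missing.
check_docker() {
  if ! command -v docker >/dev/null 2>&1; then
    echo "ERROR: docker CLI not found in PATH" >&2
    return 1
  fi
  if ! docker info >/dev/null 2>&1; then
    echo "ERROR: Docker daemon is not running or not accessible" >&2
    return 1
  fi
}

if check_docker; then
  echo "Docker is available"
fi
```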
./run-surgical-agents.sh [ACTION] [COMPONENT]

Available Actions:
- build - Build Docker images only
- run - Run containers (assumes images exist)
- build_and_run - Build images and run containers (default)
- download - Download the surgical LLM model
- stop - Stop running containers
- logs - Show container logs
- status - Show container status
- help - Show help message
Available Components (optional):
- vllm - vLLM server only
- whisper - Whisper server only
- ui - UI server only
- tts - TTS server only
- webrtc_usbcam - WebRTC USB Camera server only
- (no component) - All components (default)
# Build and run everything (default)
./run-surgical-agents.sh
# Build all components
./run-surgical-agents.sh build
# Run only the UI server
./run-surgical-agents.sh run ui
# Stop all containers
./run-surgical-agents.sh stop
# View logs for vLLM server
./run-surgical-agents.sh logs vllm
# Check status of all containers
./run-surgical-agents.sh status
# Download the surgical model only
./run-surgical-agents.sh download

The following environment variables can be used to customize the deployment:
GPU_MEMORY_UTILIZATION - Controls how much GPU memory the vLLM server uses (default: 0.25)
# Use 50% of GPU memory instead of default 25%
GPU_MEMORY_UTILIZATION=0.5 ./run-surgical-agents.sh
# Or for specific operations
GPU_MEMORY_UTILIZATION=0.8 ./run-surgical-agents.sh run vllm

VLLM_ENFORCE_EAGER - Enables enforce-eager mode for vLLM execution (default: false)
# Enable enforce eager mode for debugging or compatibility
VLLM_ENFORCE_EAGER=true ./run-surgical-agents.sh run vllm
# Combine multiple environment variables
GPU_MEMORY_UTILIZATION=0.5 VLLM_ENFORCE_EAGER=true ./run-surgical-agents.sh

The framework includes optimized support for NVIDIA IGX Thor devices. When running on IGX Thor hardware, the system automatically detects the device and uses NVIDIA's prebuilt optimized vLLM container instead of building from source. This provides faster setup times and better performance on Thor hardware while maintaining full compatibility with x86_64 and aarch64 platforms. No additional configuration is required.
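Variables like these are typically consumed in a wrapper script via shell parameter expansion, falling back to the documented defaults when unset. A minimal sketch (not the actual script code):

```shell
# Hypothetical sketch: apply the documented defaults when the caller
# did not set the variables (0.25 GPU memory, eager mode off).
GPU_MEMORY_UTILIZATION="${GPU_MEMORY_UTILIZATION:-0.25}"
VLLM_ENFORCE_EAGER="${VLLM_ENFORCE_EAGER:-false}"

echo "GPU memory utilization: ${GPU_MEMORY_UTILIZATION}"
echo "Enforce eager: ${VLLM_ENFORCE_EAGER}"
```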
Once running, the following services will be available:
- vLLM Server: http://localhost:8000 (OpenAI API compatible)
- Whisper Server: http://localhost:8765 (Speech-to-Text)
- UI Server: http://localhost:8050 (Web Interface)
- TTS Server: http://localhost:8082 (Text-to-Speech)
- WebRTC USB Camera: http://localhost:8080 (USB Camera WebRTC Stream)
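A quick way to check that the HTTP services above are reachable is a curl probe loop (a sketch; assumes curl is installed and the containers are running; the Whisper server typically speaks WebSocket rather than plain HTTP, so it is omitted here; `/v1/models` is the standard OpenAI-compatible listing endpoint served by vLLM):

```shell
# Probe each service endpoint and report whether it responds.
for url in \
  http://localhost:8000/v1/models \
  http://localhost:8050 \
  http://localhost:8082/api/health \
  http://localhost:8080; do
  if curl -fsS --max-time 2 "$url" >/dev/null 2>&1; then
    echo "UP   $url"
  else
    echo "DOWN $url"
  fi
done
```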
The following manual commands are available for advanced users who prefer direct Docker control:
- Build
git clone -b v0.8.4-dgpu git@github.com:mingxin-zheng/vllm.git
cd vllm
DOCKER_BUILDKIT=1 docker build . \
--file docker/Dockerfile \
--target vllm-openai \
--platform "linux/arm64" \
-t vlm-surgical-agents:vllm-openai-v0.8.3-dgpu \
--build-arg RUN_WHEEL_CHECK=false
cd ..
rm -rf vllm
- Download the model to <path-to-repo>/models/llm as the README describes
- Run
docker run -it --rm --net host --gpus all \
-v <path-to-repo>/models:/vllm-workspace/models \
vlm-surgical-agents:vllm-openai-v0.8.3-dgpu \
--model models/llm/Qwen2.5-VL-7B-Surg-CholecT50 \
--enforce-eager \
--max-model-len 4096 \
--max-num-seqs 8 \
--load-format bitsandbytes \
--quantization bitsandbytes

- Build
docker build \
-t vlm-surgical-agents:whisper-dgpu \
-f docker/Dockerfile.whisper .

- Run (model will be automatically downloaded)
docker run -it --rm --gpus all --net host \
-v <path-to-repo>/models/whisper:/root/whisper \
vlm-surgical-agents:whisper-dgpu \
--model_cache_dir /root/whisper

- Build
docker build -t vlm-surgical-agents:ui -f docker/Dockerfile.ui .

- Run
docker run -it --rm --net host vlm-surgical-agents:ui

You can now access the UI at http://localhost:8050
The Surgical Agent Framework supports two Text-to-Speech (TTS) options:
- Local TTS Service (Default) - runs on your hardware
- ElevenLabs TTS - Cloud-based, requires API key
The TTS service is included when you run all services:
# Build and run all services (including local TTS)
./run-surgical-agents.sh
# Or run local TTS service only
./run-surgical-agents.sh run tts

# Run the test script to verify everything is working
python3 ../test-tts.py

- Open http://localhost:8050 in your browser
- In the "Text-to-Speech" panel:
- ✅ Enable voice responses
- 🎯 Select "Local TTS" (default)
- Start a conversation and enjoy voice responses!
# Start TTS service
./run-surgical-agents.sh run tts
# Stop TTS service
./run-surgical-agents.sh stop tts
# View TTS logs
./run-surgical-agents.sh logs tts
# Check service status
./run-surgical-agents.sh status

The TTS model is stored persistently in:
- Host Directory: ./tts-service/models/
- Container Path: /root/.local/share/tts (symlinked to volume)
- Auto-download: The model (tts_models/en/ljspeech/vits) downloads automatically on first use
When running, the TTS service is available at:
- Health Check: http://localhost:8082/api/health
- API Documentation: http://localhost:8082/docs
- Models List: http://localhost:8082/api/models
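The health-check endpoint above makes it easy to verify the TTS container before opening the UI. A small sketch (assumes curl is installed and the tts container is running):

```shell
# Probe the local TTS service's health endpoint (listed above).
tts_health() {
  curl -fsS --max-time 2 http://localhost:8082/api/health && echo
}

if ! tts_health; then
  echo "TTS service is not reachable on port 8082 (is the tts container running?)"
fi
```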
- Build
docker build -t vlm-surgical-agents:tts -f tts-service/Dockerfile tts-service

- Run (models will be automatically downloaded on first use)
docker run -it --rm --gpus all --net host \
-v <path-to-repo>/tts-service/models:/app/models \
-v <path-to-repo>/tts-service/cache:/app/cache \
-e TTS_MODELS_DIR=/app/models \
-e TTS_CACHE_DIR=/app/cache \
-e TTS_USE_CUDA=true \
-e PORT=8082 \
vlm-surgical-agents:tts

TTS Service Won't Start:
# Check if port 8082 is in use
sudo netstat -tlnp | grep 8082
# Check Docker logs
./run-surgical-agents.sh logs tts

No Audio Output:
# Test the integration
python3 ../test-tts.py
# Check browser audio permissions

GPU Not Detected:
# Check NVIDIA Docker runtime
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi

The WebRTC USB Camera server streams video from USB cameras via the WebRTC protocol. This is useful for testing real-time video feeds into the surgical agent framework.
# Build and run the WebRTC USB Camera server
./run-surgical-agents.sh build_and_run webrtc_usbcam
# Or with custom configuration
CAMERA_INDEX=1 CAMERA_FPS=60 WEBRTC_PORT=9090 ./run-surgical-agents.sh run webrtc_usbcam

NOTE: The server only runs if a USB camera is connected to the system. By default, ./run-surgical-agents.sh will not start this server.
The WebRTC USB Camera server supports the following environment variables:
- CAMERA_INDEX: USB camera device index (default: 0)
- CAMERA_FPS: Target frames per second (default: 30)
- WEBRTC_PORT: Server port (default: 8080)
Examples:
# Use camera 1 at 60 FPS on port 9090
CAMERA_INDEX=1 CAMERA_FPS=60 WEBRTC_PORT=9090 ./run-surgical-agents.sh run webrtc_usbcam
# Use default camera (0) at 30 FPS on default port (8080)
./run-surgical-agents.sh run webrtc_usbcam

To find available camera indices on your system:
# Check camera capabilities
v4l2-ctl --device=/dev/video0 --list-formats-ext

# Check container status
./run-surgical-agents.sh status

# View logs for a component
./run-surgical-agents.sh logs [component]

# Stop all containers
./run-surgical-agents.sh stop

# Restart everything
./run-surgical-agents.sh stop
./run-surgical-agents.sh build_and_run
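If you are unsure which /dev/video* index corresponds to your USB camera, a small probe loop can help (a sketch; assumes the v4l-utils package is installed; prints nothing if no video devices exist):

```shell
# Enumerate V4L2 devices and print driver info for each one found.
for dev in /dev/video*; do
  [ -e "$dev" ] || continue   # glob did not match any device
  echo "== $dev =="
  v4l2-ctl --device="$dev" --info 2>/dev/null || echo "  (not readable)"
done
```

The device number in each path (e.g. 0 in /dev/video0) is the value to pass as CAMERA_INDEX.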