Vendor-neutral GPU monitoring agent with risk intelligence.
One binary. Every GPU. Real risk scores — not just dashboards.
- Apple Silicon (M1–M5) — zero sudo, IOKit native
- NVIDIA consumer (RTX 3090/4090/5090) — nvidia-smi
- NVIDIA datacenter (H100/B200) — DCGM
- AMD (MI300X, RX 7900 XTX) — ROCm SMI
- Any Linux machine — hwmon/thermal sysfs
Fleet View for Mobile:
The left image shows a machine in an idle state; the right image shows the same machine running an inference job with Ollama.
# Download the latest release
curl -sfL https://github.com/keldron-ai/keldron-agent/releases/latest/download/keldron-agent-darwin-arm64 -o keldron-agent
chmod +x keldron-agent
./keldron-agent --local
# → Dashboard at http://localhost:9200
# → Prometheus metrics at http://localhost:9100/metrics

Or build from source (needs Node.js / npm or pnpm for the full dashboard):
git clone https://github.com/keldron-ai/keldron-agent.git
cd keldron-agent
make build
./keldron-agent --local

For production use, prefer a GitHub release binary. make build from a clone produces the full dashboard.
# AMD64
curl -sfL https://github.com/keldron-ai/keldron-agent/releases/latest/download/keldron-agent-linux-amd64 -o keldron-agent
chmod +x keldron-agent
./keldron-agent --local
# Or ARM64 (e.g. Graviton)
curl -sfL https://github.com/keldron-ai/keldron-agent/releases/latest/download/keldron-agent-linux-arm64 -o keldron-agent
chmod +x keldron-agent
./keldron-agent --local

Or with Docker (build the image locally, or use the registry image when published):
make docker-build
make docker-run

Pre-built image (when available on GHCR):
docker run --rm \
-p 9100:9100 -p 9200:9200 -p 8081:8081 \
-e KELDRON_OUTPUT_PROMETHEUS_HOST=0.0.0.0 \
-e KELDRON_API_HOST=0.0.0.0 \
-e KELDRON_HEALTH_BIND=0.0.0.0:8081 \
  ghcr.io/keldron-ai/keldron-agent:latest

With a config file:
docker run --rm \
-p 9100:9100 -p 9200:9200 -p 8081:8081 \
-e KELDRON_OUTPUT_PROMETHEUS_HOST=0.0.0.0 \
-e KELDRON_API_HOST=0.0.0.0 \
-e KELDRON_HEALTH_BIND=0.0.0.0:8081 \
-v $(pwd)/configs/keldron-agent.example.yaml:/etc/keldron/keldron-agent.yaml:ro \
  ghcr.io/keldron-ai/keldron-agent:latest

curl localhost:9100/metrics | grep keldron_gpu_temperature
# keldron_gpu_temperature_celsius{device_model="M4-Pro",device_vendor="apple",...} 52.3

Stream telemetry to the cloud for 180-day history, fleet analytics, and device health tracking.
Use the keldron-agent binary from releases or make build.
Option 1: Interactive login
keldron-agent login

Option 2: Non-interactive login with an API key
export KELDRON_CLOUD_API_KEY=kldn_live_your_key_here
keldron-agent login

Or pipe the key: printf '%s' 'kldn_live_your_key_here' | keldron-agent login
Option 3: Run the agent with cloud streaming (no login step)
export KELDRON_CLOUD_API_KEY=kldn_live_your_key_here
keldron-agent

Check your connection:
keldron-agent whoami

Sign up for free at app.keldron.ai.
| Command | Purpose |
|---|---|
| `login` | Authenticate with Keldron Cloud |
| `logout` | Remove stored credentials |
| `whoami` | Show current Cloud connection (masked API key and endpoint) |
| `scan` | One-shot device/fleet status query |
Run keldron-agent --help and keldron-agent <command> -h for flags.
Example Prometheus output (real data from Apple Silicon):
keldron_gpu_temperature_celsius{adapter="apple_silicon",behavior_class="soc_integrated",device_id="hostname:0",device_model="M4-Pro",device_vendor="apple"} 52.3
keldron_risk_composite{behavior_class="soc_integrated",device_id="hostname:0"} 12.4
keldron_risk_severity{device_id="hostname:0"} 0
keldron_power_cost_monthly{device_id="hostname:0"} 4.32
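The keldron_risk_severity gauge (0=normal through 4=critical) maps directly onto alert thresholds. A minimal shell sketch of such a check, run here against a sample metric line standing in for a live scrape of localhost:9100/metrics:

```shell
# Alert when keldron_risk_severity reaches warning (3) or critical (4).
# The sample line below stands in for: curl -s localhost:9100/metrics
metrics='keldron_risk_severity{device_id="hostname:0"} 0'
sev=$(printf '%s\n' "$metrics" | awk '/^keldron_risk_severity/ { print $NF; exit }')
if [ "${sev%.*}" -ge 3 ]; then
  echo "ALERT: risk severity $sev"
else
  echo "ok: risk severity $sev"
fi
# → ok: risk severity 0
```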
| Feature | nvidia-smi | keldron-agent |
|---|---|---|
| Raw temperature | ✅ | ✅ |
| Risk score (0–100) | ❌ | ✅ |
| "Time to thermal throttle" | ❌ | ✅ |
| Vendor-neutral | ❌ | ✅ |
| Power cost estimation | ❌ | ✅ |
| Prometheus endpoint | ❌ | ✅ |
Create keldron-agent.yaml:
agent:
  device_name: "my-workstation"
  poll_interval: "2s"   # 2s–5m; use 10s–30s in production to reduce CPU/network load
  log_level: "info"
  electricity_rate: 0.12

adapters:
  apple_silicon:        # Mac: set enabled: true
    enabled: true
  nvidia_consumer:      # Linux + NVIDIA: set enabled: true when nvidia-smi in PATH
    enabled: false
  dcgm:                 # Datacenter NVIDIA (H100/B200)
    enabled: false
  rocm:                 # AMD (MI300X, RX 7900)
    enabled: false

output:
  prometheus: true
  prometheus_port: 9100

Full config reference: configs/keldron-agent.example.yaml
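The power-cost gauges presumably follow from the configured electricity_rate and the measured power draw. A back-of-envelope check, assuming the straightforward watts × hours × rate / 1000 formula (the agent's exact calculation may differ):

```shell
# Rough cost check: 50 W sustained draw at $0.12/kWh (the example rate above).
watts=50
rate=0.12
awk -v w="$watts" -v r="$rate" 'BEGIN {
  hourly = w / 1000 * r            # kW × $/kWh
  printf "hourly:  $%.4f\n", hourly
  printf "monthly: $%.2f\n", hourly * 24 * 30
}'
# → hourly:  $0.0060
# → monthly: $4.32
```

Under these assumptions the numbers line up with the sample keldron_power_cost_monthly value shown earlier.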
| Metric | Type | Description |
|---|---|---|
| `keldron_gpu_temperature_celsius` | gauge | GPU temperature in Celsius |
| `keldron_gpu_hotspot_temperature_celsius` | gauge | GPU hotspot/junction temperature in Celsius |
| `keldron_gpu_power_watts` | gauge | GPU power draw in watts |
| `keldron_gpu_utilization_ratio` | gauge | GPU utilization 0–1 |
| `keldron_gpu_memory_used_bytes` | gauge | GPU memory used in bytes |
| `keldron_gpu_memory_total_bytes` | gauge | GPU memory total in bytes |
| `keldron_gpu_clock_sm_mhz` | gauge | GPU SM clock in MHz |
| `keldron_gpu_clock_max_mhz` | gauge | GPU max clock in MHz |
| `keldron_gpu_throttle_active` | gauge | 1 if GPU is throttled, 0 otherwise |
| `keldron_cpu_temperature_celsius` | gauge | CPU temperature in Celsius |
| `keldron_fan_speed_rpm` | gauge | Fan speed in RPM |
| `keldron_system_swap_used_bytes` | gauge | System swap used in bytes |
| `keldron_system_swap_total_bytes` | gauge | System swap total in bytes |
| `keldron_device_uptime_seconds` | gauge | Device uptime in seconds |
| `keldron_risk_composite` | gauge | Composite risk score |
| `keldron_risk_thermal` | gauge | Thermal risk score |
| `keldron_risk_power` | gauge | Power risk score |
| `keldron_risk_volatility` | gauge | Volatility risk score |
| `keldron_risk_memory` | gauge | Memory-related risk score |
| `keldron_risk_severity` | gauge | 0=normal, 1=active, 2=elevated, 3=warning, 4=critical |
| `keldron_risk_warming_up` | gauge | 1 if device warming up, 0 otherwise |
| `keldron_gpu_memory_pressure_ratio` | gauge | GPU memory used/total ratio |
| `keldron_gpu_clock_efficiency` | gauge | GPU clock efficiency ratio |
| `keldron_power_cost_hourly` | gauge | Estimated power cost per hour |
| `keldron_power_cost_daily` | gauge | Estimated power cost per day |
| `keldron_power_cost_monthly` | gauge | Estimated power cost per month |
| `keldron_gpu_hotspot_delta_celsius` | gauge | Hotspot minus edge temp (NVIDIA only); -1 if unavailable |
| `keldron_agent_info` | gauge | Agent info (always 1) |
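Every metric above is a plain Prometheus text-format gauge, so standard tooling applies. For example, pulling the numeric value out of a scraped line (sample line taken from the Apple Silicon output earlier):

```shell
# Extract the value field from a Prometheus text-format metric line.
line='keldron_gpu_temperature_celsius{adapter="apple_silicon",device_model="M4-Pro",device_vendor="apple"} 52.3'
printf '%s\n' "$line" | awk '{ print $NF }'
# → 52.3
```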
Adapters (IOKit, NVML, ROCm, hwmon) → Normalizer → Risk Engine → Outputs:

- Prometheus /metrics
- Stdout JSON
- Local dashboard :9200 (embedded UI)
- Keldron Cloud (optional)
The web UI is embedded at build time from frontend/ into internal/api/static/ (make build or the Dockerfile). The committed internal/api/static/index.html is only a fallback so bare go build succeeds without Node.js.
The agent is read-only — it reads hardware sensors and computes scores. It does not execute arbitrary commands or alter system state beyond writing its own credential file (~/.keldron/credentials, created with 0600 permissions). Local HTTP servers (web UI on port 9200, Prometheus metrics on port 9100, health endpoint on port 8081) bind to 127.0.0.1 by default and are not exposed on public interfaces unless explicitly reconfigured.
- All HTTP servers bind to 127.0.0.1 (localhost) by default. Override via config for LAN access.
- Cloud telemetry is transmitted over HTTPS with TLS 1.2+.
- Credentials are stored with restricted file permissions (0600).
- The agent contains no tracking, analytics, or telemetry about your usage — only hardware sensor data.
To report a security issue, email ransom@keldron.ai.
A pre-built Grafana dashboard and Prometheus config live in examples/.
Quick start:
- Start the agent: ./keldron-agent --local
- Start Prometheus + Grafana: cd examples && docker compose -f docker-compose.grafana.yml up -d
- Open Grafana at http://localhost:3000 (admin / password set via the GF_ADMIN_PASSWORD env var)
- Add a Prometheus data source: URL http://prometheus:9090
- Import examples/grafana-dashboard.json (Dashboards → Import)
Keldron also exposes metrics at /metrics on the agent for any Grafana or Prometheus setup you already run.
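If you already run Prometheus, pointing it at the agent takes one scrape job. A minimal sketch written to a temporary path (the repo's examples/ directory ships a fuller config; filename and interval here are illustrative):

```shell
# Write a minimal Prometheus scrape config for the agent's default
# metrics port (9100).
cat > /tmp/keldron-prometheus.yml <<'EOF'
scrape_configs:
  - job_name: keldron
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:9100']
EOF
```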
Want fleet dashboards, 180-day history, and device health analytics?
→ Sign up free at app.keldron.ai
→ Learn more at keldron.ai
PRs welcome. See our contributing guide.
Apache 2.0 — see LICENSE.


