👉 Try the live demo — read-only, seeded with realistic incidents, no signup
AI-powered incident response for DevOps teams. Connects to your monitoring stack via MCP to investigate incidents, scan for problems before they alert, and deliver structured RCA reports with evidence — automatically. Skip the part where you tab between five dashboards at 3am trying to figure out what broke.
The Operations Desk — live service catalog, investigation log, recent scan runs, and event rail in one view.
- Context-enriched alerts, automatically — every trigger (operator chat, Alertmanager webhook, health-poller transition, or scheduled scan) auto-runs a bounded 6-phase RCA pipeline (metrics + logs + infra + changes, in parallel). The page arrives already investigated in ~2 minutes, no human in the loop.
- Deep Investigation — an autonomous root-cause agent — point it at an incident and it hypothesizes, tests, and follows the cause across service boundaries until it confirms one — or pauses and hands an ambiguous call back to you. Streams live, writes its conclusion back to the report. (Ships gated off.)
- Proactive scanning — a cron probe evaluates PromQL/LogQL rules per service (availability, restart storms, log-error bursts, custom) and auto-investigates when one trips.
- AI service discovery — a setup wizard walks your Prometheus/Loki stack and populates the catalog with metrics, log labels, and probe rules (`npm run discover` for CI).
- MCP-agnostic — wire in Grafana, Kubernetes, GitLab, Coroot, or any MCP backend, assigned by role (metrics, logs, infra, changes, dependencies).
- Notifications — Slack + email per completed investigation, with per-recipient severity and source filters (Teams-safe HTML).
- Operations Desk + Activity — a live SOC console plus a unified `/activity` view (Investigations · Scans · Patterns · Events); real-time over WebSocket, or a terminal CLI.
- Multi-stack — prod, staging, and dev side-by-side, each with its own providers, services, rules, and history.
- LLM resilience — model calls retry with backoff on transient blips and fail loudly (never spin) when the provider stays down.
- Deploy anywhere — single Docker image, Helm chart, or `npm run web`.
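The "LLM resilience" behaviour in the list above (retry with backoff, then fail loudly) could look roughly like the sketch below. Every name, default, and the rule for what counts as transient are assumptions made for illustration, not the project's actual implementation.

```typescript
// Illustrative only: names and defaults are assumptions, not the project's API.

// Exponential backoff delay: baseMs * 2^attempt, capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 8000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

// Assumed classification of "transient blips": timeouts and 429/503-style errors.
function isTransient(err: unknown): boolean {
  return err instanceof Error && /timeout|429|503/i.test(err.message);
}

async function callWithRetry<T>(
  call: () => Promise<T>,
  maxAttempts = 4,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown = undefined;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await call();
    } catch (err) {
      // Permanent errors surface immediately instead of being retried.
      if (!isTransient(err)) throw err;
      lastError = err;
      // Wait before the next attempt, but not after the final one.
      if (attempt < maxAttempts - 1) await sleep(backoffDelayMs(attempt));
    }
  }
  // Attempt budget spent: fail loudly rather than spin.
  throw new Error(
    `provider still failing after ${maxAttempts} attempts: ${String(lastError)}`,
  );
}
```

The bounded attempt count is what keeps a provider outage from turning into a silent retry loop: the caller always gets either a result or an error.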
```bash
npm install
cp config.yaml.example config.yaml   # then edit
export OPENAI_API_KEY=sk-...
npm run web                          # port 3000
```

Open http://localhost:3000. The setup wizard walks you through Connect Provider → Discover Services → Monitor — point it at your Grafana MCP server and the AI populates the service catalog from your Prometheus labels. Headless equivalent: `npm run discover`.
For the full configuration reference (providers, scan rules, webhooks, notifications, SMTP), see the Ops Runbook or config.yaml.example.
dops-assistant gives you two ways to investigate, and they're built for different jobs:
| | Investigation Pipeline | Deep Investigation |
|---|---|---|
| What it is | Automatic first responder | Autonomous root-cause agent |
| Trigger | Every alert / scan / health transition | You point it at an incident |
| Shape | Fixed 6-phase pass, runs once | Unbounded reason→act loop |
| Goal | What's wrong — context-enriched alert | What's the real cause — across services |
| Cost | ~1× (~2 min) | 3–10× (guarded) |
The pipeline is what turns a raw alert into an answer. All four trigger sources (below) converge on the same bounded, 6-phase workflow, so an alert, scan hit, or health transition becomes an evidence-backed RCA report — root cause, timeline, evidence, recommended actions — with no human in the loop. Evidence gathering (metrics, logs, infra, changes) runs in parallel for speed; each agent gets only the MCP tools relevant to its role, so the metrics agent never sees log query tools and vice versa. One pass, ~2–3 minutes. By the time you open the page, the alert is already investigated.
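As a rough sketch of the parallel, role-scoped evidence phase described above: each agent receives only the tools tagged with its own role, and all four agents run concurrently. The tool names, the `McpTool` shape, and the `runAgent` callback are hypothetical and stand in for whatever the real pipeline uses.

```typescript
// Illustrative only: types, tool names, and function names are assumptions.
type Role = "metrics" | "logs" | "infra" | "changes";

interface McpTool {
  name: string;
  role: Role;
}

// Each agent only sees the tools registered for its own role.
function toolsForRole(tools: McpTool[], role: Role): McpTool[] {
  return tools.filter((tool) => tool.role === role);
}

type RunAgent = (role: Role, tools: McpTool[]) => Promise<string>;

// Runs all four evidence agents concurrently and collects one finding per role.
async function gatherEvidence(
  tools: McpTool[],
  runAgent: RunAgent,
): Promise<Record<Role, string>> {
  const roles: Role[] = ["metrics", "logs", "infra", "changes"];
  const findings = await Promise.all(
    roles.map((role) => runAgent(role, toolsForRole(tools, role))),
  );
  const result = {} as Record<Role, string>;
  roles.forEach((role, i) => {
    result[role] = findings[i];
  });
  return result;
}

// Hypothetical tool registry used for the example.
const exampleTools: McpTool[] = [
  { name: "query_prometheus", role: "metrics" },
  { name: "query_loki", role: "logs" },
  { name: "list_pods", role: "infra" },
  { name: "list_merge_requests", role: "changes" },
];
```

Because the phases run under `Promise.all`, total latency tracks the slowest agent rather than the sum of all four.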
The pipeline tells you what's wrong on every alert. Deep Investigation answers the harder question — what's the real, underlying cause — for the incidents that don't yield to a single pass.
It's an autonomous orchestrator. Instead of running a fixed sequence, it decides its next move every turn and keeps going until it earns a conclusion:
hypothesize → query evidence → test → follow the cause across services → conclude
Open any finished investigation and pick Investigate deeply:
- Challenge this RCA — a fast re-judge of the causes the pipeline ruled out.
- Full deep investigation — the unbounded hunt. It forms candidate causes, gathers read-only evidence for each, scores them against a deterministic test, and — crucially — follows the cause across service boundaries along your dependency graph. A degraded `checkout` that's really a starved `payments` connection pool gets traced to `payments`, not blamed on the symptom.
It stops honestly. A conclusion only sticks when the leading hypothesis is deterministically confirmed by evidence — the model's own confidence is never the gate. When it rules out every idea it had without finding the cause, it doesn't guess: it pauses and asks you — continue, escalate to on-call, or instrument & wait. Every other exit is a hard guard (token budget, wall-clock, query cap, consecutive strikes), because an autonomous run costs 3–10× a normal one.
A Full deep investigation streaming live in the Console — each move (hypothesize → query → test) lands in real time, with a live progress indicator and the moves · queries · subagents · strikes · tokens footer.
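The exit conditions described above (deterministic confirmation, an operator pause when every hypothesis is ruled out, and hard guards for everything else) can be pictured as a single decision function evaluated after each move. This is a minimal sketch: the field names, the exit labels, and the order in which guards are checked are all assumptions, and the real design lives in the Autonomous Orchestrator document.

```typescript
// Illustrative only: field names, exit labels, and check order are assumptions.
interface RunState {
  confirmedByEvidence: boolean; // set by a deterministic evidence test, never by model confidence
  openHypotheses: number;       // candidate causes not yet ruled out
  tokensUsed: number;
  elapsedMs: number;
  queriesRun: number;
  consecutiveStrikes: number;
}

interface GuardLimits {
  maxTokens: number;
  maxElapsedMs: number;
  maxQueries: number;
  maxStrikes: number;
}

type Exit =
  | "confirmed"
  | "operator_pause"
  | "token_budget"
  | "wall_clock"
  | "query_cap"
  | "strikes"
  | "continue";

function decideExit(state: RunState, limits: GuardLimits): Exit {
  if (state.confirmedByEvidence) return "confirmed";
  // Hard guards bound the cost of an autonomous run.
  if (state.tokensUsed >= limits.maxTokens) return "token_budget";
  if (state.elapsedMs >= limits.maxElapsedMs) return "wall_clock";
  if (state.queriesRun >= limits.maxQueries) return "query_cap";
  if (state.consecutiveStrikes >= limits.maxStrikes) return "strikes";
  // Every idea ruled out without a confirmed cause: hand the call back to the operator.
  if (state.openHypotheses === 0) return "operator_pause";
  return "continue";
}
```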
It shows its work. A confirmed run produces a causal chain — incident → each followed dependency (with the finding that pointed there) → root cause (with a Grafana deep-link to the exact panel that confirmed it) — plus a one-line trace summary like 12 moves · 5 queries · 2 subagents · confirmed at depth 1.
A confirmed run: the outcome banner, the cross-service causal chain with source attribution, and the trace summary.
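To make the causal chain and one-line trace summary concrete, here is a small sketch of how they might be modelled and rendered. The `ChainLink` and `Trace` shapes, and the wording used for non-confirmed outcomes, are assumptions. Only the confirmed summary format is taken from the example above.

```typescript
// Illustrative only: these shapes are assumptions, not the project's data model.
interface ChainLink {
  service: string;
  finding: string; // the evidence that pointed to this hop
}

interface Trace {
  moves: number;
  queries: number;
  subagents: number;
  outcome: "confirmed" | "paused" | "guard_stopped";
  depth: number; // number of dependency hops followed from the incident
}

// Renders the chain as "incident service → dependency → root cause".
function renderChain(chain: ChainLink[]): string {
  return chain.map((link) => link.service).join(" → ");
}

// Renders the one-line trace summary shown under a finished run.
function traceSummary(trace: Trace): string {
  const tail =
    trace.outcome === "confirmed"
      ? `confirmed at depth ${trace.depth}`
      : trace.outcome.replace("_", " ");
  return [
    `${trace.moves} moves`,
    `${trace.queries} queries`,
    `${trace.subagents} subagents`,
    tail,
  ].join(" · ");
}
```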
Runs are durable — they survive a reload, reattach live across tabs, and auto-park when nobody's watching so they don't burn tokens headless. When it confirms a cause, one click applies it back to the RCA report — operator-gated and reversible, with a banner preserving the original cause.
Deep Investigation ships gated off (`agent.autonomousInvestigationEnabled: false`) while it finishes live validation. The full move-loop design — every guard, the operator-pause sequence, the false-confirm cross-service guard — is in Autonomous Orchestrator — the Agentic Loop.
Investigations can start four ways.
Operator chat: type a message like `admin-task is returning 500 errors since 4pm`. The intent router detects an incident report and launches the full investigation pipeline with your message as context. Results stream back in real time.
Alert webhook: generate a bearer token in Settings → Alert Webhooks, then point Alertmanager at the webhook URL shown for that stack. The default stack uses `POST /api/webhook/alert`; non-default stacks use `POST /api/webhook/alert/<stack-slug>`. The handler validates the stack-scoped bearer token, deduplicates recent alerts, extracts service/severity/labels, merges them with the service's known metrics and log selectors from `services.yaml`, and runs a headless investigation.
```yaml
# alertmanager.yml
receivers:
  - name: alert-assistant
    webhook_configs:
      - url: http://assistant:3000/api/webhook/alert/<stack-slug>
        http_config:
          authorization:
            type: Bearer
            credentials: "<your-token>"
```

Investigation depth (quick / standard / full) can be mapped per severity in config.
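For a quick manual test of the webhook without waiting for Alertmanager, a request along these lines should be close. The route selection follows the default/non-default rule described above. The payload is a trimmed subset of Alertmanager's webhook format, and the assumption that the handler reads `service` and `severity` from the alert labels is a guess based on the description. The helper names are hypothetical.

```typescript
// Illustrative only: helper names are hypothetical and the payload is a trimmed
// subset of Alertmanager's webhook body; which labels the handler reads is assumed.

// Default stack: /api/webhook/alert. Other stacks: /api/webhook/alert/<stack-slug>.
function webhookUrl(baseUrl: string, stackSlug?: string): string {
  const path = stackSlug
    ? `/api/webhook/alert/${encodeURIComponent(stackSlug)}`
    : "/api/webhook/alert";
  return new URL(path, baseUrl).toString();
}

function buildAlertRequest(token: string, service: string, severity: string) {
  return {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${token}`,
    },
    body: JSON.stringify({
      version: "4",
      status: "firing",
      alerts: [
        {
          status: "firing",
          labels: { alertname: "HighErrorRate", service, severity },
          annotations: { summary: "manual test alert" },
          startsAt: new Date().toISOString(),
        },
      ],
    }),
  };
}

// Usage (not executed here):
//   await fetch(webhookUrl("http://assistant:3000", "staging"),
//               buildAlertRequest("<your-token>", "checkout", "critical"));
```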
Health poller: a background poller queries Prometheus every 60 seconds for deployment replica counts and `up` metrics. When a service transitions from healthy to down, it auto-fires a quick investigation. No extra config — the poller uses the service registry and each service's known selectors.
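The key detail in the health poller is that it reacts to the healthy-to-down transition, not to every poll that finds a service down. A minimal sketch of that edge detection follows. The class name is hypothetical, and treating a service first seen as down as "not a transition" is an assumption.

```typescript
// Illustrative only: class name and first-observation behaviour are assumptions.
type Health = "healthy" | "down";

class TransitionDetector {
  private last = new Map<string, Health>();

  // Returns true only when a service goes from healthy to down between two polls.
  observe(service: string, current: Health): boolean {
    const previous = this.last.get(service);
    this.last.set(service, current);
    return previous === "healthy" && current === "down";
  }
}
```

Remembering the previous state per service means a service that stays down for ten polls triggers one investigation, not ten.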
Proactive scan: a cron-scheduled probe walks every service on the cadence you set and evaluates four tracks of rules:
- Global rules — stack-wide availability written by the discovery agent, aware of whichever label key your stack actually uses (`app`, `service`, `job`, `deployment`)
- Per-service metric rules — discovery-written thresholds like pod-restart storms, using each service's real Kubernetes namespace
- Per-service log rules — LogQL `count_over_time(... |= error)` scoped to the service's real Loki labels
- Config-file defaults — hardcoded fallback rules from `config.yaml` when discovery hasn't run
Each rule has hysteresis (consecutive-tick counters) so a single flap doesn't fire a scan. When a rule trips, the probe spawns a headless investigation and the scan run lands on the Operations Desk with status, phase breakdown, and links to each child investigation.
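Hysteresis via consecutive-tick counters, as described above, can be sketched like this. The class name, the `requiredTicks` parameter, and the choice to fire once per breach streak are assumptions for illustration.

```typescript
// Illustrative only: names and the fire-once-per-streak behaviour are assumptions.
class HysteresisCounter {
  private requiredTicks: number;
  private streak = 0;
  private firing = false;

  constructor(requiredTicks: number) {
    this.requiredTicks = requiredTicks;
  }

  // Call once per probe tick. Returns true only on the tick where the rule trips.
  tick(breached: boolean): boolean {
    if (!breached) {
      // Any healthy tick resets the streak, so a single flap never fires.
      this.streak = 0;
      this.firing = false;
      return false;
    }
    this.streak += 1;
    if (this.streak >= this.requiredTicks && !this.firing) {
      this.firing = true;
      return true;
    }
    return false;
  }
}
```

With `requiredTicks` set to 3, a rule that breaches on two ticks, recovers, and then breaches on three consecutive ticks fires exactly once, on the third tick of the second streak.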
Every tick creates a durable ScanRun record at /scan/runs/:id — copy the link, download as PNG or Markdown, or fire it to Slack with one click.
A scan run that dispatched two investigations — the 3-phase probe/triage/investigate breakdown is preserved forever.
| Trigger | Context | Depth | Requires |
|---|---|---|---|
| Operator | High (natural language + time refs) | Configurable | Nothing extra |
| Alert webhook | Medium (alert labels + service config) | Per-severity template | Alertmanager config |
| Health poller | Medium (transition info + service config) | Quick | Prometheus provider |
| K8s event poller | Medium (pod restart + reason + service config) | Standard | Infrastructure (k8s) MCP provider |
| Proactive scan | Medium (rule trigger + service config) | Configurable per rule | scan.enabled: true |
Every investigation gets a shareable URL, a phase rail that streams live, and a structured RCA report with root cause, contributing factors, timeline, evidence (metrics + logs + infra + changes), and recommended actions.
Everything that happens in the system — investigations, scan runs, learned patterns, and lifecycle events — lives under a single /activity route, split into four tabs that share a filter-bar idiom (search, severity pills, status toggles, time-window shortcuts, URL-driven state).
Investigations — every investigation, filterable by severity / status / service / time. URL-driven filters mean bookmarks and browser history Just Work.
Scans — every probe tick the scheduler ran, with trigger (cron / manual / webhook), status, hits dispatched, and a deep link to each run's Probe → Triage → Investigate breakdown. Click into a run from anywhere it's referenced.
Patterns — the learned-pattern catalog. Every confirmed RCA contributes to a service's pattern library, scoped by severity + service, with drill-down to the source investigation that taught it.
Events — the persistent system feed. Investigation lifecycle (investigation_started / _completed / _failed), alert webhooks, scan-run completions, manual scan triggers, and provider health crossings — all backed by a 30-day retention window and filterable by kind / severity / service / time.
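URL-driven filter state, which the tabs above share, usually comes down to a round trip between a filter object and the query string. The sketch below shows one way to do that with the standard `URLSearchParams` API. The parameter names (`q`, `severity`, `status`, `window`) are hypothetical and may not match the real `/activity` route.

```typescript
// Illustrative only: the query parameter names are assumptions.
interface ActivityFilters {
  q?: string;
  severity?: string;
  status?: string;
  window?: string;
}

const FILTER_KEYS = ["q", "severity", "status", "window"] as const;

// Serializes the active filters into a query string, skipping empty values.
function toQuery(filters: ActivityFilters): string {
  const params = new URLSearchParams();
  for (const key of FILTER_KEYS) {
    const value = filters[key];
    if (value) params.set(key, value);
  }
  return params.toString();
}

// Restores filters from a query string such as location.search.
function fromQuery(query: string): ActivityFilters {
  const params = new URLSearchParams(query);
  const filters: ActivityFilters = {};
  for (const key of FILTER_KEYS) {
    const value = params.get(key);
    if (value) filters[key] = value;
  }
  return filters;
}
```

Keeping the URL as the single source of truth is what lets a bookmarked or shared link reopen the same filtered view.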
Every completed investigation can be delivered to Slack and email. Recipients are filtered independently on two axes — minimum severity and allowed trigger source — so each inbox only sees what it wants.
Slack (via incoming webhook): per-investigation summary posts, plus optional run-level scan summaries (always / hits-only / off).
Email (via SMTP): Teams-safe HTML body that renders the full RCA report — severity banner, summary, root cause with confidence, contributing factors, timeline, evidence (metrics + logs + infra + changes), recommended actions, and a deep link back to the investigation. Plain-text fallback included. Works with Microsoft Teams channel email addresses.
Manage recipients at Settings → Notifications in the UI — add, edit, toggle, and send a fixture RCA through the real pipeline with the per-row Test button.
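The two-axis recipient filter (minimum severity and allowed trigger source) reduces to a small predicate. This is a sketch under assumptions: the severity levels, source names, and `Recipient` shape are invented for the example.

```typescript
// Illustrative only: severity levels, source names, and shapes are assumptions.
type Severity = "info" | "warning" | "critical";
type TriggerSource = "operator" | "webhook" | "health" | "scan";

const SEVERITY_RANK: Record<Severity, number> = { info: 0, warning: 1, critical: 2 };

interface Recipient {
  name: string;
  enabled: boolean;
  minSeverity: Severity;
  sources: TriggerSource[];
}

interface CompletedInvestigation {
  severity: Severity;
  source: TriggerSource;
}

// A recipient is notified only if both axes match and the row is toggled on.
function shouldNotify(recipient: Recipient, inv: CompletedInvestigation): boolean {
  return (
    recipient.enabled &&
    SEVERITY_RANK[inv.severity] >= SEVERITY_RANK[recipient.minSeverity] &&
    recipient.sources.indexOf(inv.source) !== -1
  );
}
```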
The default landing page is a live SOC-style console: health strip, service catalog with status chips, investigation log, recent scan runs with a one-click Scan now button, and an event stream rail. Drilling into a service opens a tabbed detail view (metrics, history, dependencies, AI brief).
Live demo: wz.github.io/dops-assistant — fully interactive, hosted on GitHub Pages, no signup. Click into any investigation, browse the service catalog, drill into a scan run. Open the checkout-api incident to watch a completed Deep Investigation replay its move log and cross-service causal chain. Mutations are disabled (it's static), but every read path is real.
Want to run the same thing yourself?
Locally (Node server):
```bash
npm install
npm run seed:demo   # writes fixture data to data-demo/
npm run demo        # boots with DEMO_MODE=true on port 3000
```

As a static site (GitHub Pages — zero infra, zero cost):

```bash
npm run build:demo-static     # SPA + seed + static JSON snapshots
npx serve dist/web --single   # any static server works
```

The static build is what the deploy-demo GitHub Actions workflow ships to Pages — see demo/README.md for the one-time setup (a single toggle: Settings → Pages → Source: GitHub Actions). All mutating endpoints are disabled, no LLM calls are made, and no real infrastructure is touched.
- Docker — single image, mount `config.yaml` and `services.yaml`, pass `OPENAI_API_KEY`
- Helm — chart at `deploy/helm/dops-assistant`. Supports sub-path ingress via `APP_BASE_PATH`, SMTP creds via `extraEnvFrom` on an existing Secret, and ingress WebSocket timeout annotations for the ~60s LLM silent-thinking phase
- Process manager — `npm run build:web && npm run web` behind systemd, pm2, or your stack of choice
- Architecture Overview — system design, component details, data flow, design decisions
- Autonomous Orchestrator — the Agentic Loop — Deep Investigation's move-loop design: guards, the hybrid-stop keystone, operator-pause protocol, cross-service guard
- Ops Runbook — MCP setup, full config reference, tuning, troubleshooting
- Email Notifications Setup — SMTP, Teams tenant rules, GUI walkthrough
- Provider YAML Spec — writing custom MCP providers
- Changelog — release history
```bash
npm run web                    # web server (loads dev/.env, port 3000)
npm run cli                    # terminal REPL
npm run build:web              # build frontend (Vite → dist/web/)
npm run discover               # run AI service discovery
npm run test:discover-eval     # score discovery output quality (CI gates at 75/100)
npx tsx src/eval/rca-eval.ts   # score RCA report quality
npx vitest run                 # run tests (100+ files)
npx tsc --noEmit               # type check
```

Contributions welcome. Please open an issue first to discuss what you'd like to change.









