WZ/dops-assistant: AI-powered DevOps assistant with multiple MCP integrations

👉 Try the live demo — read-only, seeded with realistic incidents, no signup

AI-powered incident response for DevOps teams. Connects to your monitoring stack via MCP to investigate incidents, scan for problems before they alert, and deliver structured RCA reports with evidence — automatically. Skip the part where you tab between five dashboards at 3am trying to figure out what broke.

The Operations Desk — live service catalog, investigation log, recent scan runs, and event rail in one view.

Features

Context-enriched alerts, automatically — every trigger (operator chat, Alertmanager webhook, health-poller transition, or scheduled scan) auto-runs a bounded 6-phase RCA pipeline (metrics + logs + infra + changes, in parallel). The page arrives already investigated in ~2 minutes, no human in the loop.
Deep Investigation, an autonomous root-cause agent: point it at an incident and it hypothesizes, tests, and follows the cause across service boundaries until it confirms one, or pauses and hands an ambiguous call back to you. It streams live and writes its conclusion back to the report. (Ships gated off.)
Proactive scanning — a cron probe evaluates PromQL/LogQL rules per service (availability, restart storms, log-error bursts, custom) and auto-investigates when one trips.
AI service discovery — a setup wizard walks your Prometheus/Loki stack and populates the catalog with metrics, log labels, and probe rules (npm run discover for CI).
MCP-agnostic — wire in Grafana, Kubernetes, GitLab, Coroot, or any MCP backend, assigned by role (metrics, logs, infra, changes, dependencies).
Notifications — Slack + email per completed investigation, with per-recipient severity and source filters (Teams-safe HTML).
Operations Desk + Activity — a live SOC console plus a unified /activity view (Investigations · Scans · Patterns · Events); real-time over WebSocket, or a terminal CLI.
Multi-stack — prod, staging, and dev side-by-side, each with its own providers, services, rules, and history.
LLM resilience — model calls retry with backoff on transient blips and fail loudly (never spin) when the provider stays down.
Deploy anywhere — single Docker image, Helm chart, or npm run web.

Quick Start

npm install
cp config.yaml.example config.yaml   # then edit
export OPENAI_API_KEY=sk-...
npm run web                           # port 3000

Open http://localhost:3000. The setup wizard walks you through Connect Provider → Discover Services → Monitor — point it at your Grafana MCP server and the AI populates the service catalog from your Prometheus labels. Headless equivalent: npm run discover.

For the full configuration reference (providers, scan rules, webhooks, notifications, SMTP), see the Ops Runbook or config.yaml.example.

How It Works

dops-assistant gives you two ways to investigate, and they're built for different jobs:

| | Investigation Pipeline | Deep Investigation |
| --- | --- | --- |
| What it is | Automatic first responder | Autonomous root-cause agent |
| Trigger | Every alert / scan / health transition | You point it at an incident |
| Shape | Fixed 6-phase pass, runs once | Unbounded reason→act loop |
| Goal | What's wrong: a context-enriched alert | What's the real cause, across services |
| Cost | ~1× (~2 min) | 3–10× (guarded) |

Investigation Pipeline — your automatic first responder

The pipeline is what turns a raw alert into an answer. All four trigger sources (below) converge on the same bounded, 6-phase workflow, so an alert, scan hit, or health transition becomes an evidence-backed RCA report — root cause, timeline, evidence, recommended actions — with no human in the loop. Evidence gathering (metrics, logs, infra, changes) runs in parallel for speed; each agent gets only the MCP tools relevant to its role, so the metrics agent never sees log query tools and vice versa. One pass, ~2–3 minutes. By the time you open the page, the alert is already investigated.
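The role scoping and parallel fan-out described above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the type names, the tool names used in any example data, and the `runAgent` callback are all assumptions.

```typescript
// Hypothetical sketch: role-scoped tools and parallel evidence gathering.
// All names here are illustrative assumptions, not dops-assistant's API.
type Role = "metrics" | "logs" | "infra" | "changes";

interface McpTool {
  name: string;
  role: Role; // the role this tool was assigned to
}

interface Evidence {
  role: Role;
  findings: string[];
}

// Each agent sees only the tools assigned to its own role.
function toolsForRole(tools: McpTool[], role: Role): McpTool[] {
  return tools.filter((tool) => tool.role === role);
}

type AgentRunner = (role: Role, tools: McpTool[]) => Promise<Evidence>;

// Start all four evidence agents at once and wait for every result.
async function gatherEvidence(
  tools: McpTool[],
  runAgent: AgentRunner
): Promise<Evidence[]> {
  const roles: Role[] = ["metrics", "logs", "infra", "changes"];
  return Promise.all(
    roles.map((role) => runAgent(role, toolsForRole(tools, role)))
  );
}
```

`Promise.all` rejects as soon as one agent fails. A real pipeline might prefer `Promise.allSettled` so that a failing provider degrades the report instead of aborting it; which behaviour dops-assistant uses is not stated here.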

Deep Investigation — an autonomous root-cause agent

The pipeline tells you what's wrong on every alert. Deep Investigation answers the harder question — what's the real, underlying cause — for the incidents that don't yield to a single pass.

It's an autonomous orchestrator. Instead of running a fixed sequence, it decides its next move every turn and keeps going until it earns a conclusion:

hypothesize → query evidence → test → follow the cause across services → conclude

Open any finished investigation and pick Investigate deeply:

Challenge this RCA — a fast re-judge of the causes the pipeline ruled out.
Full deep investigation — the unbounded hunt. It forms candidate causes, gathers read-only evidence for each, scores them against a deterministic test, and — crucially — follows the cause across service boundaries along your dependency graph. A degraded checkout that's really a starved payments connection pool gets traced to payments, not blamed on the symptom.

It stops honestly. A conclusion only sticks when the leading hypothesis is deterministically confirmed by evidence — the model's own confidence is never the gate. When it rules out every idea it had without finding the cause, it doesn't guess: it pauses and asks you — continue, escalate to on-call, or instrument & wait. Every other exit is a hard guard (token budget, wall-clock, query cap, consecutive strikes), because an autonomous run costs 3–10× a normal one.
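One way to picture that stop rule is a pure decision function evaluated after every move. The sketch below is an assumption-laden illustration: the field names, the guard names, and the order in which the guards are checked are hypothetical, and the real design is documented in Autonomous Orchestrator. Note that the state carries no model-confidence field at all.

```typescript
// Hypothetical sketch of the stop rule. Field names, limits, and check
// order are assumptions; they are not taken from the project source.
interface RunState {
  leadingConfirmed: boolean; // deterministic test passed on evidence
  openHypotheses: number;    // candidates not yet ruled out
  tokensUsed: number;
  elapsedMs: number;
  queriesRun: number;
  consecutiveStrikes: number;
}

interface Guards {
  maxTokens: number;
  maxElapsedMs: number;
  maxQueries: number;
  maxStrikes: number;
}

type Decision =
  | { kind: "conclude" }
  | { kind: "pause_for_operator" }
  | { kind: "halt"; guard: keyof Guards }
  | { kind: "continue" };

function decide(state: RunState, guards: Guards): Decision {
  // Only deterministic confirmation ends the run with a conclusion.
  if (state.leadingConfirmed) return { kind: "conclude" };
  // Hard guards bound the cost of an autonomous run.
  if (state.tokensUsed >= guards.maxTokens) return { kind: "halt", guard: "maxTokens" };
  if (state.elapsedMs >= guards.maxElapsedMs) return { kind: "halt", guard: "maxElapsedMs" };
  if (state.queriesRun >= guards.maxQueries) return { kind: "halt", guard: "maxQueries" };
  if (state.consecutiveStrikes >= guards.maxStrikes) return { kind: "halt", guard: "maxStrikes" };
  // Every idea ruled out without a confirmed cause: ask the operator.
  if (state.openHypotheses === 0) return { kind: "pause_for_operator" };
  return { kind: "continue" };
}
```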

A Full deep investigation streaming live in the Console — each move (hypothesize → query → test) lands in real time, with a live progress indicator and the moves · queries · subagents · strikes · tokens footer.

It shows its work. A confirmed run produces a causal chain — incident → each followed dependency (with the finding that pointed there) → root cause (with a Grafana deep-link to the exact panel that confirmed it) — plus a one-line trace summary like 12 moves · 5 queries · 2 subagents · confirmed at depth 1.
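As a rough illustration of the shape such a result could take, the sketch below models a causal chain and derives the one-line trace summary from it. The interfaces are hypothetical, and treating "depth" as the number of followed dependencies is an assumption rather than a documented definition.

```typescript
// Hypothetical data shapes; not the project's actual report schema.
interface ChainLink {
  service: string; // dependency that was followed
  finding: string; // the finding that pointed there
}

interface CausalChain {
  incident: string;
  followed: ChainLink[];
  rootCause: { service: string; summary: string; panelUrl?: string };
}

interface TraceCounts {
  moves: number;
  queries: number;
  subagents: number;
}

// Assumption: depth equals the number of dependency hops followed.
function formatTraceSummary(counts: TraceCounts, chain: CausalChain): string {
  return [
    `${counts.moves} moves`,
    `${counts.queries} queries`,
    `${counts.subagents} subagents`,
    `confirmed at depth ${chain.followed.length}`,
  ].join(" · ");
}
```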

A confirmed run: the outcome banner, the cross-service causal chain with source attribution, and the trace summary.

Runs are durable — they survive a reload, reattach live across tabs, and auto-park when nobody's watching so they don't burn tokens headless. When it confirms a cause, one click applies it back to the RCA report — operator-gated and reversible, with a banner preserving the original cause.

Deep Investigation ships gated off (agent.autonomousInvestigationEnabled: false) while it finishes live validation. The full move-loop design — every guard, the operator-pause sequence, the false-confirm cross-service guard — is in Autonomous Orchestrator — the Agentic Loop.

Triggers

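Investigations can start in four ways, described below. The summary table at the end of this section also lists a K8s event poller trigger.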

1. Operator (Web UI or CLI)

Type a message like admin-task is returning 500 errors since 4pm. The intent router detects an incident report and launches the full investigation pipeline with your message as context. Results stream back in real time.

2. Alert Webhook (Alertmanager)

Generate a bearer token in Settings → Alert Webhooks, then point Alertmanager at the webhook URL shown for that stack. The default stack uses POST /api/webhook/alert; non-default stacks use POST /api/webhook/alert/<stack-slug>. For each request the handler validates the stack-scoped bearer token, deduplicates recent alerts, and extracts the service, severity, and labels. It then merges these with the service's known metrics and log selectors from services.yaml and runs a headless investigation.

# alertmanager.yml
receivers:
  - name: alert-assistant
    webhook_configs:
      - url: http://assistant:3000/api/webhook/alert/<stack-slug>
        http_config:
          authorization:
            type: Bearer
            credentials: "<your-token>"

Investigation depth (quick / standard / full) can be mapped per severity in config.
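To make the webhook handling concrete, here is a minimal sketch of deduplication, label extraction, and severity-to-depth mapping. The `AmAlert` type covers only a subset of the fields Alertmanager sends per alert (`status`, `labels`, `fingerprint`). Everything else is an assumption for illustration: the dedup window, the label fallback order, the default severity, and the depth mapping values are not taken from the project's config.

```typescript
// Subset of one alert in an Alertmanager webhook payload.
interface AmAlert {
  status: "firing" | "resolved";
  labels: Record<string, string>;
  fingerprint: string;
}

type Depth = "quick" | "standard" | "full";

// Hypothetical mapping; the real one is defined in config.
const depthBySeverity: Record<string, Depth> = {
  critical: "full",
  warning: "standard",
  info: "quick",
};

// Returns a function that reports whether an alert is new within the window.
function createDeduper(windowMs: number): (alert: AmAlert, nowMs: number) => boolean {
  const lastSeen = new Map<string, number>();
  return (alert, nowMs) => {
    const prev = lastSeen.get(alert.fingerprint);
    if (prev !== undefined && nowMs - prev < windowMs) return false; // recent repeat
    lastSeen.set(alert.fingerprint, nowMs);
    return true;
  };
}

interface Plan {
  service: string;
  severity: string;
  depth: Depth;
}

// Extracts service and severity from labels and picks a depth.
function planInvestigation(alert: AmAlert): Plan | null {
  if (alert.status !== "firing") return null;
  // Assumed fallback order across common label keys.
  const service = alert.labels["service"] || alert.labels["app"] || alert.labels["job"];
  if (!service) return null;
  const severity = alert.labels["severity"] || "warning";
  return { service, severity, depth: depthBySeverity[severity] || "standard" };
}
```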

3. Health Poller

A background poller queries Prometheus every 60 seconds for deployment replica counts and up metrics. When a service transitions from healthy to down it auto-fires a quick investigation. No extra config — the poller uses the service registry and each service's known selectors.
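The transition check at the heart of such a poller can be sketched as a comparison between two consecutive snapshots. The rule used in `classify` (at least one ready replica and `up` equal to 1) and all names below are assumptions for illustration, not the project's actual health logic.

```typescript
type Health = "healthy" | "down";

// Hypothetical per-service sample taken on each poll.
interface Sample {
  readyReplicas: number;
  up: number; // value of the Prometheus `up` metric, 0 or 1
}

// Assumed rule: healthy needs a ready replica and a successful scrape.
function classify(sample: Sample): Health {
  return sample.readyReplicas > 0 && sample.up === 1 ? "healthy" : "down";
}

// Services that were healthy on the previous poll and are down now.
// A service first seen as down has no previous state, so it is not reported.
function healthyToDown(
  prev: Record<string, Health>,
  next: Record<string, Health>
): string[] {
  return Object.keys(next).filter(
    (service) => prev[service] === "healthy" && next[service] === "down"
  );
}
```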

4. Proactive Scan

A cron-scheduled probe walks every service on the cadence you set and evaluates four tracks of rules:

Global rules — stack-wide availability written by the discovery agent, aware of whichever label key your stack actually uses (app, service, job, deployment)
Per-service metric rules — discovery-written thresholds like pod-restart storms, using each service's real Kubernetes namespace
Per-service log rules — LogQL count_over_time(... |= error) scoped to the service's real Loki labels
Config-file defaults — hardcoded fallback rules from config.yaml when discovery hasn't run

Each rule has hysteresis (consecutive-tick counters) so a single flap doesn't fire a scan. When a rule trips, the probe spawns a headless investigation and the scan run lands on the Operations Desk with status, phase breakdown, and links to each child investigation.
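The consecutive-tick counter can be illustrated with a small state machine. This is a sketch under assumptions: the `RuleState` shape, the `threshold` parameter, and the choice to fire only once per breach streak are hypothetical, not the project's code.

```typescript
// Hypothetical per-rule state carried between scan ticks.
interface RuleState {
  consecutiveBreaches: number;
  firing: boolean;
}

interface TickResult {
  state: RuleState;
  fire: boolean; // true only on the tick where the rule first trips
}

function tick(state: RuleState, breached: boolean, threshold: number): TickResult {
  // Any healthy tick resets the streak, so a single flap never fires.
  if (!breached) {
    return { state: { consecutiveBreaches: 0, firing: false }, fire: false };
  }
  const consecutiveBreaches = state.consecutiveBreaches + 1;
  const tripped = consecutiveBreaches >= threshold;
  return {
    state: { consecutiveBreaches, firing: tripped },
    fire: tripped && !state.firing,
  };
}
```

With a threshold of 3, the breach sequence true, false, true, true, true, true fires once, on the fifth tick: the early flap is reset by the healthy tick, and the sixth tick does not fire again because the rule is already firing.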

Every tick creates a durable ScanRun record at /scan/runs/:id — copy the link, download as PNG or Markdown, or fire it to Slack with one click.

A scan run that dispatched two investigations — the 3-phase probe/triage/investigate breakdown is preserved forever.

| Trigger | Context | Depth | Requires |
| --- | --- | --- | --- |
| Operator | High (natural language + time refs) | Configurable | Nothing extra |
| Alert webhook | Medium (alert labels + service config) | Per-severity template | Alertmanager config |
| Health poller | Medium (transition info + service config) | Quick | Prometheus provider |
| K8s event poller | Medium (pod restart + reason + service config) | Standard | Infrastructure (k8s) MCP provider |
| Proactive scan | Medium (rule trigger + service config) | Configurable per rule | `scan.enabled: true` |

Investigations

Every investigation gets a shareable URL, a phase rail that streams live, and a structured RCA report with root cause, contributing factors, timeline, evidence (metrics + logs + infra + changes), and recommended actions.

Activity

Everything that happens in the system — investigations, scan runs, learned patterns, and lifecycle events — lives under a single /activity route, split into four tabs that share a filter-bar idiom (search, severity pills, status toggles, time-window shortcuts, URL-driven state).

Investigations — every investigation, filterable by severity / status / service / time. URL-driven filters mean bookmarks and browser history Just Work.

Scans — every probe tick the scheduler ran, with trigger (cron / manual / webhook), status, hits dispatched, and a deep link to each run's Probe → Triage → Investigate breakdown. Click into a run from anywhere it's referenced.

Patterns — the learned-pattern catalog. Every confirmed RCA contributes to a service's pattern library, scoped by severity + service, with drill-down to the source investigation that taught it.

Events — the persistent system feed. Investigation lifecycle (investigation_started / _completed / _failed), alert webhooks, scan-run completions, manual scan triggers, and provider health crossings — all backed by a 30-day retention window and filterable by kind / severity / service / time.

Notifications

Every completed investigation can be delivered to Slack and email. Recipients are filtered independently on two axes — minimum severity and allowed trigger source — so each inbox only sees what it wants.

Slack (via incoming webhook): per-investigation summary posts, plus optional run-level scan summaries (always / hits-only / off).

Email (via SMTP): Teams-safe HTML body that renders the full RCA report — severity banner, summary, root cause with confidence, contributing factors, timeline, evidence (metrics + logs + infra + changes), recommended actions, and a deep link back to the investigation. Plain-text fallback included. Works with Microsoft Teams channel email addresses.

Manage recipients at Settings → Notifications in the UI — add, edit, toggle, and send a fixture RCA through the real pipeline with the per-row Test button.
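The two-axis recipient filter can be sketched as a single predicate. The severity levels, source names, and `Recipient` fields below are assumptions chosen for illustration; the actual values are defined by the project's notification settings.

```typescript
// Hypothetical severity and source vocabularies.
type Severity = "info" | "warning" | "critical";
type Source = "operator" | "webhook" | "health_poller" | "scan";

interface Recipient {
  name: string;
  enabled: boolean;      // the per-row toggle
  minSeverity: Severity; // axis 1: minimum severity
  sources: Source[];     // axis 2: allowed trigger sources
}

const severityRank: Record<Severity, number> = { info: 0, warning: 1, critical: 2 };

// A recipient is notified only when it is enabled and both axes match.
function recipientsFor(all: Recipient[], severity: Severity, source: Source): Recipient[] {
  return all.filter(
    (r) =>
      r.enabled &&
      severityRank[severity] >= severityRank[r.minSeverity] &&
      r.sources.indexOf(source) !== -1
  );
}
```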

Operations Desk

The default landing page is a live SOC-style console: health strip, service catalog with status chips, investigation log, recent scan runs with a one-click Scan now button, and an event stream rail. Drilling into a service opens a tabbed detail view (metrics, history, dependencies, AI brief).

Try It

Live demo: wz.github.io/dops-assistant — fully interactive, hosted on GitHub Pages, no signup. Click into any investigation, browse the service catalog, drill into a scan run. Open the checkout-api incident to watch a completed Deep Investigation replay its move log and cross-service causal chain. Mutations are disabled (it's static), but every read path is real.

Want to run the same thing yourself?

Locally (Node server):

npm install
npm run seed:demo           # writes fixture data to data-demo/
npm run demo                # boots with DEMO_MODE=true on port 3000

As a static site (GitHub Pages — zero infra, zero cost):

npm run build:demo-static   # SPA + seed + static JSON snapshots
npx serve dist/web --single # any static server works

The static build is what the deploy-demo GitHub Actions workflow ships to Pages — see demo/README.md for the one-time setup (a single toggle: Settings → Pages → Source: GitHub Actions). All mutating endpoints are disabled, no LLM calls are made, and no real infrastructure is touched.

Deployment

Docker — single image, mount config.yaml and services.yaml, pass OPENAI_API_KEY
Helm — chart at deploy/helm/dops-assistant. Supports sub-path ingress via APP_BASE_PATH, SMTP creds via extraEnvFrom on an existing Secret, and ingress WebSocket timeout annotations for the ~60s LLM silent-thinking phase
Process manager — npm run build:web && npm run web behind systemd, pm2, or your stack of choice

Documentation

Architecture Overview — system design, component details, data flow, design decisions
Autonomous Orchestrator — the Agentic Loop — Deep Investigation's move-loop design: guards, the hybrid-stop keystone, operator-pause protocol, cross-service guard
Ops Runbook — MCP setup, full config reference, tuning, troubleshooting
Email Notifications Setup — SMTP, Teams tenant rules, GUI walkthrough
Provider YAML Spec — writing custom MCP providers
Changelog — release history

Development

npm run web                 # web server (loads dev/.env, port 3000)
npm run cli                 # terminal REPL
npm run build:web           # build frontend (Vite → dist/web/)
npm run discover            # run AI service discovery
npm run test:discover-eval  # score discovery output quality (CI gates at 75/100)
npx tsx src/eval/rca-eval.ts   # score RCA report quality
npx vitest run              # run tests (100+ files)
npx tsc --noEmit            # type check

Contributing

Contributions welcome. Please open an issue first to discuss what you'd like to change.

License

MIT
