
dops-assistant

Live Demo · License: MIT · Node >= 20 · TypeScript ESM

👉 Try the live demo — read-only, seeded with realistic incidents, no signup

AI-powered incident response for DevOps teams. Connects to your monitoring stack via MCP to investigate incidents, scan for problems before they alert, and deliver structured RCA reports with evidence — automatically. Skip the part where you tab between five dashboards at 3am trying to figure out what broke.

The Operations Desk — live service catalog, investigation log, recent scan runs, and event rail in one view.

Features

  • Context-enriched alerts, automatically — every trigger (operator chat, Alertmanager webhook, health-poller transition, or scheduled scan) auto-runs a bounded 6-phase RCA pipeline (metrics + logs + infra + changes, in parallel). The page arrives already investigated in ~2 minutes, no human in the loop.
  • Deep Investigation — an autonomous root-cause agent — point it at an incident and it hypothesizes, tests, and follows the cause across service boundaries until it confirms one — or pauses and hands an ambiguous call back to you. Streams live, writes its conclusion back to the report. (Ships gated off.)
  • Proactive scanning — a cron probe evaluates PromQL/LogQL rules per service (availability, restart storms, log-error bursts, custom) and auto-investigates when one trips.
  • AI service discovery — a setup wizard walks your Prometheus/Loki stack and populates the catalog with metrics, log labels, and probe rules (npm run discover for CI).
  • MCP-agnostic — wire in Grafana, Kubernetes, GitLab, Coroot, or any MCP backend, assigned by role (metrics, logs, infra, changes, dependencies).
  • Notifications — Slack + email per completed investigation, with per-recipient severity and source filters (Teams-safe HTML).
  • Operations Desk + Activity — a live SOC console plus a unified /activity view (Investigations · Scans · Patterns · Events); real-time over WebSocket, or a terminal CLI.
  • Multi-stack — prod, staging, and dev side-by-side, each with its own providers, services, rules, and history.
  • LLM resilience — model calls retry with backoff on transient blips and fail loudly (never spin) when the provider stays down.
  • Deploy anywhere — single Docker image, Helm chart, or npm run web.
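The "LLM resilience" bullet above describes retrying with backoff and then failing loudly. A minimal sketch of that pattern follows. The names `withRetry`, `backoffDelayMs`, and `RetryOptions`, and the default delays, are assumptions made for illustration and are not taken from the project's code.

```typescript
// Hypothetical sketch: names and defaults are illustrative, not the project's API.

/** Exponential backoff: baseMs * 2^attempt, capped at maxMs. */
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 8_000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

interface RetryOptions {
  maxAttempts: number;
  /** Decides whether an error is a transient blip worth retrying. */
  isTransient: (err: unknown) => boolean;
  /** Injectable sleep so the loop can be tested without real waiting. */
  sleep?: (ms: number) => Promise<void>;
}

async function withRetry<T>(call: () => Promise<T>, opts: RetryOptions): Promise<T> {
  const sleep =
    opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  let lastError: unknown;
  for (let attempt = 0; attempt < opts.maxAttempts; attempt++) {
    try {
      return await call();
    } catch (err) {
      lastError = err;
      // Non-transient errors are surfaced immediately instead of being retried.
      if (!opts.isTransient(err)) throw err;
      // Only wait when another attempt remains.
      if (attempt < opts.maxAttempts - 1) await sleep(backoffDelayMs(attempt));
    }
  }
  // Fail loudly once the attempt budget is spent; never loop forever.
  throw new Error(
    `provider still failing after ${opts.maxAttempts} attempts: ${String(lastError)}`,
  );
}
```

Bounding the loop by `maxAttempts` is what keeps a prolonged provider outage from turning into a silent spin.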

Quick Start

```sh
npm install
cp config.yaml.example config.yaml   # then edit
export OPENAI_API_KEY=sk-...
npm run web                           # port 3000
```

Open http://localhost:3000. The setup wizard walks you through Connect Provider → Discover Services → Monitor — point it at your Grafana MCP server and the AI populates the service catalog from your Prometheus labels. Headless equivalent: npm run discover.

For the full configuration reference (providers, scan rules, webhooks, notifications, SMTP), see the Ops Runbook or config.yaml.example.

How It Works

dops-assistant gives you two ways to investigate, and they're built for different jobs:

| | Investigation Pipeline | Deep Investigation |
|---|---|---|
| What it is | Automatic first responder | Autonomous root-cause agent |
| Trigger | Every alert / scan / health transition | You point it at an incident |
| Shape | Fixed 6-phase pass, runs once | Unbounded reason→act loop |
| Goal | What's wrong: a context-enriched alert | What's the real cause, across services |
| Cost | ~1× (~2 min) | 3–10× (guarded) |

System Overview

Investigation Pipeline — your automatic first responder

The pipeline is what turns a raw alert into an answer. All four trigger sources (below) converge on the same bounded, 6-phase workflow, so an alert, scan hit, or health transition becomes an evidence-backed RCA report — root cause, timeline, evidence, recommended actions — with no human in the loop. Evidence gathering (metrics, logs, infra, changes) runs in parallel for speed; each agent gets only the MCP tools relevant to its role, so the metrics agent never sees log query tools and vice versa. One pass, ~2–3 minutes. By the time you open the page, the alert is already investigated.
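The parallel, role-scoped evidence gathering described above could be sketched roughly as follows. The types `McpTool` and `Agent`, the functions `toolsForRole` and `gatherEvidence`, and the idea of tagging each tool with a single role are assumptions made for illustration; the project's real wiring may differ.

```typescript
// Hypothetical sketch: types and function names are illustrative only.
type Role = "metrics" | "logs" | "infra" | "changes";

interface McpTool {
  name: string;
  role: Role;
}

interface Evidence {
  role: Role;
  toolsUsed: string[];
  findings: string[];
}

/** An agent receives only the tools it is allowed to call. */
type Agent = (tools: McpTool[]) => Promise<string[]>;

/** Role scoping: the metrics agent never sees log tools, and vice versa. */
function toolsForRole(all: McpTool[], role: Role): McpTool[] {
  return all.filter((tool) => tool.role === role);
}

/** Runs the four evidence agents concurrently and returns results in role order. */
async function gatherEvidence(all: McpTool[], agents: Record<Role, Agent>): Promise<Evidence[]> {
  const roles: Role[] = ["metrics", "logs", "infra", "changes"];
  return Promise.all(
    roles.map(async (role) => {
      const tools = toolsForRole(all, role);
      const findings = await agents[role](tools);
      return { role, toolsUsed: tools.map((tool) => tool.name), findings };
    }),
  );
}
```

With `Promise.all`, the wall-clock cost of this phase is roughly that of the slowest agent rather than the sum of all four.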

Investigation Flow

Deep Investigation — an autonomous root-cause agent

The pipeline tells you what's wrong on every alert. Deep Investigation answers the harder question — what's the real, underlying cause — for the incidents that don't yield to a single pass.

It's an autonomous orchestrator. Instead of running a fixed sequence, it decides its next move every turn and keeps going until it earns a conclusion:

hypothesize → query evidence → test → follow the cause across services → conclude

Open any finished investigation and pick Investigate deeply:

  • Challenge this RCA — a fast re-judge of the causes the pipeline ruled out.
  • Full deep investigation — the unbounded hunt. It forms candidate causes, gathers read-only evidence for each, scores them against a deterministic test, and — crucially — follows the cause across service boundaries along your dependency graph. A degraded checkout that's really a starved payments connection pool gets traced to payments, not blamed on the symptom.
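To make "following the cause across service boundaries" concrete, here is a rough sketch of a breadth-first walk over a dependency graph that stops at the first service where a deterministic check confirms the cause. The `DependencyGraph` shape, `traceCause`, the `isConfirmedCause` callback, and the `maxDepth` default are all hypothetical; the actual agent chooses its moves dynamically rather than running a fixed traversal.

```typescript
// Hypothetical sketch: a fixed BFS stands in for the agent's dynamic move selection.
type DependencyGraph = Record<string, string[]>;

interface ChainLink {
  service: string;
  depth: number;
}

/**
 * Walks dependencies breadth-first from the symptomatic service and returns the
 * chain from that service to the first confirmed cause, or null if none is found.
 */
function traceCause(
  start: string,
  graph: DependencyGraph,
  isConfirmedCause: (service: string) => boolean,
  maxDepth = 3,
): ChainLink[] | null {
  const parent = new Map<string, string | null>([[start, null]]);
  const depthOf = new Map<string, number>([[start, 0]]);
  const queue: string[] = [start];

  while (queue.length > 0) {
    const service = queue.shift() as string;
    const depth = depthOf.get(service) ?? 0;

    if (isConfirmedCause(service)) {
      // Rebuild the path by following parent pointers back to the start.
      const chain: ChainLink[] = [];
      for (let cur: string | null = service; cur !== null; cur = parent.get(cur) ?? null) {
        chain.unshift({ service: cur, depth: depthOf.get(cur) ?? 0 });
      }
      return chain;
    }

    if (depth >= maxDepth) continue; // do not expand past the depth limit
    for (const dep of graph[service] ?? []) {
      if (!parent.has(dep)) {
        parent.set(dep, service);
        depthOf.set(dep, depth + 1);
        queue.push(dep);
      }
    }
  }
  return null;
}
```

For the checkout example, a graph of `{ checkout: ["payments", "cart"] }` with a check that only passes for `payments` yields the chain checkout (depth 0) to payments (depth 1).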

Deep Investigation — the autonomous move-loop

It stops honestly. A conclusion only sticks when the leading hypothesis is deterministically confirmed by evidence — the model's own confidence is never the gate. When it rules out every idea it had without finding the cause, it doesn't guess: it pauses and asks you — continue, escalate to on-call, or instrument & wait. Every other exit is a hard guard (token budget, wall-clock, query cap, consecutive strikes), because an autonomous run costs 3–10× a normal one.
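The exit rules in this paragraph can be read as a small decision function. The sketch below is one possible reading, with hypothetical names (`RunCounters`, `GuardLimits`, `trippedGuard`, `nextStep`) and a deliberately ignored `modelConfidence` field to show that confidence does not gate a conclusion.

```typescript
// Hypothetical sketch of the exit logic; field and type names are illustrative.
interface RunCounters {
  tokensUsed: number;
  elapsedMs: number;
  queries: number;
  consecutiveStrikes: number;
}

interface GuardLimits {
  maxTokens: number;
  maxWallClockMs: number;
  maxQueries: number;
  maxConsecutiveStrikes: number;
}

type GuardTrip = "token_budget" | "wall_clock" | "query_cap" | "consecutive_strikes";

/** Returns the first hard guard that has been hit, or null if none has. */
function trippedGuard(c: RunCounters, l: GuardLimits): GuardTrip | null {
  if (c.tokensUsed >= l.maxTokens) return "token_budget";
  if (c.elapsedMs >= l.maxWallClockMs) return "wall_clock";
  if (c.queries >= l.maxQueries) return "query_cap";
  if (c.consecutiveStrikes >= l.maxConsecutiveStrikes) return "consecutive_strikes";
  return null;
}

interface LoopState {
  /** Set only by the deterministic evidence test. */
  leadingConfirmedByEvidence: boolean;
  /** Present but intentionally unused: model confidence is never the gate. */
  modelConfidence: number;
  /** Hypotheses not yet ruled out. */
  openHypotheses: number;
}

type NextStep =
  | { kind: "conclude" }
  | { kind: "stop"; guard: GuardTrip }
  | { kind: "pause_for_operator" }
  | { kind: "continue" };

function nextStep(state: LoopState, counters: RunCounters, limits: GuardLimits): NextStep {
  if (state.leadingConfirmedByEvidence) return { kind: "conclude" };
  const guard = trippedGuard(counters, limits);
  if (guard !== null) return { kind: "stop", guard };
  // Every idea ruled out without a confirmed cause: ask the operator, do not guess.
  if (state.openHypotheses === 0) return { kind: "pause_for_operator" };
  return { kind: "continue" };
}
```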

A Full deep investigation streaming live in the Console — each move (hypothesize → query → test) lands in real time, with a live progress indicator and the moves · queries · subagents · strikes · tokens footer.

It shows its work. A confirmed run produces a causal chain — incident → each followed dependency (with the finding that pointed there) → root cause (with a Grafana deep-link to the exact panel that confirmed it) — plus a one-line trace summary like 12 moves · 5 queries · 2 subagents · confirmed at depth 1.

A confirmed run: the outcome banner, the cross-service causal chain with source attribution, and the trace summary.

Runs are durable — they survive a reload, reattach live across tabs, and auto-park when nobody's watching so they don't burn tokens headless. When it confirms a cause, one click applies it back to the RCA report — operator-gated and reversible, with a banner preserving the original cause.

Deep Investigation ships gated off (agent.autonomousInvestigationEnabled: false) while it finishes live validation. The full move-loop design — every guard, the operator-pause sequence, the false-confirm cross-service guard — is in Autonomous Orchestrator — the Agentic Loop.

Triggers

Investigations can start four ways.

1. Operator (Web UI or CLI)

Type a message like admin-task is returning 500 errors since 4pm. The intent router detects an incident report and launches the full investigation pipeline with your message as context. Results stream back in real time.

2. Alert Webhook (Alertmanager)

Generate a bearer token in Settings → Alert Webhooks, then point Alertmanager at the webhook URL shown for that stack. The default stack uses POST /api/webhook/alert; non-default stacks use POST /api/webhook/alert/<stack-slug>. The handler validates the stack-scoped bearer token, deduplicates recent alerts, extracts the service, severity, and labels, merges them with the service's known metrics and log selectors from services.yaml, and runs a headless investigation.

```yaml
# alertmanager.yml
receivers:
  - name: alert-assistant
    webhook_configs:
      - url: http://assistant:3000/api/webhook/alert/<stack-slug>
        http_config:
          authorization:
            type: Bearer
            credentials: "<your-token>"
```
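On the receiving side, the first steps of the handler (route selection, bearer-token extraction, deduplication) might look something like the sketch below. Only the two URL paths come from this README; `webhookPath`, `bearerTokenFrom`, `alertFingerprint`, `AlertDeduper`, and the label-based fingerprint are assumptions for illustration.

```typescript
// Hypothetical sketch: helper names and the dedup strategy are illustrative.

/** Default stack uses the bare path; other stacks append their slug. */
function webhookPath(stackSlug?: string): string {
  return stackSlug
    ? `/api/webhook/alert/${encodeURIComponent(stackSlug)}`
    : "/api/webhook/alert";
}

/** Extracts the token from an "Authorization: Bearer <token>" header value. */
function bearerTokenFrom(header: string | undefined): string | null {
  const match = /^Bearer\s+(\S+)$/i.exec(header ?? "");
  return match?.[1] ?? null;
}

/** Stable fingerprint: label pairs sorted by key, so label order does not matter. */
function alertFingerprint(labels: Record<string, string>): string {
  return Object.keys(labels)
    .sort()
    .map((key) => `${key}=${labels[key]}`)
    .join(",");
}

/** Suppresses alerts whose fingerprint was already seen within the window. */
class AlertDeduper {
  private readonly seen = new Map<string, number>();
  private readonly windowMs: number;

  constructor(windowMs: number) {
    this.windowMs = windowMs;
  }

  /** Returns true when the alert should be processed, false when it is a recent duplicate. */
  shouldProcess(fingerprint: string, nowMs: number): boolean {
    const last = this.seen.get(fingerprint);
    if (last !== undefined && nowMs - last < this.windowMs) return false;
    this.seen.set(fingerprint, nowMs);
    return true;
  }
}
```

A production handler would also compare the extracted token against the stored one in constant time; that step is left out here.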

Investigation depth (quick / standard / full) can be mapped per severity in config.
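As an illustration of that mapping, a lookup from alert severity to investigation depth could be as small as the sketch below. The severity names, the chosen depths, and the `depthFor` helper are assumptions; the real keys live in config.yaml and are documented in config.yaml.example.

```typescript
// Hypothetical sketch: severity names and the mapping itself are illustrative.
type Depth = "quick" | "standard" | "full";

const depthBySeverity = new Map<string, Depth>([
  ["critical", "full"],
  ["warning", "standard"],
  ["info", "quick"],
]);

/** Unknown or missing severities fall back to a default depth. */
function depthFor(severity: string | undefined, fallback: Depth = "standard"): Depth {
  return depthBySeverity.get((severity ?? "").toLowerCase()) ?? fallback;
}
```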

3. Health Poller

A background poller queries Prometheus every 60 seconds for deployment replica counts and up metrics. When a service transitions from healthy to down it auto-fires a quick investigation. No extra config — the poller uses the service registry and each service's known selectors.
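The transition detection described here, firing only on a healthy-to-down edge, can be sketched as below. `HealthSample`, `classify`, and `healthyToDown` are hypothetical names, and treating "up and at least one ready replica" as healthy is an assumption about how the two signals are combined.

```typescript
// Hypothetical sketch: the health rule and names are illustrative.
type Health = "healthy" | "down";

interface HealthSample {
  service: string;
  up: boolean; // e.g. derived from the Prometheus `up` metric
  readyReplicas: number; // e.g. derived from deployment replica counts
}

function classify(sample: HealthSample): Health {
  return sample.up && sample.readyReplicas > 0 ? "healthy" : "down";
}

/**
 * Compares this tick's samples with the previous state, returns the services
 * that just went healthy -> down, and updates the state in place.
 */
function healthyToDown(previous: Map<string, Health>, samples: HealthSample[]): string[] {
  const fired: string[] = [];
  for (const sample of samples) {
    const now = classify(sample);
    if (previous.get(sample.service) === "healthy" && now === "down") {
      fired.push(sample.service);
    }
    previous.set(sample.service, now);
  }
  return fired;
}
```

Because only the edge fires, a service that stays down across many 60-second ticks triggers a single investigation, and a service first seen while already down triggers none in this sketch.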

4. Proactive Scan

A cron-scheduled probe walks every service on the cadence you set and evaluates four tracks of rules:

  1. Global rules — stack-wide availability written by the discovery agent, aware of whichever label key your stack actually uses (app, service, job, deployment)
  2. Per-service metric rules — discovery-written thresholds like pod-restart storms, using each service's real Kubernetes namespace
  3. Per-service log rules — LogQL count_over_time(... |= error) scoped to the service's real Loki labels
  4. Config-file defaults — hardcoded fallback rules from config.yaml when discovery hasn't run

Each rule has hysteresis (consecutive-tick counters) so a single flap doesn't fire a scan. When a rule trips, the probe spawns a headless investigation and the scan run lands on the Operations Desk with status, phase breakdown, and links to each child investigation.
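The consecutive-tick hysteresis can be sketched with a per-rule streak counter. The class name `HysteresisCounter`, the threshold parameter, and the choice to fire exactly once when the streak reaches the threshold are assumptions made for this example.

```typescript
// Hypothetical sketch: fires once when a rule has breached N ticks in a row.
class HysteresisCounter {
  private readonly streaks = new Map<string, number>();
  private readonly threshold: number;

  constructor(threshold: number) {
    this.threshold = threshold;
  }

  /** Records one tick for a rule; returns true only on the tick that reaches the threshold. */
  observe(ruleId: string, breached: boolean): boolean {
    if (!breached) {
      // Any healthy tick resets the streak, so a single flap never accumulates.
      this.streaks.delete(ruleId);
      return false;
    }
    const streak = (this.streaks.get(ruleId) ?? 0) + 1;
    this.streaks.set(ruleId, streak);
    return streak === this.threshold;
  }
}
```

With a threshold of 3, the sequence breach, ok, breach, breach does not fire, while three breaches in a row fire on the third tick.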

Every tick creates a durable ScanRun record at /scan/runs/:id — copy the link, download as PNG or Markdown, or fire it to Slack with one click.

A scan run that dispatched two investigations — the 3-phase probe/triage/investigate breakdown is preserved forever.

| Trigger | Context | Depth | Requires |
|---|---|---|---|
| Operator | High (natural language + time refs) | Configurable | Nothing extra |
| Alert webhook | Medium (alert labels + service config) | Per-severity template | Alertmanager config |
| Health poller | Medium (transition info + service config) | Quick | Prometheus provider |
| K8s event poller | Medium (pod restart + reason + service config) | Standard | Infrastructure (k8s) MCP provider |
| Proactive scan | Medium (rule trigger + service config) | Configurable per rule | scan.enabled: true |

Investigations

Every investigation gets a shareable URL, a phase rail that streams live, and a structured RCA report with root cause, contributing factors, timeline, evidence (metrics + logs + infra + changes), and recommended actions.

Investigation detail — RCA report with timeline + evidence

Activity

Everything that happens in the system — investigations, scan runs, learned patterns, and lifecycle events — lives under a single /activity route, split into four tabs that share a filter-bar idiom (search, severity pills, status toggles, time-window shortcuts, URL-driven state).

Investigations — every investigation, filterable by severity / status / service / time. URL-driven filters mean bookmarks and browser history Just Work.
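URL-driven filter state usually comes down to a pair of functions that convert between a filter object and a query string. The sketch below uses the standard `URLSearchParams` API; the parameter names (`q`, `severity`, `status`, `window`) and the `ActivityFilters` shape are assumptions, not the app's actual query schema.

```typescript
// Hypothetical sketch: parameter names are illustrative, not the app's real schema.
interface ActivityFilters {
  q: string;
  severity: string[];
  status: string[];
  window: string | null;
}

/** Reads filter state from a query string such as "?q=checkout&severity=critical". */
function parseFilters(search: string): ActivityFilters {
  const params = new URLSearchParams(search);
  return {
    q: params.get("q") ?? "",
    severity: params.getAll("severity"),
    status: params.getAll("status"),
    window: params.get("window"),
  };
}

/** Writes filter state back to a query string, omitting empty values. */
function serializeFilters(filters: ActivityFilters): string {
  const params = new URLSearchParams();
  if (filters.q) params.set("q", filters.q);
  for (const s of filters.severity) params.append("severity", s);
  for (const s of filters.status) params.append("status", s);
  if (filters.window) params.set("window", filters.window);
  return params.toString();
}
```

Because the URL is the single source of truth, bookmarking a filtered view or pressing Back restores exactly the filters that were active.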

Activity → Investigations — filter bar, severity pills, severity-striped rows

Scans — every probe tick the scheduler ran, with trigger (cron / manual / webhook), status, hits dispatched, and a deep link to each run's Probe → Triage → Investigate breakdown. Click into a run from anywhere it's referenced.

Activity → Scans — paginated history with trigger / hits / status filters

Patterns — the learned-pattern catalog. Every confirmed RCA contributes to a service's pattern library, scoped by severity + service, with drill-down to the source investigation that taught it.

Activity → Patterns — learned-pattern catalog with severity + service filters

Events — the persistent system feed. Investigation lifecycle (investigation_started / _completed / _failed), alert webhooks, scan-run completions, manual scan triggers, and provider health crossings — all backed by a 30-day retention window and filterable by kind / severity / service / time.

Activity → Events — persistent feed across all four trigger sources

Notifications

Every completed investigation can be delivered to Slack and email. Recipients are filtered independently on two axes — minimum severity and allowed trigger source — so each inbox only sees what it wants.
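The two-axis recipient filter might be expressed as below. The severity scale (`info` < `warning` < `critical`), the source identifiers, and the `Recipient` shape are assumptions chosen for the example rather than the project's actual schema.

```typescript
// Hypothetical sketch: severity levels, source names, and shapes are illustrative.
const SEVERITY_ORDER = ["info", "warning", "critical"] as const;
type Severity = (typeof SEVERITY_ORDER)[number];
type TriggerSource = "operator" | "webhook" | "health_poller" | "scan";

interface Recipient {
  name: string;
  minSeverity: Severity;
  sources: TriggerSource[];
}

interface CompletedInvestigation {
  severity: Severity;
  source: TriggerSource;
}

/** A recipient is notified only if both axes match: severity floor and allowed source. */
function shouldNotify(recipient: Recipient, inv: CompletedInvestigation): boolean {
  const severeEnough =
    SEVERITY_ORDER.indexOf(inv.severity) >= SEVERITY_ORDER.indexOf(recipient.minSeverity);
  return severeEnough && recipient.sources.includes(inv.source);
}

function recipientsFor(all: Recipient[], inv: CompletedInvestigation): string[] {
  return all.filter((r) => shouldNotify(r, inv)).map((r) => r.name);
}
```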

Slack (via incoming webhook): per-investigation summary posts, plus optional run-level scan summaries (always / hits-only / off).

Email (via SMTP): Teams-safe HTML body that renders the full RCA report — severity banner, summary, root cause with confidence, contributing factors, timeline, evidence (metrics + logs + infra + changes), recommended actions, and a deep link back to the investigation. Plain-text fallback included. Works with Microsoft Teams channel email addresses.

Manage recipients at Settings → Notifications in the UI — add, edit, toggle, and send a fixture RCA through the real pipeline with the per-row Test button.

Settings — Notifications tab with Slack + email

Operations Desk

The default landing page is a live SOC-style console: health strip, service catalog with status chips, investigation log, recent scan runs with a one-click Scan now button, and an event stream rail. Drilling into a service opens a tabbed detail view (metrics, history, dependencies, AI brief).

Try It

Live demo: wz.github.io/dops-assistant — fully interactive, hosted on GitHub Pages, no signup. Click into any investigation, browse the service catalog, drill into a scan run. Open the checkout-api incident to watch a completed Deep Investigation replay its move log and cross-service causal chain. Mutations are disabled (it's static), but every read path is real.

Want to run the same thing yourself?

Locally (Node server):

```sh
npm install
npm run seed:demo           # writes fixture data to data-demo/
npm run demo                # boots with DEMO_MODE=true on port 3000
```

As a static site (GitHub Pages — zero infra, zero cost):

```sh
npm run build:demo-static   # SPA + seed + static JSON snapshots
npx serve dist/web --single # any static server works
```

The static build is what the deploy-demo GitHub Actions workflow ships to Pages — see demo/README.md for the one-time setup (a single toggle: Settings → Pages → Source: GitHub Actions). All mutating endpoints are disabled, no LLM calls are made, and no real infrastructure is touched.

Deployment

  • Docker — single image, mount config.yaml and services.yaml, pass OPENAI_API_KEY
  • Helm — chart at deploy/helm/dops-assistant. Supports sub-path ingress via APP_BASE_PATH, SMTP creds via extraEnvFrom on an existing Secret, and ingress WebSocket timeout annotations for the ~60s LLM silent-thinking phase
  • Process manager: npm run build:web && npm run web behind systemd, pm2, or your stack of choice

Documentation

Development

```sh
npm run web                 # web server (loads dev/.env, port 3000)
npm run cli                 # terminal REPL
npm run build:web           # build frontend (Vite → dist/web/)
npm run discover            # run AI service discovery
npm run test:discover-eval  # score discovery output quality (CI gates at 75/100)
npx tsx src/eval/rca-eval.ts   # score RCA report quality
npx vitest run              # run tests (100+ files)
npx tsc --noEmit            # type check
```

Contributing

Contributions welcome. Please open an issue first to discuss what you'd like to change.

License

MIT
