Skip to content

feat(controller): observable vault scanner skip failures (spec 043)#14

Merged
bborbe merged 3 commits into
masterfrom
feature/vault-scanner-skip-alerts
Jun 1, 2026
Merged

feat(controller): observable vault scanner skip failures (spec 043)#14
bborbe merged 3 commits into
masterfrom
feature/vault-scanner-skip-alerts

Conversation

@bborbe
Copy link
Copy Markdown
Owner

@bborbe bborbe commented Jun 1, 2026

Summary

Restores Prometheus observability for vault task files silently skipped by the scanner. Closes the dedup safety net's silent failure mode that hid 24 broken files (4 of them human_review PR-review tasks) across the 2026-05-31 / 2026-06-01 incident.

  • New counter agent_controller_vault_scanner_skipped_files_total{reason} with five pre-initialised label values (invalid_frontmatter, duplicate_frontmatter_invalid, empty_status, inject_task_identifier_failed, read_failed); increments exactly once per skipped file per scan cycle.
  • Promoted four glog.Warningf("skipping …") sites to glog.Errorf so existing log-scrape alert routing surfaces the event. read_failed and inject_task_identifier_failed remain Warningf (counter alone is sufficient — both are rare, both are operator-actionable via the metric).
  • Spec: specs/in-progress/043-surface-vault-scanner-skip-failures.md. Doctrine advanced: [[Make Parked Agent Tasks Visible to Operator]]. Systematic-prevention half of [[Fix unresolved merge conflicts in OpenClaw vault]].

Test plan

  • Unit tests green — scanner pkg coverage 88.8%, metrics pkg 68.8%, all 40 specs pass
  • make precommit exits 0 (lint funlen suppressed with justification on Run and processFile)
  • Dev deploy via BRANCH=dev make buca — image SHA ef52275a… running on agent-task-controller-0, startup log reports version=v0.63.35-19-g3508338 commit=3508338
  • Post-merge: redeploy :dev from agent-dev worktree to realign with master; then deploy to prod
  • Operator-driven AC#13 — plant a test file with <<<<<<< markers in dev OpenClaw vault, observe invalid_frontmatter counter increment + glog.Errorf log line, then remove

bborbe added 3 commits June 1, 2026 22:37
…n on spec-043 funcs

- Remove task/controller/controller (43MB) and scanner.test (40MB) committed by prompt-195 container
- Extend .gitignore with *.test and explicit task/controller/controller / task/executor/executor entries
- Add //nolint:funlen on main.go Run() (+6 lines from metrics.New() wiring) and vault_scanner.go processFile() (+5 statements from spec-043 counter calls); extraction would split tightly-coupled wiring or duplicate per-site logic
- Pick up post-precommit formatter cleanups (gofmt const alignment in metrics.go, golines wrap in vault_scanner_internal_test.go) and the spec AC#6 awk regex fix from pre-approval review
Copy link
Copy Markdown

@ben-s-pull-request-reviewer ben-s-pull-request-reviewer Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review Summary

Branch: feature/vault-scanner-skip-alertsmaster
PR: #14

Code Review Findings

Correctness:

  • 4 glog.Warningf("skipping...") sites correctly promoted to glog.Errorf (lines 239, 245, 254, 299)
  • read_failed (line 229) and inject_task_identifier_failed (line 326) correctly remain at Warningf per spec
  • 6 counter increment calls matching 6 skip sites (5 unique reasons, with invalid_frontmatter used at 2 sites)
  • Counter pre-initialization in init() for all 5 label values (lines 201-209)
  • Constructor signature correctly accepts metrics.Metrics interface

Tests:

  • vault_scanner_test.go has comprehensive SkippedFilesTotal counter Context for all 5 reasons (lines 780-1059)
  • metrics_test.go verifies all 5 reason labels are pre-initialized (lines 98-110)
  • vault_scanner_internal_test.go tests inject_task_identifier_failed counter
  • AC#6 invariant test at line 1035 validates skip-site/counter-call parity

Documentation:

  • docs/controller-design.md updated with counter documentation (line 28)
  • CHANGELOG.md has unreleased entry with incident reference
  • Spec correctly references vault doctrine

No issues found.

{
  "verdict": "approve",
  "summary": "Implementation fully satisfies spec-043. The vault scanner now emits a labelled Prometheus counter (agent_controller_vault_scanner_skipped_files_total{reason}) at all 6 skip sites, with 4 operator-actionable log lines promoted from Warningf to Errorf. All 5 reason labels are pre-initialised to zero. Comprehensive Ginkgo test coverage validates per-reason increment behaviour and the skip-site/counter-call invariant.",
  "comments": [],
  "concerns_addressed": [
    "correctness: 4 glog.Warningf skip sites promoted to glog.Errorf — errorf at lines 239,245,254,299; read_failed and inject_task_identifier_failed correctly remain at Warningf per spec",
    "correctness: Counter increment logic is idempotent per scan cycle — hash check prevents double-counting for empty_status; re-process correctly re-increments for broken files",
    "correctness: 5 label values are a closed enum (constants), no high cardinality risk",
    "tests: All 5 skip reasons covered by Ginkgo Context blocks in vault_scanner_test.go and vault_scanner_internal_test.go",
    "performance: Counter pre-initialisation in init() uses Add(0) for 5 label values — negligible memory overhead"
  ]
}

@bborbe bborbe merged commit e34ee21 into master Jun 1, 2026
9 checks passed
@bborbe bborbe deleted the feature/vault-scanner-skip-alerts branch June 1, 2026 22:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant