fix(extensions): re-attempt transient BundleBuildFailed instead of pinning it (#424)#1440
Conversation
…nning it (#424) A `BundleBuildFailed` catalog row carries an empty `bundle_path`, so in `findStaleFiles` it slipped past the missing-bundle staleness gate when its source fingerprint still matched — and was never re-bundled. A `swamp serve` daemon that first loaded an npm-dependent extension on a cold deno cache with no network (e.g. a fresh deploy) pinned the type as `BundleBuildFailed` and failed every scheduled run with `Unknown model type`, permanently — across restarts and even after connectivity returned, while CLI runs that warmed a good bundle still worked. Treat a fingerprint-matched `BundleBuildFailed` as stale so the next scan re-attempts the bundle and recovers once conditions change. Deterministic `ValidationFailed` stays excluded — re-bundling it only thrashes. The cold-start ReconcileFromDisk path is intentionally unchanged: a failed bundle leaves the catalog populated, so daemon-restart recovery routes through this warm path. Verified against an isolated systemd-run reproduction: a daemon pinned offline now recovers to `Indexed` on the next online restart. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The vault loader test asserted end-to-end recovery of a BundleBuildFailed row through the loader's warm-start sourcePathIndex re-bundle path, using two unrelated temp dirs. That path is sensitive to Windows path canonicalization between unrelated directories and failed on windows-latest. The fix itself (findStaleFiles treating a fingerprint-matched BundleBuildFailed as stale) is covered directly and portably by the new bundle_freshness_test.ts unit test, which passes on all platforms. Its prior assertion also encoded the old pinning behavior, so it could not simply stay. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
Clean, well-scoped fix for a real bug. The root cause analysis is thorough — a BundleBuildFailed row with an empty bundle_path and unchanged fingerprint slipped through both the fingerprint-mismatch gate and the missing-bundle gate, permanently pinning the failure.
Blocking Issues
None.
Suggestions
- The inline comment block at
bundle_freshness.ts:312-318(7 lines) is on the verbose side for this codebase. The first sentence and theBundleBuildFailed-vs-ValidationFaileddistinction are load-bearing; the cold-start/ReconcileFromDisk detail could live in the design doc alone (which it already does). Minor — not blocking.
What looks good
- Fix placement: The new
BundleBuildFailedcheck is correctly positioned after the fingerprint match and before thebundle_path/bundleExistscheck, which is exactly where the bug lived. - Distinction preserved:
ValidationFailed(deterministic) stays excluded from retry;BundleBuildFailed(transient) is now retried. The existingValidationFailed → not staletest guards this boundary. - Test: New unit test directly exercises the exact scenario (matching fingerprint + empty bundle_path +
BundleBuildFailedstate). The removed vault-loader integration test was OS-fragile and is now replaced by a portable, more focused test at the right abstraction layer. - Design doc updated: The warm-start semantics in
design/extension.mdare updated to reflect the new behavior. - DDD: Change is entirely within the domain layer (
src/domain/extensions/), touching only the staleness determination logic where it belongs. No layer violations. - Minimal blast radius: 4 files changed, the fix itself is a single 4-line conditional.
There was a problem hiding this comment.
Adversarial Review
Critical / High
None.
Medium
None.
Low
-
src/domain/extensions/bundle_freshness_test.ts:608— no Windows EBUSY.catch()on cleanup.
The deleted vault loader test usedDeno.remove(...).catch(() => {})on Windows because it held a SQLite handle. The new freshness test only usesFakeCatalog(in-memory) and text files, so EBUSY is not a realistic concern — and the pattern is consistent with every other test in this file. Noting for completeness only. -
src/domain/extensions/bundle_freshness.ts:320— every warm-start scan re-attempts a persistently failingBundleBuildFailedentry.
If the underlying condition never resolves (e.g. a permanently broken npm dependency), every scan will re-invoke the bundler and fail again. This is bounded by scan frequency (not a tight retry loop) and the PR description explicitly acknowledges this tradeoff, so it's an acceptable design choice. If scan frequency ever increases, this could become noisy — but that's a future concern, not a current bug.
Verdict
PASS. Clean, minimal, well-placed fix. The new BundleBuildFailed check is inserted at exactly the right point in the decision chain — after fingerprint matching (so changed sources still take the normal path) and before the bundle-exists check (so the empty bundle_path on a failed row can't accidentally satisfy the old gate). The ValidationFailed exclusion is preserved correctly. The deleted vault loader test was asserting the old (now-incorrect) pinning behavior; its coverage is properly subsumed by the new unit test at the freshness layer. Logic, ordering, and edge cases check out.
Summary
swamp servewould permanently lose a custom (@user/*) model type after a single transient bundle failure. When the daemon first loaded an npm-dependent extension on a cold deno cache with no network (e.g. a fresh deploy, before any CLI run warmed.swamp/bundles/),deno bundlefailed and the catalog recorded a stickyBundleBuildFailedrow. Scheduled workflows then failed every run withUnknown model type: <type>— across daemon restarts and even after connectivity returned — while CLIswamp workflow runthat had warmed a good bundle still worked.Root cause: a
BundleBuildFailedrow carries an emptybundle_path, so infindStaleFilesit slipped past the missing-bundle staleness gate whenever its source fingerprint still matched, and was never re-bundled.Fix: treat a fingerprint-matched
BundleBuildFailedas stale so the next scan re-attempts the bundle and recovers once conditions change. This is not an aggressive retry loop — the next natural scan (restart / next load pass) fixes it. DeterministicValidationFailedstays excluded (re-bundling it only thrashes). The cold-startReconcileFromDiskpath is intentionally unchanged: a failed bundle leaves the catalog populated, so daemon-restart recovery routes through this warm path. The fix is shared across all extension kinds (model/vault/driver/report).Closes #424.
Test Plan
findStaleFiles: matching fingerprint + state=BundleBuildFailed → stale (#424); the existingValidationFailed → not staleregression guard still passes.deno check/lint/fmtclean.systemd-run -p PrivateNetwork=yesnamespace with a coldDENO_DIR: recreated the pinned state (BundleBuildFailed→Unknown model type), then restarted serve online — the daemon recovered on its own (Indexed, bundle rebuilt, scheduled run resolves the type).Out of scope
A separate
datastore-base-path-changeddivergence between serve startup and scheduled-execution contexts (surfaced during reproduction) causes an unnecessary catalog re-bundle on the first scheduled run of each startup. This fix makes that invalidation benign; the churn itself is a follow-up.🤖 Generated with Claude Code