fix: batch recovery sweep updates and add LIMIT to sweep queries#825
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughThe scheduler recovery sweep now uses bounded bulk recovery. ChangesBulk Recovery Sweep
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Terraform Validation Results
Pushed by: @ColeMurray, Action: |
There was a problem hiding this comment.
Summary
Reviewed PR #825, fix: batch recovery sweep updates and add LIMIT to sweep queries, by ColeMurray. The change is a solid direction for reducing D1 round-trips in recovery sweeps across 6 files (+296/-34), but I found one blocking error-path issue in the new bulk failure flow.
Critical Issues
- [Correctness]
packages/control-plane/src/scheduler/durable-object.ts:278- The twobulkFailRunscalls are separate writes inside a singletry; if the first category succeeds and the second throws, the method returns before incrementing failure counts for rows already marked failed. Those rows will not be recovered again on the next tick, so failure accounting and auto-pause can be permanently skipped for that partial-success batch.
Suggestions
- [Maintainability]
packages/control-plane/src/db/automation-store.ts:502- Consider adding an explicitORDER BY created_at/ORDER BY started_atbeforeLIMITso recovery order is deterministic and not dependent on planner/index behavior.
Positive Feedback
- The new tests cover the main bulk recovery happy path, auto-pause threshold behavior, and swallowed bulk tracking errors.
- Limiting sweep reads and batching writes directly addresses the D1 timeout feedback loop described in the PR.
Verification
- Ran
npm test -w @open-inspect/control-plane -- src/db/automation-store.test.ts src/scheduler/durable-object.test.ts: 53 tests passed.
Verdict
Request Changes: please fix the partial bulk-failure accounting case before merge.
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/control-plane/src/db/automation-store.ts`:
- Around line 502-518: The bounded recovery queries in AutomationStore should
drain the oldest stuck runs first instead of selecting an arbitrary capped set.
Update getOrphanedStartingRuns and getTimedOutRunningRuns so the SQL referenced
by AutomationStore.ORPHANED_STARTING_RUNS_SQL and
AutomationStore.TIMED_OUT_RUNNING_RUNS_SQL includes an explicit ORDER BY on the
age/timestamp column before the LIMIT ? is applied. Keep the existing cutoff
bind logic, but ensure the ordering makes the sweep process oldest rows first.
- Around line 371-394: The readback query in bulkIncrementFailures is not
aligned with the UPDATE because it returns soft-deleted automations that were
intentionally skipped. Update the SELECT in bulkIncrementFailures to also filter
on deleted_at IS NULL, matching the UPDATE condition so recoverySweep only
receives currently active automations and does not act on stale
consecutive_failures values.
In `@packages/control-plane/src/scheduler/durable-object.ts`:
- Around line 278-324: Do not return immediately from the recovery sweep when
one bulkFailRuns call fails in durable-object.ts; the current try/catch around
the orphaned and timedOut failures prevents bulkIncrementFailures from running
after a partial success. Update the recovery flow in the scheduler recovery
logic so each bulkFailRuns call is handled independently (or otherwise track
partial success) and still accumulate failure counts for any runs that were
successfully marked failed, then keep the existing
bulkIncrementFailures/autoPause path using the affected runs from the sweep.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0940dcbe-15ea-4e46-83ea-4a5211ba9f96
📒 Files selected for processing (6)
packages/control-plane/src/db/automation-store.test.tspackages/control-plane/src/db/automation-store.tspackages/control-plane/src/scheduler/durable-object.test.tspackages/control-plane/src/scheduler/durable-object.tspackages/control-plane/test/integration/automation-store.test.tspackages/control-plane/test/integration/scheduler.test.ts
- Track recoveredRuns per bulkFailRuns call so partial success (orphaned succeeds, timed-out throws) still increments failure counts for the succeeded category instead of returning early - Add ORDER BY created_at/started_at ASC before LIMIT so bounded sweeps drain oldest stuck runs first instead of an arbitrary set - Add deleted_at IS NULL to bulkIncrementFailures readback SELECT to match the UPDATE filter and avoid acting on stale counts
Terraform Validation Results
Pushed by: @open-inspect[bot], Action: |
CI typecheck failed: src/scheduler/durable-object.ts(277,26): error TS2552: Cannot find name 'AutomationRunRow'. The type was used for the recoveredRuns array added in 039edae but not added to the existing automation-store import.
Terraform Validation Results
Pushed by: @open-inspect[bot], Action: |
Terraform Validation Results
Pushed by: @ColeMurray, Action: |
Summary
Addresses fix 2 from the D1 timeout debugging analysis: the recovery sweep ran unbounded queries and issued N×3 sequential D1 writes per tick, creating a feedback loop under load (timeouts → orphans → recovery write load → more timeouts).
LIMIT on recovery sweep queries
getOrphanedStartingRunsandgetTimedOutRunningRunsnow accept alimitparameter (defaulting toRECOVERY_SWEEP_LIMIT = 50per sweep type). The SQL constants (ORPHANED_STARTING_RUNS_SQL,TIMED_OUT_RUNNING_RUNS_SQL) are unchanged so EXPLAIN QUERY PLAN integration tests continue to pass —LIMIT ?is appended in the method body.Bulk operations replace per-run loop
The
recoverySweep()method was refactored:updateRun→incrementConsecutiveFailures(UPDATE+SELECT) → optionalautoPause)bulkFailRunscalls (one per sweep type) + 1bulkIncrementFailurescall = 3 D1 round-trips total, regardless of NNew
AutomationStoremethods:bulkFailRuns(runIds, reason, completedAt)— singleUPDATE ... WHERE id IN (...)marking all recovered runs as failed in one querybulkIncrementFailures(automationIdCounts)— batch UPDATE viadb.batch()(one transaction for all automations), then a single SELECT to read back updated counts for auto-pause threshold checks. Replaces N×(UPDATE+SELECT) with 2 round-tripsAdditional improvements:
Promise.all(they're independent)executionTimeoutMscomputation moved before the queries (was between them)bulkFailRunsfailure returns early (runs stay in starting/running state, retried next tick);bulkIncrementFailuresfailure is best-effort (runs already marked failed, counts may be under-counted — logged and swallowed, consistent with thefailRunAndTrackBestEffortpattern from fix 1)failRunAndTrack/failRunAndTrackBestEffortleft intact for their other call sites (tick inner catch, event catch, trigger catch, run-complete)Files changed
packages/control-plane/src/db/automation-store.ts—limitparam on sweep queries + newbulkFailRuns/bulkIncrementFailuresmethodspackages/control-plane/src/scheduler/durable-object.ts—RECOVERY_SWEEP_LIMITconstant + refactoredrecoverySweep()packages/control-plane/src/scheduler/durable-object.test.ts— updated 2 existing recovery tests + 4 new unit tests (batching, auto-pause via bulk increment, bulk_fail_error swallow, bulk_track_error swallow)packages/control-plane/test/integration/automation-store.test.ts— updated direct calls to passlimitparampackages/control-plane/test/integration/scheduler.test.ts— new integration test: bulk recovery of multiple orphaned runs for the same automation (verifies all runs fail +consecutive_failuresincrements by 3)Verification
npm run typecheck— all packages passnpm test -w @open-inspect/control-plane— 1329 unit tests pass (30 scheduler tests, was 25)npm run test:integration -w @open-inspect/control-plane— 371 pass, 2-3 pre-existing flaky failures (session-lifecycle, websocket-client, websocket-sandbox) confirmed unrelated to scheduler/D1 changes — all scheduler and automation-store integration tests pass including the new bulk recovery testCreated with Open-Inspect
Summary by CodeRabbit