Skip to content

fix: batch recovery sweep updates and add LIMIT to sweep queries#825

Merged
ColeMurray merged 4 commits into
mainfrom
fix/scheduler-recovery-sweep-batch-and-limit
Jun 24, 2026
Merged

fix: batch recovery sweep updates and add LIMIT to sweep queries#825
ColeMurray merged 4 commits into
mainfrom
fix/scheduler-recovery-sweep-batch-and-limit

Conversation

@ColeMurray

@ColeMurray ColeMurray commented Jun 24, 2026

Copy link
Copy Markdown
Owner

Summary

Addresses fix 2 from the D1 timeout debugging analysis: the recovery sweep ran unbounded queries and issued N×3 sequential D1 writes per tick, creating a feedback loop under load (timeouts → orphans → recovery write load → more timeouts).

LIMIT on recovery sweep queries

getOrphanedStartingRuns and getTimedOutRunningRuns now accept a limit parameter (defaulting to RECOVERY_SWEEP_LIMIT = 50 per sweep type). The SQL constants (ORPHANED_STARTING_RUNS_SQL, TIMED_OUT_RUNNING_RUNS_SQL) are unchanged so EXPLAIN QUERY PLAN integration tests continue to pass — LIMIT ? is appended in the method body.

Bulk operations replace per-run loop

The recoverySweep() method was refactored:

  • Before: N runs × 2-3 sequential D1 queries each (updateRunincrementConsecutiveFailures (UPDATE+SELECT) → optional autoPause)
  • After: 2 bulkFailRuns calls (one per sweep type) + 1 bulkIncrementFailures call = 3 D1 round-trips total, regardless of N

New AutomationStore methods:

  • bulkFailRuns(runIds, reason, completedAt) — single UPDATE ... WHERE id IN (...) marking all recovered runs as failed in one query
  • bulkIncrementFailures(automationIdCounts) — batch UPDATE via db.batch() (one transaction for all automations), then a single SELECT to read back updated counts for auto-pause threshold checks. Replaces N×(UPDATE+SELECT) with 2 round-trips

Additional improvements:

  • Both sweep queries now run in parallel via Promise.all (they're independent)
  • executionTimeoutMs computation moved before the queries (was between them)
  • Error handling: bulkFailRuns failure returns early (runs stay in starting/running state, retried next tick); bulkIncrementFailures failure is best-effort (runs already marked failed, counts may be under-counted — logged and swallowed, consistent with the failRunAndTrackBestEffort pattern from fix 1)
  • failRunAndTrack / failRunAndTrackBestEffort left intact for their other call sites (tick inner catch, event catch, trigger catch, run-complete)

Files changed

  • packages/control-plane/src/db/automation-store.tslimit param on sweep queries + new bulkFailRuns / bulkIncrementFailures methods
  • packages/control-plane/src/scheduler/durable-object.tsRECOVERY_SWEEP_LIMIT constant + refactored recoverySweep()
  • packages/control-plane/src/scheduler/durable-object.test.ts — updated 2 existing recovery tests + 4 new unit tests (batching, auto-pause via bulk increment, bulk_fail_error swallow, bulk_track_error swallow)
  • packages/control-plane/test/integration/automation-store.test.ts — updated direct calls to pass limit param
  • packages/control-plane/test/integration/scheduler.test.ts — new integration test: bulk recovery of multiple orphaned runs for the same automation (verifies all runs fail + consecutive_failures increments by 3)

Verification

  • npm run typecheck — all packages pass
  • npm test -w @open-inspect/control-plane — 1329 unit tests pass (30 scheduler tests, was 25)
  • npm run test:integration -w @open-inspect/control-plane — 371 pass, 2-3 pre-existing flaky failures (session-lifecycle, websocket-client, websocket-sandbox) confirmed unrelated to scheduler/D1 changes — all scheduler and automation-store integration tests pass including the new bulk recovery test
  • ESLint + Prettier clean

Created with Open-Inspect

Summary by CodeRabbit

  • New Features
    • Scheduler recovery sweeps now process orphaned and timed-out runs in bounded, oldest-first batches per tick.
    • Bulk recovery updates multiple runs together, ensuring consistent consecutive-failure counting and auto-pausing once the configured threshold is reached.
  • Bug Fixes
    • Recovery now caps how many runs are selected each tick, preventing unexpected over-processing.
    • Failure tracking and auto-pausing are more resilient: errors are logged per recovery category and won’t block other recovery paths.
  • Tests
    • Added/updated coverage for limit-and-order selection and bulk recovery behavior (including multi-run and error scenarios).

@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 25bbd628-ecc1-4f38-9796-4841bbbc5713

📥 Commits

Reviewing files that changed from the base of the PR and between 42bbf91 and e9ddb69.

📒 Files selected for processing (2)
  • packages/control-plane/src/scheduler/durable-object.test.ts
  • packages/control-plane/src/scheduler/durable-object.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • packages/control-plane/src/scheduler/durable-object.test.ts
  • packages/control-plane/src/scheduler/durable-object.ts

📝 Walkthrough

Walkthrough

The scheduler recovery sweep now uses bounded bulk recovery. AutomationStore adds bulk failure and bulk increment methods, the recovery queries accept a limit, and scheduler and integration tests cover the new bulk paths and ordering.

Changes

Bulk Recovery Sweep

Layer / File(s) Summary
AutomationStore bulk methods and bounded queries
packages/control-plane/src/db/automation-store.ts, packages/control-plane/src/db/automation-store.test.ts
Adds bulkFailRuns and bulkIncrementFailures, and updates getOrphanedStartingRuns and getTimedOutRunningRuns to accept a limit and apply bounded ordering in SQL. Unit tests pass 50 as the new limit argument.
recoverySweep bulk rewrite with RECOVERY_SWEEP_LIMIT
packages/control-plane/src/scheduler/durable-object.ts
Adds RECOVERY_SWEEP_LIMIT = 50 and rewrites recoverySweep to fetch both recovery categories concurrently, bulk-fail each category, bulk-increment failure counts, and auto-pause automations that reach the threshold.
Scheduler unit tests for bulk recovery
packages/control-plane/src/scheduler/durable-object.test.ts
Extends the store mock with bulk methods, updates existing recovery assertions to use bulk APIs, and adds tests for batching, auto-pause logging, and bulk-fail and bulk-track error handling.
Integration coverage for bounded recovery
packages/control-plane/test/integration/automation-store.test.ts, packages/control-plane/test/integration/scheduler.test.ts
Passes the new limit argument to recovery queries, adds ordering assertions for bounded sweep results, and adds an integration test that recovers three orphaned runs in one tick.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • ColeMurray/background-agents#789: Modifies the same getOrphanedStartingRuns and getTimedOutRunningRuns query methods in automation-store.ts, making it directly related to the limit changes here.

Poem

🐇 I hopped through the sweep with a limit of fifty,
Bulk-failing the stragglers, tidy and steady.
One tick, many runs, and the counts all align,
The scheduler hums softly — that rhythm is mine.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly captures the main change: batching recovery sweep updates and adding limits to sweep queries.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/scheduler-recovery-sweep-batch-and-limit

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@github-actions

Copy link
Copy Markdown

Terraform Validation Results

Step Status
Format
Init
Validate

Note: Terraform plan was skipped because secrets are not configured. This is expected for external contributors. See docs/GETTING_STARTED.md for setup instructions.

Pushed by: @ColeMurray, Action: pull_request

Comment thread packages/control-plane/src/scheduler/durable-object.ts Outdated

@open-inspect open-inspect Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deep maintainability review focused on recovery sweep structure and backpressure semantics.

Comment thread packages/control-plane/src/scheduler/durable-object.ts Outdated
Comment thread packages/control-plane/src/db/automation-store.ts Outdated
open-inspect[bot]
open-inspect Bot previously requested changes Jun 24, 2026

@open-inspect open-inspect Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

Reviewed PR #825, fix: batch recovery sweep updates and add LIMIT to sweep queries, by ColeMurray. The change is a solid direction for reducing D1 round-trips in recovery sweeps across 6 files (+296/-34), but I found one blocking error-path issue in the new bulk failure flow.

Critical Issues

  • [Correctness] packages/control-plane/src/scheduler/durable-object.ts:278 - The two bulkFailRuns calls are separate writes inside a single try; if the first category succeeds and the second throws, the method returns before incrementing failure counts for rows already marked failed. Those rows will not be recovered again on the next tick, so failure accounting and auto-pause can be permanently skipped for that partial-success batch.

Suggestions

  • [Maintainability] packages/control-plane/src/db/automation-store.ts:502 - Consider adding an explicit ORDER BY created_at / ORDER BY started_at before LIMIT so recovery order is deterministic and not dependent on planner/index behavior.

Positive Feedback

  • The new tests cover the main bulk recovery happy path, auto-pause threshold behavior, and swallowed bulk tracking errors.
  • Limiting sweep reads and batching writes directly addresses the D1 timeout feedback loop described in the PR.

Verification

  • Ran npm test -w @open-inspect/control-plane -- src/db/automation-store.test.ts src/scheduler/durable-object.test.ts: 53 tests passed.

Verdict

Request Changes: please fix the partial bulk-failure accounting case before merge.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/control-plane/src/db/automation-store.ts`:
- Around line 502-518: The bounded recovery queries in AutomationStore should
drain the oldest stuck runs first instead of selecting an arbitrary capped set.
Update getOrphanedStartingRuns and getTimedOutRunningRuns so the SQL referenced
by AutomationStore.ORPHANED_STARTING_RUNS_SQL and
AutomationStore.TIMED_OUT_RUNNING_RUNS_SQL includes an explicit ORDER BY on the
age/timestamp column before the LIMIT ? is applied. Keep the existing cutoff
bind logic, but ensure the ordering makes the sweep process oldest rows first.
- Around line 371-394: The readback query in bulkIncrementFailures is not
aligned with the UPDATE because it returns soft-deleted automations that were
intentionally skipped. Update the SELECT in bulkIncrementFailures to also filter
on deleted_at IS NULL, matching the UPDATE condition so recoverySweep only
receives currently active automations and does not act on stale
consecutive_failures values.

In `@packages/control-plane/src/scheduler/durable-object.ts`:
- Around line 278-324: Do not return immediately from the recovery sweep when
one bulkFailRuns call fails in durable-object.ts; the current try/catch around
the orphaned and timedOut failures prevents bulkIncrementFailures from running
after a partial success. Update the recovery flow in the scheduler recovery
logic so each bulkFailRuns call is handled independently (or otherwise track
partial success) and still accumulate failure counts for any runs that were
successfully marked failed, then keep the existing
bulkIncrementFailures/autoPause path using the affected runs from the sweep.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0940dcbe-15ea-4e46-83ea-4a5211ba9f96

📥 Commits

Reviewing files that changed from the base of the PR and between a6f0618 and 0fc5287.

📒 Files selected for processing (6)
  • packages/control-plane/src/db/automation-store.test.ts
  • packages/control-plane/src/db/automation-store.ts
  • packages/control-plane/src/scheduler/durable-object.test.ts
  • packages/control-plane/src/scheduler/durable-object.ts
  • packages/control-plane/test/integration/automation-store.test.ts
  • packages/control-plane/test/integration/scheduler.test.ts

Comment thread packages/control-plane/src/db/automation-store.ts
Comment thread packages/control-plane/src/db/automation-store.ts
Comment thread packages/control-plane/src/scheduler/durable-object.ts Outdated
- Track recoveredRuns per bulkFailRuns call so partial success (orphaned
  succeeds, timed-out throws) still increments failure counts for the
  succeeded category instead of returning early
- Add ORDER BY created_at/started_at ASC before LIMIT so bounded sweeps
  drain oldest stuck runs first instead of an arbitrary set
- Add deleted_at IS NULL to bulkIncrementFailures readback SELECT to
  match the UPDATE filter and avoid acting on stale counts
@github-actions

Copy link
Copy Markdown

Terraform Validation Results

Step Status
Format
Init
Validate

Note: Terraform plan was skipped because secrets are not configured. This is expected for external contributors. See docs/GETTING_STARTED.md for setup instructions.

Pushed by: @open-inspect[bot], Action: pull_request

CI typecheck failed: src/scheduler/durable-object.ts(277,26): error TS2552:
Cannot find name 'AutomationRunRow'. The type was used for the recoveredRuns
array added in 039edae but not added to the existing automation-store import.
@github-actions

Copy link
Copy Markdown

Terraform Validation Results

Step Status
Format
Init
Validate

Note: Terraform plan was skipped because secrets are not configured. This is expected for external contributors. See docs/GETTING_STARTED.md for setup instructions.

Pushed by: @open-inspect[bot], Action: pull_request

@github-actions

Copy link
Copy Markdown

Terraform Validation Results

Step Status
Format
Init
Validate

Note: Terraform plan was skipped because secrets are not configured. This is expected for external contributors. See docs/GETTING_STARTED.md for setup instructions.

Pushed by: @ColeMurray, Action: pull_request

@ColeMurray ColeMurray merged commit 1e99113 into main Jun 24, 2026
18 checks passed
@ColeMurray ColeMurray deleted the fix/scheduler-recovery-sweep-batch-and-limit branch June 24, 2026 07:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant