fix: batch recovery sweep updates and add LIMIT to sweep queries by ColeMurray · Pull Request #825 · ColeMurray/background-agents

ColeMurray · 2026-06-24T06:45:58Z

Summary

Addresses fix 2 from the D1 timeout debugging analysis: the recovery sweep ran unbounded queries and issued N×3 sequential D1 writes per tick, creating a feedback loop under load (timeouts → orphans → recovery write load → more timeouts).

LIMIT on recovery sweep queries

getOrphanedStartingRuns and getTimedOutRunningRuns now accept a limit parameter (defaulting to RECOVERY_SWEEP_LIMIT = 50 per sweep type). The SQL constants (ORPHANED_STARTING_RUNS_SQL, TIMED_OUT_RUNNING_RUNS_SQL) are unchanged so EXPLAIN QUERY PLAN integration tests continue to pass — LIMIT ? is appended in the method body.

Bulk operations replace per-run loop

The recoverySweep() method was refactored:

Before: N runs × 2-3 sequential D1 queries each (updateRun → incrementConsecutiveFailures (UPDATE+SELECT) → optional autoPause)
After: 2 bulkFailRuns calls (one per sweep type) + 1 bulkIncrementFailures call = 3 D1 round-trips total, regardless of N

New AutomationStore methods:

bulkFailRuns(runIds, reason, completedAt) — single UPDATE ... WHERE id IN (...) marking all recovered runs as failed in one query
bulkIncrementFailures(automationIdCounts) — batch UPDATE via db.batch() (one transaction for all automations), then a single SELECT to read back updated counts for auto-pause threshold checks. Replaces N×(UPDATE+SELECT) with 2 round-trips

Additional improvements:

Both sweep queries now run in parallel via Promise.all (they're independent)
executionTimeoutMs computation moved before the queries (was between them)
Error handling: bulkFailRuns failure returns early (runs stay in starting/running state, retried next tick); bulkIncrementFailures failure is best-effort (runs already marked failed, counts may be under-counted — logged and swallowed, consistent with the failRunAndTrackBestEffort pattern from fix 1)
failRunAndTrack / failRunAndTrackBestEffort left intact for their other call sites (tick inner catch, event catch, trigger catch, run-complete)

Files changed

packages/control-plane/src/db/automation-store.ts — limit param on sweep queries + new bulkFailRuns / bulkIncrementFailures methods
packages/control-plane/src/scheduler/durable-object.ts — RECOVERY_SWEEP_LIMIT constant + refactored recoverySweep()
packages/control-plane/src/scheduler/durable-object.test.ts — updated 2 existing recovery tests + 4 new unit tests (batching, auto-pause via bulk increment, bulk_fail_error swallow, bulk_track_error swallow)
packages/control-plane/test/integration/automation-store.test.ts — updated direct calls to pass limit param
packages/control-plane/test/integration/scheduler.test.ts — new integration test: bulk recovery of multiple orphaned runs for the same automation (verifies all runs fail + consecutive_failures increments by 3)

Verification

npm run typecheck — all packages pass
npm test -w @open-inspect/control-plane — 1329 unit tests pass (30 scheduler tests, was 25)
npm run test:integration -w @open-inspect/control-plane — 371 pass, 2-3 pre-existing flaky failures (session-lifecycle, websocket-client, websocket-sandbox) confirmed unrelated to scheduler/D1 changes — all scheduler and automation-store integration tests pass including the new bulk recovery test
ESLint + Prettier clean

Created with Open-Inspect

Summary by CodeRabbit

New Features
- Scheduler recovery sweeps now process orphaned and timed-out runs in bounded, oldest-first batches per tick.
- Bulk recovery updates multiple runs together, ensuring consistent consecutive-failure counting and auto-pausing once the configured threshold is reached.
Bug Fixes
- Recovery now caps how many runs are selected each tick, preventing unexpected over-processing.
- Failure tracking and auto-pausing are more resilient: errors are logged per recovery category and won’t block other recovery paths.
Tests
- Added/updated coverage for limit-and-order selection and bulk recovery behavior (including multi-run and error scenarios).

coderabbitai · 2026-06-24T06:46:15Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 25bbd628-ecc1-4f38-9796-4841bbbc5713

📥 Commits

Reviewing files that changed from the base of the PR and between 42bbf91 and e9ddb69.

📒 Files selected for processing (2)

packages/control-plane/src/scheduler/durable-object.test.ts
packages/control-plane/src/scheduler/durable-object.ts

🚧 Files skipped from review as they are similar to previous changes (2)

packages/control-plane/src/scheduler/durable-object.test.ts
packages/control-plane/src/scheduler/durable-object.ts

📝 Walkthrough

Walkthrough

The scheduler recovery sweep now uses bounded bulk recovery. AutomationStore adds bulk failure and bulk increment methods, the recovery queries accept a limit, and scheduler and integration tests cover the new bulk paths and ordering.

Changes

Bulk Recovery Sweep

Layer / File(s)	Summary
AutomationStore bulk methods and bounded queries `packages/control-plane/src/db/automation-store.ts`, `packages/control-plane/src/db/automation-store.test.ts`	Adds `bulkFailRuns` and `bulkIncrementFailures`, and updates `getOrphanedStartingRuns` and `getTimedOutRunningRuns` to accept a limit and apply bounded ordering in SQL. Unit tests pass `50` as the new limit argument.
recoverySweep bulk rewrite with `RECOVERY_SWEEP_LIMIT` `packages/control-plane/src/scheduler/durable-object.ts`	Adds `RECOVERY_SWEEP_LIMIT = 50` and rewrites `recoverySweep` to fetch both recovery categories concurrently, bulk-fail each category, bulk-increment failure counts, and auto-pause automations that reach the threshold.
Scheduler unit tests for bulk recovery `packages/control-plane/src/scheduler/durable-object.test.ts`	Extends the store mock with bulk methods, updates existing recovery assertions to use bulk APIs, and adds tests for batching, auto-pause logging, and bulk-fail and bulk-track error handling.
Integration coverage for bounded recovery `packages/control-plane/test/integration/automation-store.test.ts`, `packages/control-plane/test/integration/scheduler.test.ts`	Passes the new limit argument to recovery queries, adds ordering assertions for bounded sweep results, and adds an integration test that recovers three orphaned runs in one tick.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

ColeMurray/background-agents#789: Modifies the same getOrphanedStartingRuns and getTimedOutRunningRuns query methods in automation-store.ts, making it directly related to the limit changes here.

Poem

🐇 I hopped through the sweep with a limit of fifty,
Bulk-failing the stragglers, tidy and steady.
One tick, many runs, and the counts all align,
The scheduler hums softly — that rhythm is mine.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly captures the main change: batching recovery sweep updates and adding limits to sweep queries.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/scheduler-recovery-sweep-batch-and-limit

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

github-actions · 2026-06-24T06:46:17Z

Terraform Validation Results

Step	Status
Format	✅
Init	✅
Validate	✅

Note: Terraform plan was skipped because secrets are not configured. This is expected for external contributors. See docs/GETTING_STARTED.md for setup instructions.

Pushed by: @ColeMurray, Action: pull_request

open-inspect

Deep maintainability review focused on recovery sweep structure and backpressure semantics.

open-inspect

Summary

Reviewed PR #825, fix: batch recovery sweep updates and add LIMIT to sweep queries, by ColeMurray. The change is a solid direction for reducing D1 round-trips in recovery sweeps across 6 files (+296/-34), but I found one blocking error-path issue in the new bulk failure flow.

Critical Issues

[Correctness] packages/control-plane/src/scheduler/durable-object.ts:278 - The two bulkFailRuns calls are separate writes inside a single try; if the first category succeeds and the second throws, the method returns before incrementing failure counts for rows already marked failed. Those rows will not be recovered again on the next tick, so failure accounting and auto-pause can be permanently skipped for that partial-success batch.

Suggestions

[Maintainability] packages/control-plane/src/db/automation-store.ts:502 - Consider adding an explicit ORDER BY created_at / ORDER BY started_at before LIMIT so recovery order is deterministic and not dependent on planner/index behavior.

Positive Feedback

The new tests cover the main bulk recovery happy path, auto-pause threshold behavior, and swallowed bulk tracking errors.
Limiting sweep reads and batching writes directly addresses the D1 timeout feedback loop described in the PR.

Verification

Ran npm test -w @open-inspect/control-plane -- src/db/automation-store.test.ts src/scheduler/durable-object.test.ts: 53 tests passed.

Verdict

Request Changes: please fix the partial bulk-failure accounting case before merge.

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/control-plane/src/db/automation-store.ts`:
- Around line 502-518: The bounded recovery queries in AutomationStore should
drain the oldest stuck runs first instead of selecting an arbitrary capped set.
Update getOrphanedStartingRuns and getTimedOutRunningRuns so the SQL referenced
by AutomationStore.ORPHANED_STARTING_RUNS_SQL and
AutomationStore.TIMED_OUT_RUNNING_RUNS_SQL includes an explicit ORDER BY on the
age/timestamp column before the LIMIT ? is applied. Keep the existing cutoff
bind logic, but ensure the ordering makes the sweep process oldest rows first.
- Around line 371-394: The readback query in bulkIncrementFailures is not
aligned with the UPDATE because it returns soft-deleted automations that were
intentionally skipped. Update the SELECT in bulkIncrementFailures to also filter
on deleted_at IS NULL, matching the UPDATE condition so recoverySweep only
receives currently active automations and does not act on stale
consecutive_failures values.

In `@packages/control-plane/src/scheduler/durable-object.ts`:
- Around line 278-324: Do not return immediately from the recovery sweep when
one bulkFailRuns call fails in durable-object.ts; the current try/catch around
the orphaned and timedOut failures prevents bulkIncrementFailures from running
after a partial success. Update the recovery flow in the scheduler recovery
logic so each bulkFailRuns call is handled independently (or otherwise track
partial success) and still accumulate failure counts for any runs that were
successfully marked failed, then keep the existing
bulkIncrementFailures/autoPause path using the affected runs from the sweep.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0940dcbe-15ea-4e46-83ea-4a5211ba9f96

📥 Commits

Reviewing files that changed from the base of the PR and between a6f0618 and 0fc5287.

📒 Files selected for processing (6)

packages/control-plane/src/db/automation-store.test.ts
packages/control-plane/src/db/automation-store.ts
packages/control-plane/src/scheduler/durable-object.test.ts
packages/control-plane/src/scheduler/durable-object.ts
packages/control-plane/test/integration/automation-store.test.ts
packages/control-plane/test/integration/scheduler.test.ts

- Track recoveredRuns per bulkFailRuns call so partial success (orphaned succeeds, timed-out throws) still increments failure counts for the succeeded category instead of returning early - Add ORDER BY created_at/started_at ASC before LIMIT so bounded sweeps drain oldest stuck runs first instead of an arbitrary set - Add deleted_at IS NULL to bulkIncrementFailures readback SELECT to match the UPDATE filter and avoid acting on stale counts

github-actions · 2026-06-24T07:10:31Z

Terraform Validation Results

Step	Status
Format	✅
Init	✅
Validate	✅

Note: Terraform plan was skipped because secrets are not configured. This is expected for external contributors. See docs/GETTING_STARTED.md for setup instructions.

Pushed by: @open-inspect[bot], Action: pull_request

CI typecheck failed: src/scheduler/durable-object.ts(277,26): error TS2552: Cannot find name 'AutomationRunRow'. The type was used for the recoveredRuns array added in 039edae but not added to the existing automation-store import.

github-actions · 2026-06-24T07:12:48Z

Terraform Validation Results

Step	Status
Format	✅
Init	✅
Validate	✅

Note: Terraform plan was skipped because secrets are not configured. This is expected for external contributors. See docs/GETTING_STARTED.md for setup instructions.

Pushed by: @open-inspect[bot], Action: pull_request

github-actions · 2026-06-24T07:29:24Z

Terraform Validation Results

Step	Status
Format	✅
Init	✅
Validate	✅

Note: Terraform plan was skipped because secrets are not configured. This is expected for external contributors. See docs/GETTING_STARTED.md for setup instructions.

Pushed by: @ColeMurray, Action: pull_request

fix: batch recovery sweep updates and add LIMIT to sweep queries

0fc5287

open-inspect Bot reviewed Jun 24, 2026

View reviewed changes

Comment thread packages/control-plane/src/scheduler/durable-object.ts Outdated

open-inspect Bot reviewed Jun 24, 2026

View reviewed changes

Comment thread packages/control-plane/src/scheduler/durable-object.ts Outdated

Comment thread packages/control-plane/src/db/automation-store.ts Outdated

open-inspect Bot previously requested changes Jun 24, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 24, 2026

View reviewed changes

Comment thread packages/control-plane/src/db/automation-store.ts

Comment thread packages/control-plane/src/db/automation-store.ts

Comment thread packages/control-plane/src/scheduler/durable-object.ts Outdated

fix: isolate recovery sweep failure handling

e9ddb69

ColeMurray merged commit 1e99113 into main Jun 24, 2026
18 checks passed

ColeMurray deleted the fix/scheduler-recovery-sweep-batch-and-limit branch June 24, 2026 07:31

Conversation

ColeMurray commented Jun 24, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

LIMIT on recovery sweep queries

Bulk operations replace per-run loop

Files changed

Verification

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

github-actions Bot commented Jun 24, 2026

Terraform Validation Results

Uh oh!

Uh oh!

open-inspect Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

open-inspect Bot left a comment

Choose a reason for hiding this comment

Summary

Critical Issues

Suggestions

Positive Feedback

Verification

Verdict

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Jun 24, 2026

Terraform Validation Results

Uh oh!

github-actions Bot commented Jun 24, 2026

Terraform Validation Results

Uh oh!

github-actions Bot commented Jun 24, 2026

Terraform Validation Results

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ColeMurray commented Jun 24, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 24, 2026 •

edited

Loading