fix(alerting): stop warehouse-timeout fan-out storm in error/anomaly crons by Makisuo · Pull Request #158 · MapleTechLabs/maple

Makisuo · 2026-07-01T18:00:00Z

What & why

On 2026-07-01 (03:30–05:10 UTC) the alerting worker fired a burst of ~3,300 WarehouseQueryError: "The operation was aborted due to timeout" from its error_tick / anomaly_tick crons. It was not a Tinybird outage (maple-api warehouse queries were fine, alerting cron volume was flat) — it was a self-inflicted fan-out storm.

Root cause: active-org discovery failed open. A brief warehouse slowdown made the unbounded cross-org discovery query (SELECT OrgId FROM error_events_by_time … GROUP BY OrgId, no OrgId predicate ⇒ no primary-key pruning, no cost profile ⇒ no server-side max_execution_time) time out. On that failure the tick fell back to scanning every known org (hundreds) instead of the ~26 active ones — dumping maximum load exactly when the warehouse was already struggling, which sustained the storm for ~90 min until the blip cleared.

The telemetry fingerprint (per-10-min executeSql spans, distinct orgs scanned):

Time (UTC)	queries	timeouts	distinct orgs	p90
02:30–03:20	~630	0	26	0.3s
03:40–04:20	~38	~28	1	30.2s
04:40	722	483	101	30.2s
05:00	2521	58	252	0.9s
05:10+	~625	0	26	0.4s
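The two rows with exact counts can be turned into a timeout rate and a fan-out factor against the ~26-org baseline. The snippet below is only a worked calculation over the table's figures; the `Bucket` type and helper names are made up for illustration and do not come from the repository.

```typescript
// Hypothetical shape for one 10-minute row of the table above.
interface Bucket {
  window: string;
  queries: number;
  timeouts: number;
  distinctOrgs: number;
}

// Only the rows with exact (non-"~") counts are used.
const peakTimeouts: Bucket = { window: "04:40", queries: 722, timeouts: 483, distinctOrgs: 101 };
const peakFanOut: Bucket = { window: "05:00", queries: 2521, timeouts: 58, distinctOrgs: 252 };

// Baseline active-org count reported for the healthy windows.
const BASELINE_ORGS = 26;

// Share of queries that timed out, in percent, rounded to one decimal.
function timeoutRatePct(b: Bucket): number {
  return Math.round((b.timeouts / b.queries) * 1000) / 10;
}

// How many times more orgs were scanned than in a healthy tick, one decimal.
function fanOutFactor(b: Bucket): number {
  return Math.round((b.distinctOrgs / BASELINE_ORGS) * 10) / 10;
}

console.log(timeoutRatePct(peakTimeouts), fanOutFactor(peakTimeouts)); // 66.9 3.9
console.log(timeoutRatePct(peakFanOut), fanOutFactor(peakFanOut));     // 2.3 9.7
```

Read this way, the 04:40 window shows about two thirds of queries timing out at roughly 4x the normal org count, and the 05:00 window shows the fan-out peaking near 10x even as the warehouse recovered.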

The fix (low-risk hardening, reuses existing machinery)

P0a — bound the trigger: discovery queries now use the existing discovery cost profile (5s server-side) instead of riding the ~30s client abort.
P0b — fail closed: on discovery failure, reuse the last-known active-org set (edge-cached, 6h TTL) instead of fanning out to all orgs. Cold cache → just BYO + the withState / open-incident floor (so resolution/aging keeps running). The "all" sentinel is removed entirely. Wires EdgeCacheService into ErrorsService (Anomaly already had it).
P1a — bound per-org scans: errorIssuesScan now carries the list profile (15s) so a scan can't hang the full ~30s.
P1b — client timeout: each managed query attempt in the executor is bounded at server budget + 5s (or a 30s hard cap), mapped to a non-transient WarehouseQueryError so a queued query fails fast instead of feeding the retry loop. This directly addresses the observed case where a list query with max_execution_time=15 still rode to 30s because Tinybird queued it.
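The fail-closed selection in P0b can be sketched as a small pure function. This is a minimal illustration, not the ErrorsService code: `selectOrgsToScan`, `TickInputs`, and the `source` tag are hypothetical names, and the sketch assumes the caller has already run discovery and looked up the edge cache. Whether the real code also unions the floor into the discovery and cache branches is not stated in the description, so the sketch leaves those sets untouched.

```typescript
type OrgId = string;

// Hypothetical inputs gathered by one cron tick.
interface TickInputs {
  discovered: OrgId[] | null;   // null means the discovery query failed or timed out
  cachedActive: OrgId[] | null; // last-known active set; null means cold or expired cache
  byoOrgs: OrgId[];             // BYO orgs
  statefulOrgs: OrgId[];        // orgs with state or open incidents (the floor)
}

type Source = "discovery" | "cache" | "floor";

function selectOrgsToScan(inputs: TickInputs): { orgs: OrgId[]; source: Source } {
  // Happy path: discovery succeeded, scan exactly the active set.
  if (inputs.discovered !== null) {
    return { orgs: inputs.discovered, source: "discovery" };
  }
  // Discovery failed: reuse the last-known active set instead of every org.
  if (inputs.cachedActive !== null) {
    return { orgs: inputs.cachedActive, source: "cache" };
  }
  // Cold cache: scan only BYO plus the floor so resolution/aging keeps running.
  const floor = Array.from(new Set([...inputs.byoOrgs, ...inputs.statefulOrgs]));
  return { orgs: floor, source: "floor" };
}

// Discovery failed and the cache is cold: idle orgs are skipped entirely.
const coldTick = selectOrgsToScan({
  discovered: null,
  cachedActive: null,
  byoOrgs: ["org-byo"],
  statefulOrgs: ["org-incident", "org-byo"],
});
console.log(coldTick.source, coldTick.orgs); // floor [ 'org-byo', 'org-incident' ]
```

The point of the shape is that no branch can return "all known orgs", so a discovery failure can only shrink or hold the scan set, never grow it.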

Deferred: P2 (per-tick circuit breaker + overall tick timeout) — follow-up.

Reviewer notes

Happy-path behavior is unchanged: discovery success still returns the active set (now also cached); only the failure path changed (was "all", now cached/BYO).
ErrorsService gains an EdgeCacheService dependency — wired in app.ts and the test layer.
Backend cron/warehouse logic, nothing browser-observable.

Tests

New: fail-closed gating (stateful orgs still scanned, idle orgs skipped), profile assertions (discovery→discovery, scan→list), and a hung-query timeout test (bounded + non-transient).

query-engine: 648 pass, typecheck clean
apps/api ErrorsService + anomaly: 85 pass, typecheck clean

🤖 Generated with Claude Code

…crons

A transient warehouse slowdown on 2026-07-01 (03:30–05:10 UTC) was amplified into a ~90-minute self-sustained storm of ~3,300 WarehouseQueryError timeouts in the alerting crons.

Root cause: active-org discovery failed OPEN. On any discovery error the tick scanned *every* known org (hundreds) instead of the ~26 active ones, exactly when the warehouse was already struggling. The telemetry fingerprint was distinct_orgs 26 → 101 → 252 → 26.

Hardening (reuses existing machinery, no behavior change on the happy path):

- P0a: bound the discovery queries with the existing `discovery` cost profile (5s server-side) so the unbounded cross-org scan fails cheap instead of riding the ~30s client abort. (errorActiveOrgsDiscovery + anomaly discovery)
- P0b: fail CLOSED. On discovery failure, reuse the last-known active-org set (edge-cached, 6h TTL) instead of fanning out to all orgs; cold cache falls back to just BYO + the withState / open-incident floor. Drops the "all" sentinel entirely. Wires EdgeCacheService into ErrorsService.
- P1a: give the per-org errorIssuesScan the `list` profile (15s) so a scan can't hang the full ~30s during a slowdown.
- P1b: bound each managed query attempt in the executor at `server budget + 5s` (or a 30s hard cap), mapped to a non-transient WarehouseQueryError so a queued query fails fast instead of feeding retries.

Tests: fail-closed gating (stateful orgs still scanned, idle orgs skipped), profile assertions (discovery→discovery, scan→list), and a hung-query timeout test. query-engine 648 pass, apps/api ErrorsService/anomaly 85 pass; both typecheck clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

pullfrog · 2026-07-01T18:00:07Z

Your LLM provider API key was rejected. Rotate the key in your provider dashboard, then update the matching GitHub Actions secret.


…ing worker

ErrorsService gained an EdgeCacheService dependency (fail-closed active-org cache); the api app layer was updated but the alerting worker's own layer composition still built ErrorsServiceLive without it, failing @maple/alerting typecheck. Mirror the AnomalyDetectionServiceLive wiring.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The unbounded profile is the explicit opt-out from cost limits, but the P1b client timeout still imposed a 30s hard cap on it. Honor the opt-out on the client side too: unbounded queries get no client cap (on Workers they still ride the ambient ~30s fetch limit). Queries with no declared budget keep the 30s hard cap. Adds a test asserting an unbounded query is not client-timed-out.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
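The client-side timeout rules described across P1b and this follow-up commit can be sketched as below. This is one reading of the text under stated assumptions, not the executor's actual code: `clientTimeoutMs`, `withClientTimeout`, and `QueryTimeoutError` are hypothetical names, the real change maps timeouts to a non-transient WarehouseQueryError, and the sketch assumes a declared budget yields `budget + 5s` while the 30s cap applies only when no budget is declared.

```typescript
// Profiles named in the PR text; undefined models "no declared budget".
type CostProfile = "discovery" | "list" | "unbounded" | undefined;

// Server-side budgets quoted in the description (5s and 15s).
const SERVER_BUDGET_MS = { discovery: 5000, list: 15000 };
const GRACE_MS = 5000;     // the "+ 5s" on top of the server budget
const HARD_CAP_MS = 30000; // fallback when no budget is declared

// Returns the client timeout in ms, or null for "no client cap".
function clientTimeoutMs(profile: CostProfile): number | null {
  if (profile === "unbounded") return null;        // explicit opt-out is honored
  if (profile === undefined) return HARD_CAP_MS;   // no declared budget
  return SERVER_BUDGET_MS[profile] + GRACE_MS;     // server budget + 5s
}

// Stand-in for a non-transient error so a retry loop would not pick it up.
class QueryTimeoutError extends Error {
  readonly transient = false;
  constructor(ms: number) {
    super(`query exceeded client timeout of ${ms}ms`);
    this.name = "QueryTimeoutError";
  }
}

// Races one query attempt against the client timeout and aborts it on expiry.
async function withClientTimeout<T>(
  run: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number | null,
): Promise<T> {
  const controller = new AbortController();
  if (timeoutMs === null) return run(controller.signal); // unbounded: no timer
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort();
      reject(new QueryTimeoutError(timeoutMs));
    }, timeoutMs);
  });
  try {
    return await Promise.race([run(controller.signal), timeout]);
  } finally {
    clearTimeout(timer); // avoid a dangling timer once the race settles
  }
}

console.log(clientTimeoutMs("discovery"), clientTimeoutMs("list")); // 10000 20000
console.log(clientTimeoutMs(undefined), clientTimeoutMs("unbounded")); // 30000 null
```

Under this reading, a `list` query that Tinybird queues past its 15s server budget is cut off at 20s on the client rather than riding to the ~30s abort.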

Makisuo had a problem deploying to pr-preview July 1, 2026 18:00 — with GitHub Actions Error


Makisuo had a problem deploying to pr-preview July 1, 2026 18:07 — with GitHub Actions Error

Makisuo temporarily deployed to pr-preview July 1, 2026 18:10 — with GitHub Actions Inactive

Makisuo merged commit 5cb56d5 into main Jul 1, 2026
7 checks passed

Makisuo deleted the fix/alerting-warehouse-timeout-storm branch July 1, 2026 21:12

Makisuo temporarily deployed to pr-preview July 1, 2026 21:12 — with GitHub Actions Inactive

Makisuo merged 3 commits into main from fix/alerting-warehouse-timeout-storm
