fix(alerting): stop warehouse-timeout fan-out storm in error/anomaly crons #158

Merged
Makisuo merged 3 commits into main from fix/alerting-warehouse-timeout-storm on Jul 1, 2026

Makisuo (Collaborator) commented Jul 1, 2026

What & why

On 2026-07-01 (03:30–05:10 UTC) the alerting worker's error_tick / anomaly_tick crons raised a burst of ~3,300 WarehouseQueryError failures ("The operation was aborted due to timeout"). This was not a Tinybird outage: maple-api warehouse queries were fine and alerting cron volume was flat. It was a self-inflicted fan-out storm.

Root cause: active-org discovery failed open. The cross-org discovery query (SELECT OrgId FROM error_events_by_time … GROUP BY OrgId) was unbounded: it has no OrgId predicate, so no primary-key pruning, and no cost profile, so no server-side max_execution_time. A brief warehouse slowdown made it time out. On that failure the tick fell back to scanning every known org (hundreds) instead of the ~26 active ones, dumping maximum load exactly when the warehouse was already struggling. That sustained the storm for ~90 min until the blip cleared.

The telemetry fingerprint (per-10-min executeSql spans, distinct orgs scanned):

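| Time (UTC)  | Queries | Timeouts | Distinct orgs | p90   |
|-------------|---------|----------|---------------|-------|
| 02:30–03:20 | ~630    | 0        | 26            | 0.3s  |
| 03:40–04:20 | ~38     | ~28      | 1             | 30.2s |
| 04:40       | 722     | 483      | 101           | 30.2s |
| 05:00       | 2521    | 58       | 252           | 0.9s  |
| 05:10+      | ~625    | 0        | 26            | 0.4s  |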
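
To make the fingerprint concrete, the sketch below derives the timeout rate and the fan-out factor for the two storm windows that have exact counts in the table. The type and function names are illustrative and do not come from the codebase; the 26-org baseline is the steady-state value from the first and last rows.

```typescript
// Illustrative only: names are hypothetical, numbers are copied from the table above.
interface WindowStats {
  readonly window: string;
  readonly queries: number;
  readonly timeouts: number;
  readonly distinctOrgs: number;
}

// Steady-state number of active orgs scanned per window (first and last table rows).
const BASELINE_ACTIVE_ORGS = 26;

const stormWindows: readonly WindowStats[] = [
  { window: "04:40", queries: 722, timeouts: 483, distinctOrgs: 101 },
  { window: "05:00", queries: 2521, timeouts: 58, distinctOrgs: 252 },
];

// Fraction of executeSql spans in the window that timed out.
function timeoutRate(w: WindowStats): number {
  return w.timeouts / w.queries;
}

// How many times more orgs were scanned than in steady state.
function fanOutFactor(w: WindowStats): number {
  return w.distinctOrgs / BASELINE_ACTIVE_ORGS;
}

function describeWindow(w: WindowStats): string {
  const rate = (timeoutRate(w) * 100).toFixed(1);
  const fanOut = fanOutFactor(w).toFixed(1);
  return `${w.window}: ${rate}% timeouts, ${fanOut}x orgs`;
}

for (const w of stormWindows) {
  console.log(describeWindow(w));
}
// 04:40: 66.9% timeouts, 3.9x orgs
// 05:00: 2.3% timeouts, 9.7x orgs
```

Read this way, the 05:00 window suggests the warehouse had largely recovered (few timeouts, sub-second p90) while the tick was still scanning nearly ten times the usual org set.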

The fix (low-risk hardening, reuses existing machinery)

  • P0a — bound the trigger: discovery queries now use the existing discovery cost profile (5s server-side) instead of riding the ~30s client abort.
  • P0b — fail closed: on discovery failure, reuse the last-known active-org set (edge-cached, 6h TTL) instead of fanning out to all orgs. Cold cache → just BYO + the withState / open-incident floor (so resolution/aging keeps running). The "all" sentinel is removed entirely. Wires EdgeCacheService into ErrorsService (Anomaly already had it).
  • P1a — bound per-org scans: errorIssuesScan now carries the list profile (15s) so a scan can't hang the full ~30s.
  • P1b — client timeout: each managed query attempt in the executor is bounded at server budget + 5s (or a 30s hard cap), mapped to a non-transient WarehouseQueryError so a queued query fails fast instead of feeding the retry loop. This directly addresses the observed case where a list query with max_execution_time=15 still rode to 30s because Tinybird queued it.
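
The fail-closed selection in P0b can be sketched roughly as follows. This is a minimal illustration, not the actual ErrorsService code: `planScan`, `TickInputs`, and the other names are hypothetical, the cache lookup is reduced to a plain optional value, and unioning the floor into every path is an assumption made here for simplicity.

```typescript
// Hypothetical sketch of fail-closed org selection; none of these names come from the codebase.
type OrgId = string;

type Discovery =
  | { readonly ok: true; readonly activeOrgs: readonly OrgId[] }
  | { readonly ok: false };

interface TickInputs {
  readonly discovery: Discovery;
  /** Last-known active set from the cache; undefined when the cache is cold or expired. */
  readonly cachedActiveOrgs: readonly OrgId[] | undefined;
  /** BYO orgs, scanned even when nothing else is known. */
  readonly byoOrgs: readonly OrgId[];
  /** Orgs with existing state or an open incident (the "floor"). */
  readonly floorOrgs: readonly OrgId[];
}

type Source = "discovery" | "cache" | "cold";

interface ScanPlan {
  readonly source: Source;
  readonly orgs: readonly OrgId[];
}

function planScan(inputs: TickInputs): ScanPlan {
  const { discovery, cachedActiveOrgs, byoOrgs, floorOrgs } = inputs;
  let source: Source;
  let base: readonly OrgId[];
  if (discovery.ok) {
    source = "discovery"; // happy path: the freshly discovered active set
    base = discovery.activeOrgs;
  } else if (cachedActiveOrgs !== undefined) {
    source = "cache"; // discovery failed: reuse the last-known active set
    base = cachedActiveOrgs;
  } else {
    source = "cold"; // discovery failed and nothing cached: BYO orgs only
    base = byoOrgs;
  }
  // Assumption: the floor is always included so resolution/aging keeps running.
  // Note there is no branch that returns every known org.
  const orgs = Array.from(new Set(base.concat(floorOrgs)));
  return { source, orgs };
}
```

The point of the shape is that the failure branches can only shrink the scan set relative to what was last known; an idle org never enters the plan because discovery broke.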

Deferred: P2 (per-tick circuit breaker + overall tick timeout) — follow-up.

Reviewer notes

  • Happy-path behavior is unchanged: discovery success still returns the active set (now also cached); only the failure path changed (was "all", now cached/BYO).
  • ErrorsService gains an EdgeCacheService dependency — wired in app.ts and the test layer.
  • Backend cron/warehouse logic, nothing browser-observable.

Tests

New: fail-closed gating (stateful orgs still scanned, idle orgs skipped), profile assertions (discovery→discovery, scan→list), and a hung-query timeout test (bounded + non-transient).

  • query-engine: 648 pass, typecheck clean
  • apps/api ErrorsService + anomaly: 85 pass, typecheck clean

🤖 Generated with Claude Code


…crons

A transient warehouse slowdown on 2026-07-01 (03:30–05:10 UTC) was amplified
into a ~90-minute self-sustained storm of ~3,300 WarehouseQueryError timeouts
in the alerting crons. Root cause: active-org discovery failed OPEN — on any
discovery error the tick scanned *every* known org (hundreds) instead of the
~26 active ones, exactly when the warehouse was already struggling. The
telemetry fingerprint was distinct_orgs 26 → 101 → 252 → 26.

Hardening (reuses existing machinery, no behavior change on the happy path):

- P0a: bound the discovery queries with the existing `discovery` cost profile
  (5s server-side) so the unbounded cross-org scan fails cheap instead of
  riding the ~30s client abort. (errorActiveOrgsDiscovery + anomaly discovery)
- P0b: fail CLOSED. On discovery failure, reuse the last-known active-org set
  (edge-cached, 6h TTL) instead of fanning out to all orgs; cold cache falls
  back to just BYO + the withState / open-incident floor. Drops the "all"
  sentinel entirely. Wires EdgeCacheService into ErrorsService.
- P1a: give the per-org errorIssuesScan the `list` profile (15s) so a scan
  can't hang the full ~30s during a slowdown.
- P1b: bound each managed query attempt in the executor at
  `server budget + 5s` (or a 30s hard cap), mapped to a non-transient
  WarehouseQueryError so a queued query fails fast instead of feeding retries.
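
The P1b bound can be pictured as a race between the query attempt and a timer, with the timer's failure marked non-transient so a retry loop does not pick it up. The sketch below is an assumption-laden illustration: `withClientTimeout`, `isRetryable`, and the `transient` field are hypothetical names, and the real executor and WarehouseQueryError may be shaped differently.

```typescript
// Hypothetical sketch; the real executor and error type may differ.
class WarehouseQueryError extends Error {
  readonly transient: boolean;

  constructor(message: string, transient: boolean) {
    super(message);
    // Keep instanceof working even when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = "WarehouseQueryError";
    this.transient = transient;
  }
}

// Rejects with a non-transient error if the attempt has not settled within timeoutMs.
// This only stops waiting; it does not cancel the underlying request.
function withClientTimeout<T>(
  attempt: () => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      reject(new WarehouseQueryError(`client timeout after ${timeoutMs}ms`, false));
    }, timeoutMs);
    attempt().then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (error: unknown) => {
        clearTimeout(timer);
        reject(error);
      },
    );
  });
}

// A retry loop would consult this; client timeouts are deliberately not retryable.
function isRetryable(error: unknown): boolean {
  return error instanceof WarehouseQueryError && error.transient;
}
```

Marking the timeout non-transient is the design choice that matters: a query that sat in the warehouse queue for its full budget is unlikely to do better on an immediate retry, and retrying it adds load at the worst moment.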

Tests: fail-closed gating (stateful orgs still scanned, idle orgs skipped),
profile assertions (discovery→discovery, scan→list), and a hung-query timeout
test. query-engine 648 pass, apps/api ErrorsService/anomaly 85 pass; both
typecheck clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

pullfrog Bot (Contributor) commented Jul 1, 2026

Your LLM provider API key was rejected. Rotate the key in your provider dashboard, then update the matching GitHub Actions secret.


devin-ai-integration[bot]

This comment was marked as resolved.

…ing worker

ErrorsService gained an EdgeCacheService dependency (fail-closed active-org
cache); the api app layer was updated but the alerting worker's own layer
composition still built ErrorsServiceLive without it, failing @maple/alerting
typecheck. Mirror the AnomalyDetectionServiceLive wiring.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The unbounded profile is the explicit opt-out from cost limits, but the P1b
client timeout still imposed a 30s hard cap on it. Honor the opt-out on the
client side too: unbounded queries get no client cap (on Workers they still
ride the ambient ~30s fetch limit). Queries with no declared budget keep the
30s hard cap. Adds a test asserting an unbounded query is not client-timed-out.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
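
Taken together, the commits above describe three cases for the client-side cap: a declared server budget gets budget + 5s, the unbounded profile gets no client cap, and a query with no declared budget keeps the 30s hard cap. A minimal sketch of that rule, assuming hypothetical names (`clientTimeoutMs`, `CostProfile`, the constants) and only the two budgets stated in this PR (discovery 5s, list 15s):

```typescript
// Hypothetical sketch of the client-cap rule; names and structure are assumptions.
type CostProfile = "discovery" | "list" | "unbounded";

// Server-side budgets as stated in this PR.
const SERVER_BUDGET_MS: Record<"discovery" | "list", number> = {
  discovery: 5000,
  list: 15000,
};

const CLIENT_GRACE_MS = 5000; // headroom on top of the server budget
const HARD_CAP_MS = 30000; // applied when no budget is declared

// Returns the client timeout in milliseconds, or undefined for "no client cap".
function clientTimeoutMs(profile: CostProfile | undefined): number | undefined {
  if (profile === undefined) {
    return HARD_CAP_MS; // no declared budget: keep the hard cap
  }
  if (profile === "unbounded") {
    // Explicit opt-out from cost limits: no client cap either.
    // On Workers the ambient ~30s fetch limit still applies.
    return undefined;
  }
  return SERVER_BUDGET_MS[profile] + CLIENT_GRACE_MS;
}

console.log(clientTimeoutMs("discovery")); // 10000
console.log(clientTimeoutMs("list")); // 20000
console.log(clientTimeoutMs("unbounded")); // undefined
console.log(clientTimeoutMs(undefined)); // 30000
```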

Makisuo merged commit 5cb56d5 into main on Jul 1, 2026 (7 checks passed)
Makisuo deleted the fix/alerting-warehouse-timeout-storm branch on July 1, 2026 21:12