fix(alerting): stop warehouse-timeout fan-out storm in error/anomaly crons #158
Merged
…crons

A transient warehouse slowdown on 2026-07-01 (03:30–05:10 UTC) was amplified into a ~90-minute self-sustained storm of ~3,300 WarehouseQueryError timeouts in the alerting crons.

Root cause: active-org discovery failed OPEN. On any discovery error the tick scanned *every* known org (hundreds) instead of the ~26 active ones, exactly when the warehouse was already struggling. The telemetry fingerprint was distinct_orgs 26 → 101 → 252 → 26.

Hardening (reuses existing machinery, no behavior change on the happy path):
- P0a: bound the discovery queries with the existing `discovery` cost profile (5s server-side) so the unbounded cross-org scan fails cheap instead of riding the ~30s client abort. (errorActiveOrgsDiscovery + anomaly discovery)
- P0b: fail CLOSED. On discovery failure, reuse the last-known active-org set (edge-cached, 6h TTL) instead of fanning out to all orgs; a cold cache falls back to just BYO + the withState / open-incident floor. Drops the "all" sentinel entirely. Wires EdgeCacheService into ErrorsService.
- P1a: give the per-org errorIssuesScan the `list` profile (15s) so a scan can't hang the full ~30s during a slowdown.
- P1b: bound each managed query attempt in the executor at `server budget + 5s` (or a 30s hard cap), mapped to a non-transient WarehouseQueryError so a queued query fails fast instead of feeding retries.

Tests: fail-closed gating (stateful orgs still scanned, idle orgs skipped), profile assertions (discovery→discovery, scan→list), and a hung-query timeout test. query-engine 648 pass, apps/api ErrorsService/anomaly 85 pass; both typecheck clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ing worker

ErrorsService gained an EdgeCacheService dependency (fail-closed active-org cache). The api app layer was updated, but the alerting worker's own layer composition still built ErrorsServiceLive without it, failing the @maple/alerting typecheck. Mirror the AnomalyDetectionServiceLive wiring.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The unbounded profile is the explicit opt-out from cost limits, but the P1b client timeout still imposed a 30s hard cap on it. Honor the opt-out on the client side too: unbounded queries get no client cap (on Workers they still ride the ambient ~30s fetch limit). Queries with no declared budget keep the 30s hard cap. Adds a test asserting that an unbounded query is not client-timed-out.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
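
The client-timeout rule described in these commits can be sketched as a small pure function. This is an illustration, not the executor code from the PR: `CostProfile`, `SERVER_BUDGET_MS`, and `clientTimeoutMs` are hypothetical names, and only the numbers (5s `discovery`, 15s `list`, a 5s grace on top of the server budget, a 30s hard cap) come from the commit messages. It also assumes that "server budget + 5s (or a 30s hard cap)" means the cap applies only when no budget is declared.

```typescript
// Hypothetical names; only the numeric budgets are taken from the PR text.
type CostProfile = "discovery" | "list" | "unbounded";

// Server-side budgets in milliseconds for the profiles quoted in the PR.
const SERVER_BUDGET_MS: Record<"discovery" | "list", number> = {
  discovery: 5_000,
  list: 15_000,
};

const CLIENT_GRACE_MS = 5_000; // slack allowed on top of the server budget
const HARD_CAP_MS = 30_000; // fallback when no budget is declared

// Returns the client-side timeout for one query attempt,
// or undefined when the query must not be client-timed-out.
function clientTimeoutMs(profile?: CostProfile): number | undefined {
  // "unbounded" is the explicit opt-out: no client cap at all.
  if (profile === "unbounded") return undefined;
  // No declared budget: keep the hard cap.
  if (profile === undefined) return HARD_CAP_MS;
  // Declared budget: server budget plus a small grace period.
  return SERVER_BUDGET_MS[profile] + CLIENT_GRACE_MS;
}
```

Under these assumptions a `list` query is abandoned by the client after 20s rather than riding to ~30s when the warehouse queues it, while an `unbounded` query is limited only by whatever ambient limit the runtime imposes.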

What & why
On 2026-07-01 (03:30–05:10 UTC) the alerting worker fired a burst of ~3,300 `WarehouseQueryError: "The operation was aborted due to timeout"` from its `error_tick` / `anomaly_tick` crons. It was not a Tinybird outage (`maple-api` warehouse queries were fine, alerting cron volume was flat); it was a self-inflicted fan-out storm.

Root cause: active-org discovery failed open. A brief warehouse slowdown made the unbounded cross-org discovery query time out. That query is `SELECT OrgId FROM error_events_by_time … GROUP BY OrgId`: it has no `OrgId` predicate (⇒ no primary-key pruning) and no cost profile (⇒ no server-side `max_execution_time`). On that failure the tick fell back to scanning every known org (hundreds) instead of the ~26 active ones, dumping maximum load exactly when the warehouse was already struggling. This sustained the storm for ~90 min until the blip cleared.

The telemetry fingerprint (per-10-min `executeSql` spans, distinct orgs scanned): 26 → 101 → 252 → 26.

The fix (low-risk hardening, reuses existing machinery)
- P0a: bound the discovery queries with the existing `discovery` cost profile (5s server-side) so they fail cheap instead of riding the ~30s client abort.
- P0b: fail closed. On discovery failure, reuse the last-known active-org set (edge-cached, 6h TTL) instead of fanning out to all orgs; a cold cache falls back to just BYO + the `withState` / open-incident floor (so resolution/aging keeps running). The `"all"` sentinel is removed entirely. Wires `EdgeCacheService` into `ErrorsService` (Anomaly already had it).
- P1a: the per-org `errorIssuesScan` now carries the `list` profile (15s) so a scan can't hang the full ~30s.
- P1b: bound each managed query attempt in the executor at `server budget + 5s` (or a 30s hard cap), mapped to a non-transient `WarehouseQueryError` so a queued query fails fast instead of feeding the retry loop. This directly addresses the observed case where a `list` query with `max_execution_time=15` still rode to 30s because Tinybird queued it.

Deferred: P2 (per-tick circuit breaker + overall tick timeout), as a follow-up.
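
The fail-closed fallback order in P0b can be illustrated with a minimal sketch. It is not the `ErrorsService` implementation: `Discovery`, `Fallbacks`, and `orgsToScan` are hypothetical names, the cache is modelled as a plain optional array rather than an edge cache with a 6h TTL, and the sketch assumes the stateful / open-incident floor is unioned into every result (the PR only states this for the cold-cache case).

```typescript
// Hypothetical types for illustration; not the PR's actual interfaces.
type Discovery =
  | { ok: true; orgs: string[] }
  | { ok: false; error: string };

interface Fallbacks {
  // Last-known active-org set; undefined models a cold or expired cache.
  cachedActiveOrgs: string[] | undefined;
  // BYO orgs, used only when the cache is cold.
  byoOrgs: string[];
  // Orgs with state or open incidents, which must keep being scanned.
  statefulOrgs: string[];
}

// Picks the orgs a tick should scan. There is deliberately no "all orgs" branch.
function orgsToScan(discovery: Discovery, fb: Fallbacks): string[] {
  let base: string[];
  if (discovery.ok) {
    base = discovery.orgs; // happy path: freshly discovered active orgs
  } else if (fb.cachedActiveOrgs !== undefined) {
    base = fb.cachedActiveOrgs; // discovery failed, warm cache: reuse last-known set
  } else {
    base = fb.byoOrgs; // discovery failed, cold cache: BYO only
  }
  // Union with the stateful floor (assumed to apply on every path), then dedupe and sort.
  const all = base.concat(fb.statefulOrgs);
  return all.filter((org, i) => all.indexOf(org) === i).sort();
}
```

The point of the shape is that a discovery failure can only shrink or hold the scan set, never grow it, so a slow warehouse is not handed extra load.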
Reviewer notes
- … `"all"`, now cached/BYO).
- `ErrorsService` gains an `EdgeCacheService` dependency, wired in `app.ts` and the test layer.

Tests
New: fail-closed gating (stateful orgs still scanned, idle orgs skipped), profile assertions (discovery→`discovery`, scan→`list`), and a hung-query timeout test (bounded + non-transient).
- `query-engine`: 648 pass, typecheck clean
- `apps/api` ErrorsService + anomaly: 85 pass, typecheck clean

🤖 Generated with Claude Code
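
Why the "non-transient" half of the timeout test matters can be shown with a simplified retry loop. This is a synchronous sketch under stated assumptions, not the executor's retry policy: `AttemptResult` and `runWithRetries` are hypothetical names, and the real code is asynchronous. It only shows that an error flagged as non-transient ends the loop after one attempt, while a transient one consumes every attempt.

```typescript
// Hypothetical result type; a tagged union stands in for WarehouseQueryError.
type AttemptResult<T> =
  | { kind: "ok"; value: T }
  | { kind: "error"; message: string; transient: boolean };

// Retries only while the last attempt failed with a transient error.
function runWithRetries<T>(
  attempt: () => AttemptResult<T>,
  maxAttempts: number
): { result: AttemptResult<T>; attempts: number } {
  let attempts = 1;
  let result = attempt();
  while (result.kind === "error" && result.transient && attempts < maxAttempts) {
    attempts += 1;
    result = attempt();
  }
  return { result, attempts };
}

// A client-side timeout mapped to a non-transient error stops after one attempt.
const clientTimeout: AttemptResult<number> = {
  kind: "error",
  message: "client timeout",
  transient: false,
};
const timedOut = runWithRetries<number>(() => clientTimeout, 3);

// A transient failure, by contrast, uses up every attempt.
const flaky: AttemptResult<number> = {
  kind: "error",
  message: "connection reset",
  transient: true,
};
const retried = runWithRetries<number>(() => flaky, 3);
```

Here `timedOut.attempts` is 1 and `retried.attempts` is 3. During a warehouse slowdown that difference multiplies across every org in the tick.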