fix(observability): stop anticipated errors polluting error tracking + bound scraper flush #153
…+ bound scraper flush
Maple's error tracking materializes from `traces WHERE StatusCode='Error'`, but
the shared OTLP tracer set span status purely from the Effect exit — so every
expected 4xx outcome that propagated as a failure became an `Error` span
(empty-message ones surfaced as `{}` / "Unknown Error"). The `otel.status_code`
attribute was ignored. Separately, the Prometheus scraper's result flush POSTed
its whole buffer (up to 10k rows) in one body with no timeout, overwhelming the
API Worker (edge 503) and hanging for minutes.
Anticipated errors → Ok spans:
- flushable-tracer: new `anticipatedErrorTags` option; a span failing *entirely*
with an anticipated error records OTLP `Ok` and skips the `exception` event
(kept as a trace, latency/status intact). Defects still mark `Error`.
- @maple/domain/anticipated-errors: `ANTICIPATED_ERROR_TAGS` derived
automatically from every domain error annotated `httpApiStatus` 4xx (zero
drift). Wired into maple-api, maple-web, alerting, and the api runtimes.
- Covers UnauthorizedError, RawSqlValidationError, AiTriageNotFoundError,
IntegrationsNotConnectedError, and all other 4xx business errors.
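The 4xx-based derivation described above can be sketched as follows. This is a minimal illustration, not the PR's code: `DomainErrorDef`, `deriveAnticipatedTags`, and the registry are hypothetical stand-ins. The real implementation reads `httpApiStatus` from Schema annotations, and the status codes and the 5xx entry below are assumed for illustration only.

```typescript
// Hypothetical, simplified stand-in for a domain error definition.
// In the PR the status comes from a Schema annotation; here it is a plain field.
interface DomainErrorDef {
  readonly tag: string; // the error's `_tag` literal
  readonly httpApiStatus: number; // HTTP status the error maps to
}

// Illustrative registry. The tags are named in the PR; the exact status
// codes and the 5xx entry are assumptions made for this sketch.
const domainErrors: readonly DomainErrorDef[] = [
  { tag: "UnauthorizedError", httpApiStatus: 401 },
  { tag: "RawSqlValidationError", httpApiStatus: 400 },
  { tag: "AiTriageNotFoundError", httpApiStatus: 404 },
  { tag: "IntegrationsNotConnectedError", httpApiStatus: 400 },
  { tag: "HypotheticalUpstreamError", httpApiStatus: 502 },
];

// Keep only errors whose status is in the 4xx range.
function deriveAnticipatedTags(
  defs: readonly DomainErrorDef[],
): ReadonlySet<string> {
  return new Set(
    defs
      .filter((d) => d.httpApiStatus >= 400 && d.httpApiStatus < 500)
      .map((d) => d.tag),
  );
}

const anticipatedTags = deriveAnticipatedTags(domainErrors);
console.log([...anticipatedTags]);
```

Because the set is computed from the definitions rather than listed by hand, a new 4xx error added to the registry is picked up without a second edit.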
Scraper flush:
- ApiClient: 30s request timeout on all API calls (was unbounded).
- ScrapeScheduler: `sendResultsInChunks` POSTs in 1k-row chunks and re-buffers
only the unsent remainder on failure.
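A rough sketch of the chunk-and-re-buffer behaviour described above, under stated assumptions: `sendInChunks`, `withTimeout`, `Row`, and `PostFn` are hypothetical names, the transport is injected as a callback, and the timeout is applied per chunk here, whereas the PR places it in `ApiClient`.

```typescript
type Row = Record<string, unknown>;
type PostFn = (rows: readonly Row[]) => Promise<void>;

// Reject if `promise` does not settle within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`timed out after ${ms}ms`)),
      ms,
    );
    promise.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      },
    );
  });
}

// Send `rows` in fixed-size chunks. On the first failure, stop and return
// the unsent remainder (including the failed chunk) so the caller can
// re-buffer it. Returns an empty array when everything was sent.
async function sendInChunks(
  rows: readonly Row[],
  post: PostFn,
  chunkSize = 1000,
  timeoutMs = 30_000,
): Promise<Row[]> {
  for (let offset = 0; offset < rows.length; offset += chunkSize) {
    const chunk = rows.slice(offset, offset + chunkSize);
    try {
      await withTimeout(post(chunk), timeoutMs);
    } catch {
      return rows.slice(offset);
    }
  }
  return [];
}
```

With a 2,500-row buffer and a transport that fails on the third request, the first 2,000 rows are delivered and only the last 500 are returned for re-buffering.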
ingest "Unknown Error":
- replay/session handlers record `otel.status_description` with the error
message on 5xx, so genuine failures get a categorizable label.
Verified: bun typecheck (24/24), effect-sdk/domain/scraper vitest suites, ingest
cargo check + otel/api_error tests.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Your LLM provider API key was rejected. Rotate the key in your provider dashboard, then update the matching GitHub Actions secret. Update repo secret → · Model settings → · Setup docs → · Ask in Discord →
|
This file was untracked WIP at branch point (never in main, nothing imports it) and got swept into the previous commit, tripping Knip's unused-files check. It's unrelated to this PR. The anticipated-errors wiring stays in the api worker, alerting, vcs-sync, and AiTriageWorkflow runtimes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Ingest Rust Test + Benchmark Results
Commit:
Load Benchmark —
|
| Metric | main (median) | PR (median) | Delta |
|---|---|---|---|
| Requests/sec | 3100.04 | 2751.38 | -11.2% worse |
| Rows/sec | 31000.38 | 27513.84 | -11.2% worse |
| p50 latency | 20.25 ms | 22.93 ms | +13.3% worse |
| p95 latency | 39.16 ms | 30.35 ms | -22.5% better |
| p99 latency | 43.76 ms | 43.22 ms | -1.2% better |
| Export catch-up | 0.026 s | 0.026 s | -0.5% better |
| Max RSS | 100.23 MiB | 102.09 MiB | +1.9% worse |
| Failures | 0 | 0 | same |
Same code path on both sides (same LOAD_TEST_INGEST_MODE), so the delta column is meaningful. Numbers come from ubuntu-latest, which is noisy — treat single-digit-percent deltas as noise.
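For readers who want to check the table, the median and delta columns can be recomputed from the per-iteration `request_rps` values in this comment. The helper names `median` and `deltaPercent` are illustrative and are not part of the benchmark tooling.

```typescript
// Median of a non-empty list of numbers.
function median(values: readonly number[]): number {
  if (values.length === 0) throw new Error("median of empty list");
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Relative change of `pr` against `base`, in percent.
function deltaPercent(base: number, pr: number): number {
  return ((pr - base) / base) * 100;
}

// request_rps per iteration, copied from the two JSON dumps.
const mainRps = [2801.539407803652, 3104.1661271237917, 3100.037942294387];
const prRps = [2684.8968225430144, 2751.3836550196534, 3132.514790407437];

const rpsDelta = deltaPercent(median(mainRps), median(prRps));
console.log(rpsDelta.toFixed(1)); // -11.2
```

With only three iterations per side, the median is simply the middle run, so one slow or fast iteration can move the delta by several percent.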
PR load benchmark JSON (per-iteration)
[
{
"ingest_mode": "tinybird",
"requests": 2000,
"successes": 2000,
"failures": 0,
"rows_sent": 20000,
"rows_exported": 20000,
"imports": 23,
"duration_seconds": 0.744907582,
"export_catchup_seconds": 0.026594406,
"request_rps": 2684.8968225430144,
"row_rps": 26848.968225430148,
"p50_ms": 22.932,
"p95_ms": 30.353,
"p99_ms": 44.752,
"max_rss_mb": 106.671875,
"max_cpu_percent": 83.9,
"avg_cpu_percent": 51.95
},
{
"ingest_mode": "tinybird",
"requests": 2000,
"successes": 2000,
"failures": 0,
"rows_sent": 20000,
"rows_exported": 20000,
"imports": 23,
"duration_seconds": 0.726906986,
"export_catchup_seconds": 0.0260684,
"request_rps": 2751.3836550196534,
"row_rps": 27513.83655019653,
"p50_ms": 23.145,
"p95_ms": 24.847,
"p99_ms": 43.217,
"max_rss_mb": 101.03125,
"max_cpu_percent": 87.5,
"avg_cpu_percent": 52.05
},
{
"ingest_mode": "tinybird",
"requests": 2000,
"successes": 2000,
"failures": 0,
"rows_sent": 20000,
"rows_exported": 20000,
"imports": 20,
"duration_seconds": 0.638464663,
"export_catchup_seconds": 0.026018714,
"request_rps": 3132.514790407437,
"row_rps": 31325.14790407437,
"p50_ms": 20.185,
"p95_ms": 36.512,
"p99_ms": 40.27,
"max_rss_mb": 102.08984375,
"max_cpu_percent": 96.4,
"avg_cpu_percent": 64.85
}
]
main load benchmark JSON (per-iteration)
[
{
"ingest_mode": "tinybird",
"requests": 2000,
"successes": 2000,
"failures": 0,
"rows_sent": 20000,
"rows_exported": 20000,
"imports": 24,
"duration_seconds": 0.713893224,
"export_catchup_seconds": 0.025764818,
"request_rps": 2801.539407803652,
"row_rps": 28015.394078036523,
"p50_ms": 22.136,
"p95_ms": 25.94,
"p99_ms": 47.767,
"max_rss_mb": 99.90234375,
"max_cpu_percent": 89.2,
"avg_cpu_percent": 52.900000000000006
},
{
"ingest_mode": "tinybird",
"requests": 2000,
"successes": 2000,
"failures": 0,
"rows_sent": 20000,
"rows_exported": 20000,
"imports": 22,
"duration_seconds": 0.644295414,
"export_catchup_seconds": 0.026399704,
"request_rps": 3104.1661271237917,
"row_rps": 31041.661271237917,
"p50_ms": 20.178,
"p95_ms": 39.161,
"p99_ms": 40.798,
"max_rss_mb": 102.8984375,
"max_cpu_percent": 96.4,
"avg_cpu_percent": 56.5
},
{
"ingest_mode": "tinybird",
"requests": 2000,
"successes": 2000,
"failures": 0,
"rows_sent": 20000,
"rows_exported": 20000,
"imports": 22,
"duration_seconds": 0.645153394,
"export_catchup_seconds": 0.026190869,
"request_rps": 3100.037942294387,
"row_rps": 31000.379422943868,
"p50_ms": 20.247,
"p95_ms": 39.917,
"p99_ms": 43.757,
"max_rss_mb": 100.2265625,
"max_cpu_percent": 96.4,
"avg_cpu_percent": 64.85
}
]
WAL-acked microbench (cargo bench --bench ingest_bench)
Compiling maple-ingest v0.1.0 (/home/runner/work/maple/maple/apps/ingest)
Finished `bench` profile [optimized] target(s) in 39.58s
Running benches/ingest_bench.rs (target/release/deps/ingest_bench-581d2100de893627)
Gnuplot not found, using plotters backend
test ingest_accept/logs_10_rows_wal_ack ... bench: 388591 ns/iter (+/- 8122)
test ingest_accept/traces_10_spans_wal_ack ... bench: 389582 ns/iter (+/- 2390)
cargo test
test telemetry::tests::logs_severity_text_falls_back_to_mapped_number ... ok
test telemetry::tests::metric_encoder_matches_all_tinybird_datasource_shapes ... ok
test telemetry::tests::logs_use_observed_time_when_time_unix_nano_is_zero ... ok
test telemetry::tests::metrics_summary_data_points_are_dropped ... ok
test telemetry::tests::metrics_emit_exactly_the_jsonpaths_declared_in_datasources_ts ... ok
test telemetry::tests::migrate_legacy_shard_relocates_frames_into_lanes ... ok
test telemetry::tests::pipeline_can_start_for_clickhouse_only_without_tinybird_credentials ... ok
test telemetry::tests::clickhouse_export_drops_passworded_non_https_endpoint_without_sending ... ok
test telemetry::tests::pipeline_e2e_exports_gzip_ndjson_to_fake_tinybird ... ok
test telemetry::tests::pipeline_e2e_exports_traces_to_fake_tinybird ... ok
test telemetry::tests::sampling_keeps_errors_even_when_ratio_low ... ok
test telemetry::tests::scraper_contract::scraper_otlp_json_decodes_with_gateway_serde_and_encodes_to_rows ... ok
test telemetry::tests::pipeline_e2e_exports_metrics_to_fake_tinybird ... ok
test telemetry::tests::timestamp_has_nano_precision ... ok
test telemetry::tests::timestamps_match_clickhouse_datetime64_nine_format ... ok
test telemetry::tests::trace_encoder_matches_tinybird_row_shape ... ok
test telemetry::tests::traces_emit_exactly_the_jsonpaths_declared_in_datasources_ts ... ok
test telemetry::tests::wal_partial_drain_advances_cursor_without_truncating ... ok
test telemetry::tests::wal_round_trips_frame ... ok
test telemetry::tests::wal_truncates_after_full_drain_allowing_further_appends ... ok
test telemetry::tests::pipeline_exports_ready_org_to_clickhouse_without_tinybird_calls ... ok
test telemetry::tests::slow_clickhouse_lane_does_not_block_cosharded_tinybird_org ... ok
test telemetry::tests::clickhouse_breaker_sheds_after_threshold_failures ... ok
test result: ok. 33 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.78s
Running unittests src/bin/load_test.rs (target/debug/deps/load_test-661a0aa1eb3f6d6d)
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
Running unittests src/main.rs (target/debug/deps/maple_ingest-c33bf80c577edb95)
running 37 tests
test autumn::tests::allowed_only_no_balance_field ... ok
test autumn::tests::flat_hardcap_with_remaining_allows ... ok
test autumn::tests::flat_hardcap_depleted_blocks ... ok
test autumn::tests::flat_overage_allows ... ok
test autumn::tests::flat_unlimited_allows ... ok
test autumn::tests::flat_sub_one_gb_remaining_still_allows ... ok
test autumn::tests::nested_balance_object_depleted_blocks ... ok
test autumn::tests::nested_balance_object_with_remaining_allows ... ok
test autumn::tests::nested_overage_allows ... ok
test autumn::tests::null_balance_no_subscription_blocks ... ok
test autumn::tests::unrecognized_shape_returns_none ... ok
test tests::api_error_kind_maps_status_to_stable_label ... ok
test tests::clickhouse_destination_is_terminal_in_dual_mode ... ok
test tests::clickhouse_destination_uses_native_pipeline_even_in_forward_mode ... ok
test tests::clickhouse_target_resolver_decrypts_current_schema_password ... ok
test tests::clickhouse_target_resolver_rejects_password_over_http ... ok
test tests::cloudflare_log_record_maps_body_severity_and_attributes ... ok
test tests::cloudflare_ndjson_payload_parses_multiple_records ... ok
test tests::clickhouse_target_resolver_requires_current_schema ... ok
test tests::cloudflare_timestamps_support_rfc3339_unix_and_unix_nano ... ok
test tests::cloudflare_validation_payload_is_detected ... ok
test tests::decrypt_aes256_gcm_matches_node_crypto_fixture ... ok
test tests::extract_ingest_key_returns_sentinel_literal_unchanged ... ok
test tests::enrichment_overwrites_tenant_fields ... ok
test tests::rejection_span_status_is_error_only_for_5xx ... ok
test tests::hash_is_deterministic ... ok
test tests::resolve_ingest_key_keeps_stale_schema_on_managed_native_path ... ok
test tests::resolve_connector_refreshes_routing_before_auth_cache_expires ... ok
test tests::resolve_ingest_key_refreshes_routing_before_auth_cache_expires ... ok
test tests::resolve_ingest_key_returns_none_when_hash_missing ... ok
test tests::resolve_ingest_key_returns_self_managed_false_when_no_settings_row ... ok
test tests::resolve_ingest_key_returns_self_managed_true_when_active_settings_row ... ok
test tests::sentinel_token_matches_only_exact_literal ... ok
test tests::tinybird_destination_keeps_forward_mode_on_forward_path ... ok
test tests::resolve_ingest_key_serves_last_known_routing_when_refresh_fails ... ok
test autumn::tests::fails_open_on_transport_error ... ok
test tests::forward_mode_switches_ready_org_to_clickhouse_without_forwarding_again ... ok
test result: ok. 37 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.24s
Doc-tests maple_ingest
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
|
@pullfrog review |
|
Your LLM provider API key was rejected. Rotate the key in your provider dashboard, then update the matching GitHub Actions secret. Update repo secret → · Model settings → · Setup docs → · Ask in Discord →
|
…ush handlers Extends the replay/session 5xx labeling to handle_signal (traces/logs/metrics — the majority of ingest traffic) and handle_cloudflare_logpush, so genuine 5xx failures there also carry a status message instead of bucketing under "Unknown Error". Addresses Devin review feedback. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Revert the bot's floating @v0 to a pinned commit SHA. The action runs with contents/PR/issues write + provider secrets, so a mutable tag is a supply-chain risk (per Devin review). Pin to the commit v0 currently resolves to; bump manually (or via Dependabot) on new releases. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Resolve apps/ingest/src/main.rs: main independently shipped the same otel.status_description fix (kill "Unknown Error" on 5xx) across all handlers, so take main's canonical version. My redundant message() accessor + conditional variant are dropped; the rest of this PR (scraper flush, SDK anticipated-error tracer, domain tags + wiring) is unaffected.
Covers the Die-vs-Interrupt asymmetry in isFullyAnticipated (Devin review): an interrupt co-occurring with an anticipated failure keeps the span Ok (interrupts are normal fiber control flow), whereas a defect forces Error. Intentional. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
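The asymmetry in that commit can be pictured with a simplified cause model. This is a sketch under assumptions, not the SDK's code: the `Cause` union below is a reduced stand-in for Effect's real `Cause` type, `spanStatus` is a hypothetical helper, and the handling of an interrupt with no failure at all is not specified by the PR (this sketch arbitrarily falls through to `Error` for that case).

```typescript
// Reduced cause model (assumption; Effect's real Cause has more cases).
type Cause =
  | { kind: "Fail"; tag: string } // typed, expected failure
  | { kind: "Die"; defect: unknown } // unexpected defect
  | { kind: "Interrupt" } // fiber interruption
  | { kind: "Both"; left: Cause; right: Cause }; // composite cause

type SpanStatus = "Ok" | "Error";

// Walk the cause tree, collecting failure tags and noting any defect.
function collect(
  cause: Cause,
  acc: { failTags: string[]; hasDefect: boolean },
): void {
  switch (cause.kind) {
    case "Fail":
      acc.failTags.push(cause.tag);
      break;
    case "Die":
      acc.hasDefect = true;
      break;
    case "Interrupt":
      break; // interrupts are ignored: normal fiber control flow
    case "Both":
      collect(cause.left, acc);
      collect(cause.right, acc);
      break;
  }
}

// True when there is at least one failure, every failure is anticipated,
// and no defect is present.
function isFullyAnticipated(
  cause: Cause,
  anticipated: ReadonlySet<string>,
): boolean {
  const acc = { failTags: [] as string[], hasDefect: false };
  collect(cause, acc);
  return (
    !acc.hasDefect &&
    acc.failTags.length > 0 &&
    acc.failTags.every((tag) => anticipated.has(tag))
  );
}

function spanStatus(cause: Cause, anticipated: ReadonlySet<string>): SpanStatus {
  return isFullyAnticipated(cause, anticipated) ? "Ok" : "Error";
}
```

In this model an anticipated failure paired with an interrupt stays `Ok`, while the same failure paired with a defect becomes `Error`.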

Why
Auditing Maple's own MCP error data surfaced ~2,000 error spans/6h, most of them not real bugs:
- Anticipated errors recorded as `Error` spans (validation, not-found, unauthorized, integrations-not-connected). Maple's error tracking materializes from `traces WHERE StatusCode='Error'`, and the shared OTLP tracer set status purely from the Effect exit, so any expected error that propagated as a failure marked every span it passed through as `Error` (empty-message ones surfaced as `{}` / "Unknown Error"). The `otel.status_code` attribute was being ignored.
- `apps/scraper` POSTed its whole result buffer (up to 10k rows) to `/api/internal/scrape-results` in one body with no request timeout, overwhelming the API Worker (edge 503) and hanging 87–280s.
What changed
Anticipated errors → `Ok` spans
- `lib/effect-sdk` flushable-tracer: new `anticipatedErrorTags` option (threaded through the cloudflare/server/client presets). A span failing entirely with an anticipated error now records OTLP `Ok` and skips the `exception` event; it stays visible as a trace (latency, `http.response.status_code`) but never counts as an error. A defect (`Die`) mixed in still marks `Error`.
- `@maple/domain/anticipated-errors`: `ANTICIPATED_ERROR_TAGS`, derived automatically from every domain error annotated with a 4xx `httpApiStatus` (zero-drift: new 4xx errors are picked up without edits). Wired into maple-api, maple-web, alerting, and the api runtimes (billing-suspension, vcs-sync, AiTriageWorkflow).
- `otel_status_for_rejection` (4xx → Ok) rule on the TypeScript side.
Scraper flush
- `ApiClient`: 30s request timeout on all API calls (was unbounded).
- `ScrapeScheduler`: extracted `sendResultsInChunks`: flush POSTs in 1,000-row chunks and re-buffers only the unsent remainder on the first failure.
ingest "Unknown Error" on `/v1/sessionReplays/blob`
- replay/session handlers record `otel.status_description` with the error message on 5xx, so genuine failures get a categorizable label instead of bucketing as "Unknown Error".
Already-fixed (no change)
- `{}` on `/v1/traces` was already addressed by the `IngestRejected` refactor (c7dbc41ab); the stale occurrences age out as binaries update.
Reviewer notes
- The `otel.status_code: "Ok"` annotations in `vcs-webhook`/`billing-webhook` were no-ops (the tracer ignored the attribute); they still work because those handlers return success exits. Left as-is.
- `anticipatedErrorTags` is reflection-derived via public Schema AST annotations (`ast.annotations.httpApiStatus`, `_tag` literal), guarded by a unit test asserting the four observed tags resolve and 5xx errors don't.
Verification
- `bun typecheck`: 24/24 packages pass
- ingest `cargo check` + otel/api_error tests pass
- `find_errors` over a fresh window to confirm the buckets are gone / relabeled.
Out of scope (follow-ups)
- `alerting` Postgres `permission denied` flood (~1,635/6h; runtime role missing GRANTs after table rebuilds).
- `CLERK_JWT_KEY` "Invalid RSA key in JSON Web Key" config issue.
🤖 Generated with Claude Code