fix(observability): stop anticipated errors polluting error tracking + bound scraper flush by Makisuo · Pull Request #153 · MapleTechLabs/maple

Makisuo · 2026-06-30T20:30:49Z

Why

Auditing Maple's own MCP error data surfaced ~2,000 error spans/6h, most of them not real bugs:

Anticipated 4xx outcomes recorded as Error spans (validation, not-found, unauthorized, integrations-not-connected). Maple's error tracking materializes from traces WHERE StatusCode='Error', and the shared OTLP tracer set status purely from the Effect exit — so any expected error that propagated as a failure marked every span it passed through as Error (empty-message ones surfaced as {} / "Unknown Error"). The otel.status_code attribute was being ignored.
Scraper flush transport errors — apps/scraper POSTed its whole result buffer (up to 10k rows) to /api/internal/scrape-results in one body with no request timeout, overwhelming the API Worker (edge 503) and hanging 87–280s.

What changed

Anticipated errors → `Ok` spans

lib/effect-sdk flushable-tracer: new anticipatedErrorTags option (threaded through the cloudflare/server/client presets). A span failing entirely with an anticipated error now records OTLP Ok and skips the exception event — it stays visible as a trace (latency, http.response.status_code) but never counts as an error. A defect (Die) mixed in still marks Error.
@maple/domain/anticipated-errors: ANTICIPATED_ERROR_TAGS, derived automatically from every domain error annotated with a 4xx httpApiStatus (zero-drift — new 4xx errors are picked up without edits). Wired into maple-api, maple-web, alerting, and the api runtimes (billing-suspension, vcs-sync, AiTriageWorkflow).
Mirrors the ingest gateway's existing otel_status_for_rejection (4xx → Ok) rule on the TypeScript side.
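The rule above can be sketched as a small predicate over a failure cause. This is a minimal illustration, not the tracer's actual code: the `Cause` union below is a hypothetical, simplified stand-in for an Effect-style cause (it is not the Effect library's API), and the interrupt handling follows the Die-vs-Interrupt behaviour described later in this PR. A cause that contains no typed failure at all (for example, only an interrupt) is not modelled here; the sketch simply reports it as not anticipated.

```typescript
// Hypothetical, simplified model of a failure cause (illustrative only).
type Cause =
  | { kind: "Fail"; error: { _tag?: string } } // expected, typed failure
  | { kind: "Die"; defect: unknown } // unexpected defect
  | { kind: "Interrupt" } // fiber interruption
  | { kind: "Both"; left: Cause; right: Cause }; // combined causes

// True when the cause holds at least one typed failure, every typed failure
// carries an anticipated tag, and no defect is present. Interrupts are ignored.
function isFullyAnticipated(cause: Cause, anticipated: ReadonlySet<string>): boolean {
  let sawFailure = false;
  const visit = (c: Cause): boolean => {
    switch (c.kind) {
      case "Fail": {
        sawFailure = true;
        const tag = c.error._tag;
        return tag !== undefined && anticipated.has(tag);
      }
      case "Die":
        return false; // a defect always forces Error
      case "Interrupt":
        return true; // interrupts do not change the outcome
      case "Both":
        return visit(c.left) && visit(c.right);
    }
  };
  return visit(cause) && sawFailure;
}

// Example: tag names taken from this PR; the set itself is illustrative.
const anticipated = new Set(["UnauthorizedError", "AiTriageNotFoundError"]);
const unauthorized: Cause = { kind: "Fail", error: { _tag: "UnauthorizedError" } };
const withDefect: Cause = {
  kind: "Both",
  left: unauthorized,
  right: { kind: "Die", defect: new Error("boom") },
};

console.log(isFullyAnticipated(unauthorized, anticipated)); // true  -> span recorded as Ok
console.log(isFullyAnticipated(withDefect, anticipated)); // false -> span recorded as Error
```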

Scraper flush

ApiClient: 30s request timeout on all API calls (was unbounded).
ScrapeScheduler: extracted sendResultsInChunks — flush POSTs in 1,000-row chunks and re-buffers only the unsent remainder on the first failure.
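The chunked flush described above could look roughly like the following. This is a sketch under stated assumptions: `Row`, `SendChunk`, and the return shape are hypothetical, since the real `ScrapeScheduler` and `ApiClient` signatures are not shown here. Only the 1,000-row chunk size and the "re-buffer the unsent remainder on the first failure" behaviour come from the description.

```typescript
// Hypothetical row and sender types (not the real scraper types).
type Row = { metric: string; value: number };
type SendChunk = (rows: readonly Row[]) => Promise<void>;

const CHUNK_SIZE = 1_000;

// Split rows into consecutive slices of at most `size` rows.
function splitIntoChunks<T>(rows: readonly T[], size: number): T[][] {
  if (size <= 0) throw new RangeError("size must be positive");
  const chunks: T[][] = [];
  for (let i = 0; i < rows.length; i += size) {
    chunks.push(rows.slice(i, i + size));
  }
  return chunks;
}

// Send chunks in order. On the first failure, stop and return every row that
// was not delivered (the failed chunk plus everything after it) so the caller
// can put it back in the buffer.
async function sendResultsInChunks(
  rows: readonly Row[],
  send: SendChunk,
  size: number = CHUNK_SIZE,
): Promise<{ sent: number; unsent: Row[] }> {
  let sent = 0;
  for (const chunk of splitIntoChunks(rows, size)) {
    try {
      await send(chunk);
    } catch {
      return { sent, unsent: rows.slice(sent) };
    }
    sent += chunk.length;
  }
  return { sent, unsent: [] };
}
```

With a 2,500-row buffer and a sender that fails on its second call, this returns `sent: 1000` and an `unsent` array of 1,500 rows, so already-delivered rows are not re-sent on the next flush. In this sketch the request timeout would live inside `send`.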

ingest "Unknown Error" on `/v1/sessionReplays/blob`

The replay/session handlers (meta, blob, sessionEvents) now record otel.status_description with the error message on 5xx, so genuine failures get a categorizable label instead of bucketing as "Unknown Error".
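For illustration, the status rule these handlers follow (4xx recorded as `Ok`, 5xx recorded as `Error` with a description) can be written as a small pure function. The real handlers live in the Rust ingest service; this TypeScript sketch only mirrors the rule. The `OtelStatus` type, the function name, the `Unset` result for non-error responses, and the `HTTP <status>` fallback label are all assumptions made for the example.

```typescript
// Hypothetical span-status shape for the example.
type OtelStatus =
  | { code: "Unset" }
  | { code: "Ok" }
  | { code: "Error"; description: string };

function statusForHttpResponse(status: number, errorMessage?: string): OtelStatus {
  if (status >= 500) {
    // Genuine failure: attach a label so it does not bucket as "Unknown Error".
    // Assumption: fall back to a generic label when no message is available.
    const description =
      errorMessage !== undefined && errorMessage.length > 0 ? errorMessage : `HTTP ${status}`;
    return { code: "Error", description };
  }
  if (status >= 400) {
    // Anticipated client-side outcome: visible as a trace, not counted as an error.
    return { code: "Ok" };
  }
  // Assumption: successful responses leave the status unset.
  return { code: "Unset" };
}

console.log(statusForHttpResponse(503, "blob store unavailable"));
console.log(statusForHttpResponse(404));
```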

Already-fixed (no change)

The maple-cli {} on /v1/traces was already addressed by the IngestRejected refactor (c7dbc41ab); the stale occurrences age out as binaries update.

Reviewer notes

The otel.status_code: "Ok" annotations in vcs-webhook/billing-webhook were no-ops (the tracer ignored the attribute); they still work because those handlers return success exits. Left as-is.
anticipatedErrorTags is reflection-derived via public Schema AST annotations (ast.annotations.httpApiStatus, _tag literal) — guarded by a unit test asserting the four observed tags resolve and 5xx errors don't.
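The derivation can be pictured as a filter over error metadata. The sketch below does not use the real Schema AST; `ErrorDescriptor` is a hypothetical stand-in for "an error's `_tag` literal plus its `httpApiStatus` annotation". The four tag names come from this PR, but the status codes attached to them and the `ExampleUpstreamError` entry are illustrative guesses.

```typescript
// Hypothetical stand-in for a domain error's reflected annotations.
interface ErrorDescriptor {
  readonly tag: string;
  readonly httpApiStatus?: number;
}

// Collect the tags of every error annotated with a 4xx status.
function deriveAnticipatedTags(errors: readonly ErrorDescriptor[]): ReadonlySet<string> {
  const tags = new Set<string>();
  for (const e of errors) {
    const status = e.httpApiStatus;
    if (status !== undefined && status >= 400 && status < 500) {
      tags.add(e.tag);
    }
  }
  return tags;
}

// Illustrative inputs: status codes are assumed, not taken from the codebase.
const domainErrors: ErrorDescriptor[] = [
  { tag: "UnauthorizedError", httpApiStatus: 401 },
  { tag: "RawSqlValidationError", httpApiStatus: 400 },
  { tag: "AiTriageNotFoundError", httpApiStatus: 404 },
  { tag: "IntegrationsNotConnectedError", httpApiStatus: 409 },
  { tag: "ExampleUpstreamError", httpApiStatus: 502 }, // hypothetical 5xx error
];

const anticipatedTags = deriveAnticipatedTags(domainErrors);
console.log(anticipatedTags.has("UnauthorizedError")); // true
console.log(anticipatedTags.has("ExampleUpstreamError")); // false
```

Because the set is computed from the annotations rather than listed by hand, a newly added 4xx error is picked up without touching the list, which is the zero-drift property described above.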

Verification

bun typecheck — 24/24 packages pass
vitest: effect-sdk (64), domain (155), scraper (59) — incl. new tracer / anticipated-errors / chunking tests
ingest: cargo check + otel/api_error tests pass
Post-deploy: re-run find_errors over a fresh window to confirm the buckets are gone / relabeled.

Out of scope (follow-ups)

alerting Postgres permission denied flood (~1,635/6h — runtime role missing GRANTs after table rebuilds).
Clerk digest "Failed to list Clerk members" and the CLERK_JWT_KEY "Invalid RSA key in JSON Web Key" config issue.

🤖 Generated with Claude Code

…+ bound scraper flush

Maple's error tracking materializes from `traces WHERE StatusCode='Error'`, but the shared OTLP tracer set span status purely from the Effect exit, so every expected 4xx outcome that propagated as a failure became an `Error` span (empty-message ones surfaced as `{}` / "Unknown Error"). The `otel.status_code` attribute was ignored.

Separately, the Prometheus scraper's result flush POSTed its whole buffer (up to 10k rows) in one body with no timeout, overwhelming the API Worker (edge 503) and hanging for minutes.

Anticipated errors → Ok spans:
- flushable-tracer: new `anticipatedErrorTags` option; a span failing *entirely* with an anticipated error records OTLP `Ok` and skips the `exception` event (kept as a trace, latency/status intact). Defects still mark `Error`.
- @maple/domain/anticipated-errors: `ANTICIPATED_ERROR_TAGS` derived automatically from every domain error annotated `httpApiStatus` 4xx (zero drift). Wired into maple-api, maple-web, alerting, and the api runtimes.
- Covers UnauthorizedError, RawSqlValidationError, AiTriageNotFoundError, IntegrationsNotConnectedError, and all other 4xx business errors.

Scraper flush:
- ApiClient: 30s request timeout on all API calls (was unbounded).
- ScrapeScheduler: `sendResultsInChunks` POSTs in 1k-row chunks and re-buffers only the unsent remainder on failure.

ingest "Unknown Error":
- replay/session handlers record `otel.status_description` with the error message on 5xx, so genuine failures get a categorizable label.

Verified: bun typecheck (24/24), effect-sdk/domain/scraper vitest suites, ingest cargo check + otel/api_error tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

pullfrog · 2026-06-30T20:30:52Z

Your LLM provider API key was rejected. Rotate the key in your provider dashboard, then update the matching GitHub Actions secret.


⚠️ This action is pinned to a commit SHA, which freezes the cleanup step. Switch to @v0 or keep the SHA fresh with Dependabot. (via Pullfrog, using Claude Opus)

This file was untracked WIP at branch point (never in main, nothing imports it) and got swept into the previous commit, tripping Knip's unused-files check. It's unrelated to this PR. The anticipated-errors wiring stays in the api worker, alerting, vcs-sync, and AiTriageWorkflow runtimes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

github-actions · 2026-06-30T20:35:40Z

Ingest Rust Test + Benchmark Results

Commit: acf2d7f25977fdb026077e1cd9b25681bc0fb387

Load Benchmark — `tinybird` mode, median of 3 run(s) vs main

| Metric | main (median) | PR (median) | Delta |
| --- | --- | --- | --- |
| Requests/sec | 3100.04 | 2751.38 | -11.2% worse |
| Rows/sec | 31000.38 | 27513.84 | -11.2% worse |
| p50 latency | 20.25 ms | 22.93 ms | +13.3% worse |
| p95 latency | 39.16 ms | 30.35 ms | -22.5% better |
| p99 latency | 43.76 ms | 43.22 ms | -1.2% better |
| Export catch-up | 0.026 s | 0.026 s | -0.5% better |
| Max RSS | 100.23 MiB | 102.09 MiB | +1.9% worse |
| Failures | 0 | 0 | same |

Same code path on both sides (same LOAD_TEST_INGEST_MODE), so the delta column is meaningful. Numbers come from ubuntu-latest, which is noisy — treat single-digit-percent deltas as noise.
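As a worked check of how a table row is derived, the sketch below recomputes the Requests/sec row from the per-iteration `request_rps` values in the two benchmark JSON listings. The helper names are made up for the example, and the formula (median of the three runs, then relative change against main) is inferred from the "median of 3 run(s) vs main" heading rather than taken from the CI script.

```typescript
// Median of a non-empty list of numbers.
function median(values: readonly number[]): number {
  if (values.length === 0) throw new RangeError("median of empty list");
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Relative change of `candidate` against `base`, in percent.
function percentDelta(base: number, candidate: number): number {
  return ((candidate - base) / base) * 100;
}

// request_rps per iteration, copied from the main and PR benchmark JSON.
const mainRps = [2801.539407803652, 3104.1661271237917, 3100.037942294387];
const prRps = [2684.8968225430144, 2751.3836550196534, 3132.514790407437];

const mainMedian = median(mainRps); // 3100.037942294387
const prMedian = median(prRps); // 2751.3836550196534
const delta = percentDelta(mainMedian, prMedian);

console.log(mainMedian.toFixed(2), prMedian.toFixed(2), delta.toFixed(1) + "%");
// prints: 3100.04 2751.38 -11.2%
```

Note that the fastest PR iteration (about 3,132 req/s) is above main's median, which is consistent with the caveat about runner noise.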

PR load benchmark JSON (per-iteration)

[
  {
    "ingest_mode": "tinybird",
    "requests": 2000,
    "successes": 2000,
    "failures": 0,
    "rows_sent": 20000,
    "rows_exported": 20000,
    "imports": 23,
    "duration_seconds": 0.744907582,
    "export_catchup_seconds": 0.026594406,
    "request_rps": 2684.8968225430144,
    "row_rps": 26848.968225430148,
    "p50_ms": 22.932,
    "p95_ms": 30.353,
    "p99_ms": 44.752,
    "max_rss_mb": 106.671875,
    "max_cpu_percent": 83.9,
    "avg_cpu_percent": 51.95
  },
  {
    "ingest_mode": "tinybird",
    "requests": 2000,
    "successes": 2000,
    "failures": 0,
    "rows_sent": 20000,
    "rows_exported": 20000,
    "imports": 23,
    "duration_seconds": 0.726906986,
    "export_catchup_seconds": 0.0260684,
    "request_rps": 2751.3836550196534,
    "row_rps": 27513.83655019653,
    "p50_ms": 23.145,
    "p95_ms": 24.847,
    "p99_ms": 43.217,
    "max_rss_mb": 101.03125,
    "max_cpu_percent": 87.5,
    "avg_cpu_percent": 52.05
  },
  {
    "ingest_mode": "tinybird",
    "requests": 2000,
    "successes": 2000,
    "failures": 0,
    "rows_sent": 20000,
    "rows_exported": 20000,
    "imports": 20,
    "duration_seconds": 0.638464663,
    "export_catchup_seconds": 0.026018714,
    "request_rps": 3132.514790407437,
    "row_rps": 31325.14790407437,
    "p50_ms": 20.185,
    "p95_ms": 36.512,
    "p99_ms": 40.27,
    "max_rss_mb": 102.08984375,
    "max_cpu_percent": 96.4,
    "avg_cpu_percent": 64.85
  }
]

main load benchmark JSON (per-iteration)

[
  {
    "ingest_mode": "tinybird",
    "requests": 2000,
    "successes": 2000,
    "failures": 0,
    "rows_sent": 20000,
    "rows_exported": 20000,
    "imports": 24,
    "duration_seconds": 0.713893224,
    "export_catchup_seconds": 0.025764818,
    "request_rps": 2801.539407803652,
    "row_rps": 28015.394078036523,
    "p50_ms": 22.136,
    "p95_ms": 25.94,
    "p99_ms": 47.767,
    "max_rss_mb": 99.90234375,
    "max_cpu_percent": 89.2,
    "avg_cpu_percent": 52.900000000000006
  },
  {
    "ingest_mode": "tinybird",
    "requests": 2000,
    "successes": 2000,
    "failures": 0,
    "rows_sent": 20000,
    "rows_exported": 20000,
    "imports": 22,
    "duration_seconds": 0.644295414,
    "export_catchup_seconds": 0.026399704,
    "request_rps": 3104.1661271237917,
    "row_rps": 31041.661271237917,
    "p50_ms": 20.178,
    "p95_ms": 39.161,
    "p99_ms": 40.798,
    "max_rss_mb": 102.8984375,
    "max_cpu_percent": 96.4,
    "avg_cpu_percent": 56.5
  },
  {
    "ingest_mode": "tinybird",
    "requests": 2000,
    "successes": 2000,
    "failures": 0,
    "rows_sent": 20000,
    "rows_exported": 20000,
    "imports": 22,
    "duration_seconds": 0.645153394,
    "export_catchup_seconds": 0.026190869,
    "request_rps": 3100.037942294387,
    "row_rps": 31000.379422943868,
    "p50_ms": 20.247,
    "p95_ms": 39.917,
    "p99_ms": 43.757,
    "max_rss_mb": 100.2265625,
    "max_cpu_percent": 96.4,
    "avg_cpu_percent": 64.85
  }
]

WAL-acked microbench (`cargo bench --bench ingest_bench`)

   Compiling maple-ingest v0.1.0 (/home/runner/work/maple/maple/apps/ingest)
    Finished `bench` profile [optimized] target(s) in 39.58s
     Running benches/ingest_bench.rs (target/release/deps/ingest_bench-581d2100de893627)
Gnuplot not found, using plotters backend
test ingest_accept/logs_10_rows_wal_ack ... bench:      388591 ns/iter (+/- 8122)
test ingest_accept/traces_10_spans_wal_ack ... bench:      389582 ns/iter (+/- 2390)

cargo test

test telemetry::tests::logs_severity_text_falls_back_to_mapped_number ... ok
test telemetry::tests::metric_encoder_matches_all_tinybird_datasource_shapes ... ok
test telemetry::tests::logs_use_observed_time_when_time_unix_nano_is_zero ... ok
test telemetry::tests::metrics_summary_data_points_are_dropped ... ok
test telemetry::tests::metrics_emit_exactly_the_jsonpaths_declared_in_datasources_ts ... ok
test telemetry::tests::migrate_legacy_shard_relocates_frames_into_lanes ... ok
test telemetry::tests::pipeline_can_start_for_clickhouse_only_without_tinybird_credentials ... ok
test telemetry::tests::clickhouse_export_drops_passworded_non_https_endpoint_without_sending ... ok
test telemetry::tests::pipeline_e2e_exports_gzip_ndjson_to_fake_tinybird ... ok
test telemetry::tests::pipeline_e2e_exports_traces_to_fake_tinybird ... ok
test telemetry::tests::sampling_keeps_errors_even_when_ratio_low ... ok
test telemetry::tests::scraper_contract::scraper_otlp_json_decodes_with_gateway_serde_and_encodes_to_rows ... ok
test telemetry::tests::pipeline_e2e_exports_metrics_to_fake_tinybird ... ok
test telemetry::tests::timestamp_has_nano_precision ... ok
test telemetry::tests::timestamps_match_clickhouse_datetime64_nine_format ... ok
test telemetry::tests::trace_encoder_matches_tinybird_row_shape ... ok
test telemetry::tests::traces_emit_exactly_the_jsonpaths_declared_in_datasources_ts ... ok
test telemetry::tests::wal_partial_drain_advances_cursor_without_truncating ... ok
test telemetry::tests::wal_round_trips_frame ... ok
test telemetry::tests::wal_truncates_after_full_drain_allowing_further_appends ... ok
test telemetry::tests::pipeline_exports_ready_org_to_clickhouse_without_tinybird_calls ... ok
test telemetry::tests::slow_clickhouse_lane_does_not_block_cosharded_tinybird_org ... ok
test telemetry::tests::clickhouse_breaker_sheds_after_threshold_failures ... ok

test result: ok. 33 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.78s

     Running unittests src/bin/load_test.rs (target/debug/deps/load_test-661a0aa1eb3f6d6d)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running unittests src/main.rs (target/debug/deps/maple_ingest-c33bf80c577edb95)

running 37 tests
test autumn::tests::allowed_only_no_balance_field ... ok
test autumn::tests::flat_hardcap_with_remaining_allows ... ok
test autumn::tests::flat_hardcap_depleted_blocks ... ok
test autumn::tests::flat_overage_allows ... ok
test autumn::tests::flat_unlimited_allows ... ok
test autumn::tests::flat_sub_one_gb_remaining_still_allows ... ok
test autumn::tests::nested_balance_object_depleted_blocks ... ok
test autumn::tests::nested_balance_object_with_remaining_allows ... ok
test autumn::tests::nested_overage_allows ... ok
test autumn::tests::null_balance_no_subscription_blocks ... ok
test autumn::tests::unrecognized_shape_returns_none ... ok
test tests::api_error_kind_maps_status_to_stable_label ... ok
test tests::clickhouse_destination_is_terminal_in_dual_mode ... ok
test tests::clickhouse_destination_uses_native_pipeline_even_in_forward_mode ... ok
test tests::clickhouse_target_resolver_decrypts_current_schema_password ... ok
test tests::clickhouse_target_resolver_rejects_password_over_http ... ok
test tests::cloudflare_log_record_maps_body_severity_and_attributes ... ok
test tests::cloudflare_ndjson_payload_parses_multiple_records ... ok
test tests::clickhouse_target_resolver_requires_current_schema ... ok
test tests::cloudflare_timestamps_support_rfc3339_unix_and_unix_nano ... ok
test tests::cloudflare_validation_payload_is_detected ... ok
test tests::decrypt_aes256_gcm_matches_node_crypto_fixture ... ok
test tests::extract_ingest_key_returns_sentinel_literal_unchanged ... ok
test tests::enrichment_overwrites_tenant_fields ... ok
test tests::rejection_span_status_is_error_only_for_5xx ... ok
test tests::hash_is_deterministic ... ok
test tests::resolve_ingest_key_keeps_stale_schema_on_managed_native_path ... ok
test tests::resolve_connector_refreshes_routing_before_auth_cache_expires ... ok
test tests::resolve_ingest_key_refreshes_routing_before_auth_cache_expires ... ok
test tests::resolve_ingest_key_returns_none_when_hash_missing ... ok
test tests::resolve_ingest_key_returns_self_managed_false_when_no_settings_row ... ok
test tests::resolve_ingest_key_returns_self_managed_true_when_active_settings_row ... ok
test tests::sentinel_token_matches_only_exact_literal ... ok
test tests::tinybird_destination_keeps_forward_mode_on_forward_path ... ok
test tests::resolve_ingest_key_serves_last_known_routing_when_refresh_fails ... ok
test autumn::tests::fails_open_on_transport_error ... ok
test tests::forward_mode_switches_ready_org_to_clickhouse_without_forwarding_again ... ok

test result: ok. 37 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.24s

   Doc-tests maple_ingest

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

Makisuo · 2026-06-30T21:21:46Z

@pullfrog review

pullfrog · 2026-06-30T21:21:49Z

Your LLM provider API key was rejected. Rotate the key in your provider dashboard, then update the matching GitHub Actions secret.


Makisuo · 2026-06-30T21:59:18Z

@pullfrog review

pullfrog · 2026-06-30T21:59:22Z

Your LLM provider API key was rejected. Rotate the key in your provider dashboard, then update the matching GitHub Actions secret.


…ush handlers

Extends the replay/session 5xx labeling to handle_signal (traces/logs/metrics, the majority of ingest traffic) and handle_cloudflare_logpush, so genuine 5xx failures there also carry a status message instead of bucketing under "Unknown Error". Addresses Devin review feedback.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Revert the bot's floating @v0 to a pinned commit SHA. The action runs with contents/PR/issues write + provider secrets, so a mutable tag is a supply-chain risk (per Devin review). Pin to the commit v0 currently resolves to; bump manually (or via Dependabot) on new releases. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Resolve apps/ingest/src/main.rs: main independently shipped the same otel.status_description fix (kill "Unknown Error" on 5xx) across all handlers, so take main's canonical version. My redundant message() accessor + conditional variant are dropped; the rest of this PR (scraper flush, SDK anticipated-error tracer, domain tags + wiring) is unaffected.

Covers the Die-vs-Interrupt asymmetry in isFullyAnticipated (Devin review): an interrupt co-occurring with an anticipated failure keeps the span Ok (interrupts are normal fiber control flow), whereas a defect forces Error. Intentional. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Makisuo had a problem deploying to pr-preview June 30, 2026 20:30 — with GitHub Actions Error

Makisuo temporarily deployed to pr-preview June 30, 2026 20:35 — with GitHub Actions Inactive

pullfrog :3

0ec9076

Makisuo temporarily deployed to pr-preview June 30, 2026 21:13 — with GitHub Actions Inactive

This comment was marked as resolved.


Makisuo temporarily deployed to pr-preview June 30, 2026 22:26 — with GitHub Actions Inactive

This comment was marked as resolved.


Makisuo temporarily deployed to pr-preview June 30, 2026 22:34 — with GitHub Actions Inactive

Makisuo temporarily deployed to pr-preview July 1, 2026 10:19 — with GitHub Actions Inactive

This comment was marked as resolved.


Makisuo had a problem deploying to pr-preview July 1, 2026 10:27 — with GitHub Actions Error

Makisuo merged commit 36539be into main Jul 1, 2026
6 of 8 checks passed

Makisuo deleted the fix/anticipated-error-spans-and-scraper-flush branch July 1, 2026 10:30

Makisuo deployed to pr-preview July 1, 2026 10:30 — with GitHub Actions Active

Makisuo merged 7 commits into main from fix/anticipated-error-spans-and-scraper-flush
