fix(observability): stop anticipated errors polluting error tracking + bound scraper flush#153

Merged
Makisuo merged 7 commits into main from fix/anticipated-error-spans-and-scraper-flush on Jul 1, 2026

@Makisuo (Collaborator) commented Jun 30, 2026

Why

Auditing Maple's own MCP error data surfaced ~2,000 error spans/6h, most of them not real bugs:

  • Anticipated 4xx outcomes recorded as Error spans (validation, not-found, unauthorized, integrations-not-connected). Maple's error tracking materializes from traces WHERE StatusCode='Error', and the shared OTLP tracer set status purely from the Effect exit — so any expected error that propagated as a failure marked every span it passed through as Error (empty-message ones surfaced as {} / "Unknown Error"). The otel.status_code attribute was being ignored.
  • Scraper flush transport errors: apps/scraper POSTed its whole result buffer (up to 10k rows) to /api/internal/scrape-results in one body with no request timeout, overwhelming the API Worker (edge 503) and hanging 87–280s.

What changed

Anticipated errors → Ok spans

  • lib/effect-sdk flushable-tracer: new anticipatedErrorTags option (threaded through the cloudflare/server/client presets). A span failing entirely with an anticipated error now records OTLP Ok and skips the exception event — it stays visible as a trace (latency, http.response.status_code) but never counts as an error. A defect (Die) mixed in still marks Error.
  • @maple/domain/anticipated-errors: ANTICIPATED_ERROR_TAGS, derived automatically from every domain error annotated with a 4xx httpApiStatus (zero-drift — new 4xx errors are picked up without edits). Wired into maple-api, maple-web, alerting, and the api runtimes (billing-suspension, vcs-sync, AiTriageWorkflow).
  • Mirrors the ingest gateway's existing otel_status_for_rejection (4xx → Ok) rule on the TypeScript side.
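
A minimal sketch of the status decision described above, under stated assumptions: the real tracer inspects Effect's cause structure, which is modelled here as a hypothetical flat list (`CauseEntry`). The names `isFullyAnticipated` (mentioned in a later commit note on this PR) and `outcomeForFailure` are used for illustration only, and the treatment of a cause containing only interrupts is an assumption.

```typescript
// Hypothetical, simplified model of a failed span's cause. The real tracer
// works on Effect's cause type, which is not reproduced here.
type CauseEntry =
  | { kind: "fail"; tag: string } // typed failure carrying a _tag
  | { kind: "die" } // defect (unexpected exception)
  | { kind: "interrupt" }; // fiber interruption

type SpanOutcome = { status: "Ok" | "Error"; recordException: boolean };

// True only when at least one typed failure is present, every typed failure
// is anticipated, and no defect is mixed in. Interrupts are ignored.
// Assumption: a cause with no typed failure at all is not "anticipated".
function isFullyAnticipated(
  entries: readonly CauseEntry[],
  anticipated: ReadonlySet<string>,
): boolean {
  let sawAnticipatedFailure = false;
  for (const entry of entries) {
    if (entry.kind === "die") return false;
    if (entry.kind === "fail") {
      if (!anticipated.has(entry.tag)) return false;
      sawAnticipatedFailure = true;
    }
  }
  return sawAnticipatedFailure;
}

// Anticipated failures keep the span Ok and skip the exception event;
// everything else is recorded as an Error with the exception event.
function outcomeForFailure(
  entries: readonly CauseEntry[],
  anticipated: ReadonlySet<string>,
): SpanOutcome {
  return isFullyAnticipated(entries, anticipated)
    ? { status: "Ok", recordException: false }
    : { status: "Error", recordException: true };
}
```

With `anticipated = new Set(["UnauthorizedError"])`, a lone `UnauthorizedError` failure yields `Ok`, the same failure alongside a `die` entry yields `Error`, and the same failure alongside an `interrupt` entry stays `Ok`.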

Scraper flush

  • ApiClient: 30s request timeout on all API calls (was unbounded).
  • ScrapeScheduler: extracted sendResultsInChunks — flush POSTs in 1,000-row chunks and re-buffers only the unsent remainder on the first failure.
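
The two changes above can be sketched as follows. This is an illustration under assumptions, not the code from the PR: `sendResultsInChunksSketch`, `withTimeout`, and the `PostChunk` callback type are hypothetical names and signatures, and the timeout helper only bounds how long the caller waits (it does not cancel the underlying request).

```typescript
// Hypothetical callback that POSTs one chunk of rows; rejects on failure.
type PostChunk<T> = (rows: readonly T[]) => Promise<void>;

// Bound how long the caller waits for a promise. Assumption: the real
// ApiClient may implement its 30s timeout differently.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`request timed out after ${ms}ms`)),
      ms,
    );
    promise.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      },
    );
  });
}

// Send rows in fixed-size chunks, in order. On the first failed chunk, stop
// and return everything from that chunk onward so it can be re-buffered.
async function sendResultsInChunksSketch<T>(
  rows: readonly T[],
  post: PostChunk<T>,
  chunkSize = 1000,
): Promise<{ sent: number; unsent: T[] }> {
  let sent = 0;
  while (sent < rows.length) {
    const chunk = rows.slice(sent, sent + chunkSize);
    try {
      await post(chunk);
    } catch {
      return { sent, unsent: rows.slice(sent) };
    }
    sent += chunk.length;
  }
  return { sent, unsent: [] };
}
```

A caller could combine the two by passing `(chunk) => withTimeout(realPost(chunk), 30_000)` as `post`, where `realPost` stands for whatever function performs the HTTP request. Rows already accepted by the API are never re-sent; only the `unsent` tail goes back into the buffer.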

ingest "Unknown Error" on /v1/sessionReplays/blob

  • The replay/session handlers (meta, blob, sessionEvents) now record otel.status_description with the error message on 5xx, so genuine failures get a categorizable label instead of bucketing as "Unknown Error".
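
The labeling rule can be illustrated in TypeScript, although the actual handlers live in the Rust ingest service. This is a sketch under assumptions: `statusForResponse` and the `SpanStatus` shape are hypothetical, and the `HTTP <status>` fallback for an empty message is not taken from the PR.

```typescript
// Hypothetical span status shape: Error always carries a description so the
// failure can be grouped under a meaningful label.
type SpanStatus = { code: "Ok" } | { code: "Error"; description: string };

// 5xx responses become Error spans with a non-empty description;
// everything below 500 (including 4xx rejections) stays Ok.
function statusForResponse(
  httpStatus: number,
  errorMessage: string | undefined,
): SpanStatus {
  if (httpStatus >= 500) {
    // Assumption: fall back to the status code when no message is available.
    const description = errorMessage?.trim() || `HTTP ${httpStatus}`;
    return { code: "Error", description };
  }
  return { code: "Ok" };
}
```

The point of the non-empty description is that error tracking has something to categorize on, instead of collapsing every message-less 5xx into one "Unknown Error" bucket.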

Already-fixed (no change)

  • The maple-cli {} on /v1/traces was already addressed by the IngestRejected refactor (c7dbc41ab); the stale occurrences age out as binaries update.

Reviewer notes

  • The otel.status_code: "Ok" annotations in vcs-webhook/billing-webhook were no-ops (the tracer ignored the attribute); they still work because those handlers return success exits. Left as-is.
  • anticipatedErrorTags is reflection-derived via public Schema AST annotations (ast.annotations.httpApiStatus, _tag literal) — guarded by a unit test asserting the four observed tags resolve and 5xx errors don't.
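
A simplified sketch of the derivation idea, assuming each domain error can be reduced to a tag plus an optional HTTP status. The real code reads these from Schema AST annotations; `ErrorDescriptor`, `deriveAnticipatedTags`, the status numbers, and `UpstreamUnavailableError` are hypothetical and exist only for this example.

```typescript
// Hypothetical flattened view of a domain error: its _tag literal and the
// httpApiStatus annotation, if one is present.
interface ErrorDescriptor {
  tag: string;
  httpApiStatus?: number;
}

// Collect the tags of every error annotated with a 4xx status. Errors with
// a 5xx status or no status annotation are left out.
function deriveAnticipatedTags(
  errors: readonly ErrorDescriptor[],
): ReadonlySet<string> {
  const tags = new Set<string>();
  for (const error of errors) {
    const status = error.httpApiStatus;
    if (status !== undefined && status >= 400 && status < 500) {
      tags.add(error.tag);
    }
  }
  return tags;
}

// Example input. The status codes are assumptions made for illustration.
const exampleErrors: ErrorDescriptor[] = [
  { tag: "UnauthorizedError", httpApiStatus: 401 },
  { tag: "RawSqlValidationError", httpApiStatus: 400 },
  { tag: "AiTriageNotFoundError", httpApiStatus: 404 },
  { tag: "IntegrationsNotConnectedError", httpApiStatus: 400 },
  { tag: "UpstreamUnavailableError", httpApiStatus: 503 }, // hypothetical 5xx
  { tag: "UnannotatedError" }, // hypothetical, no status annotation
];

const exampleTags = deriveAnticipatedTags(exampleErrors);
```

Because the set is computed from the annotations, adding a new 4xx error to the input requires no change to the derivation, which is the "zero-drift" property the description refers to.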

Verification

  • bun typecheck — 24/24 packages pass
  • vitest: effect-sdk (64), domain (155), scraper (59) — incl. new tracer / anticipated-errors / chunking tests
  • ingest: cargo check + otel/api_error tests pass
  • Post-deploy: re-run find_errors over a fresh window to confirm the buckets are gone / relabeled.

Out of scope (follow-ups)

  • alerting Postgres permission denied flood (~1,635/6h — runtime role missing GRANTs after table rebuilds).
  • Clerk digest "Failed to list Clerk members" and the CLERK_JWT_KEY "Invalid RSA key in JSON Web Key" config issue.

🤖 Generated with Claude Code

…+ bound scraper flush

Maple's error tracking materializes from `traces WHERE StatusCode='Error'`, but
the shared OTLP tracer set span status purely from the Effect exit — so every
expected 4xx outcome that propagated as a failure became an `Error` span
(empty-message ones surfaced as `{}` / "Unknown Error"). The `otel.status_code`
attribute was ignored. Separately, the Prometheus scraper's result flush POSTed
its whole buffer (up to 10k rows) in one body with no timeout, overwhelming the
API Worker (edge 503) and hanging for minutes.

Anticipated errors → Ok spans:
- flushable-tracer: new `anticipatedErrorTags` option; a span failing *entirely*
  with an anticipated error records OTLP `Ok` and skips the `exception` event
  (kept as a trace, latency/status intact). Defects still mark `Error`.
- @maple/domain/anticipated-errors: `ANTICIPATED_ERROR_TAGS` derived
  automatically from every domain error annotated `httpApiStatus` 4xx (zero
  drift). Wired into maple-api, maple-web, alerting, and the api runtimes.
- Covers UnauthorizedError, RawSqlValidationError, AiTriageNotFoundError,
  IntegrationsNotConnectedError, and all other 4xx business errors.

Scraper flush:
- ApiClient: 30s request timeout on all API calls (was unbounded).
- ScrapeScheduler: `sendResultsInChunks` POSTs in 1k-row chunks and re-buffers
  only the unsent remainder on failure.

ingest "Unknown Error":
- replay/session handlers record `otel.status_description` with the error
  message on 5xx, so genuine failures get a categorizable label.

Verified: bun typecheck (24/24), effect-sdk/domain/scraper vitest suites, ingest
cargo check + otel/api_error tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@pullfrog (Bot, Contributor) commented Jun 30, 2026

Your LLM provider API key was rejected. Rotate the key in your provider dashboard, then update the matching GitHub Actions secret.

Pullfrog warning: this action is pinned to a commit SHA, which freezes the cleanup step; switch to @v0 or keep the SHA fresh with Dependabot.

This file was untracked WIP at branch point (never in main, nothing imports
it) and got swept into the previous commit, tripping Knip's unused-files check.
It's unrelated to this PR. The anticipated-errors wiring stays in the api
worker, alerting, vcs-sync, and AiTriageWorkflow runtimes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@github-actions (Bot) commented Jun 30, 2026

Ingest Rust Test + Benchmark Results

Commit: acf2d7f25977fdb026077e1cd9b25681bc0fb387

Load Benchmark — tinybird mode, median of 3 run(s) vs main

| Metric | main (median) | PR (median) | Delta |
|---|---|---|---|
| Requests/sec | 3100.04 | 2751.38 | -11.2% (worse) |
| Rows/sec | 31000.38 | 27513.84 | -11.2% (worse) |
| p50 latency | 20.25 ms | 22.93 ms | +13.3% (worse) |
| p95 latency | 39.16 ms | 30.35 ms | -22.5% (better) |
| p99 latency | 43.76 ms | 43.22 ms | -1.2% (better) |
| Export catch-up | 0.026 s | 0.026 s | -0.5% (better) |
| Max RSS | 100.23 MiB | 102.09 MiB | +1.9% (worse) |
| Failures | 0 | 0 | same |

Same code path on both sides (same LOAD_TEST_INGEST_MODE), so the delta column is meaningful. Numbers come from ubuntu-latest, which is noisy — treat single-digit-percent deltas as noise.
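
For readers checking the table, the median and delta columns can be reproduced with a short sketch. The helper names `median` and `percentDelta` are illustrative, not part of the benchmark tooling; the input values are the `request_rps` figures from the per-iteration JSON in this comment.

```typescript
// Median of a non-empty list of numbers (the input array is not mutated).
function median(values: readonly number[]): number {
  if (values.length === 0) throw new Error("median of empty list");
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Relative change of the PR value against the main baseline, in percent.
function percentDelta(base: number, pr: number): number {
  return ((pr - base) / base) * 100;
}

// request_rps per iteration, copied from the JSON blocks in this comment.
const prRps = [2684.8968225430144, 2751.3836550196534, 3132.514790407437];
const mainRps = [2801.539407803652, 3104.1661271237917, 3100.037942294387];

const prMedian = median(prRps); // 2751.38 when rounded to two decimals
const mainMedian = median(mainRps); // 3100.04 when rounded to two decimals
const rpsDelta = percentDelta(mainMedian, prMedian); // about -11.2
```

With only three runs per side, the median is simply the middle run, so one slow iteration on a noisy runner can move the delta by several percent.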

PR load benchmark JSON (per-iteration)
[
  {
    "ingest_mode": "tinybird",
    "requests": 2000,
    "successes": 2000,
    "failures": 0,
    "rows_sent": 20000,
    "rows_exported": 20000,
    "imports": 23,
    "duration_seconds": 0.744907582,
    "export_catchup_seconds": 0.026594406,
    "request_rps": 2684.8968225430144,
    "row_rps": 26848.968225430148,
    "p50_ms": 22.932,
    "p95_ms": 30.353,
    "p99_ms": 44.752,
    "max_rss_mb": 106.671875,
    "max_cpu_percent": 83.9,
    "avg_cpu_percent": 51.95
  },
  {
    "ingest_mode": "tinybird",
    "requests": 2000,
    "successes": 2000,
    "failures": 0,
    "rows_sent": 20000,
    "rows_exported": 20000,
    "imports": 23,
    "duration_seconds": 0.726906986,
    "export_catchup_seconds": 0.0260684,
    "request_rps": 2751.3836550196534,
    "row_rps": 27513.83655019653,
    "p50_ms": 23.145,
    "p95_ms": 24.847,
    "p99_ms": 43.217,
    "max_rss_mb": 101.03125,
    "max_cpu_percent": 87.5,
    "avg_cpu_percent": 52.05
  },
  {
    "ingest_mode": "tinybird",
    "requests": 2000,
    "successes": 2000,
    "failures": 0,
    "rows_sent": 20000,
    "rows_exported": 20000,
    "imports": 20,
    "duration_seconds": 0.638464663,
    "export_catchup_seconds": 0.026018714,
    "request_rps": 3132.514790407437,
    "row_rps": 31325.14790407437,
    "p50_ms": 20.185,
    "p95_ms": 36.512,
    "p99_ms": 40.27,
    "max_rss_mb": 102.08984375,
    "max_cpu_percent": 96.4,
    "avg_cpu_percent": 64.85
  }
]

main load benchmark JSON (per-iteration)
[
  {
    "ingest_mode": "tinybird",
    "requests": 2000,
    "successes": 2000,
    "failures": 0,
    "rows_sent": 20000,
    "rows_exported": 20000,
    "imports": 24,
    "duration_seconds": 0.713893224,
    "export_catchup_seconds": 0.025764818,
    "request_rps": 2801.539407803652,
    "row_rps": 28015.394078036523,
    "p50_ms": 22.136,
    "p95_ms": 25.94,
    "p99_ms": 47.767,
    "max_rss_mb": 99.90234375,
    "max_cpu_percent": 89.2,
    "avg_cpu_percent": 52.900000000000006
  },
  {
    "ingest_mode": "tinybird",
    "requests": 2000,
    "successes": 2000,
    "failures": 0,
    "rows_sent": 20000,
    "rows_exported": 20000,
    "imports": 22,
    "duration_seconds": 0.644295414,
    "export_catchup_seconds": 0.026399704,
    "request_rps": 3104.1661271237917,
    "row_rps": 31041.661271237917,
    "p50_ms": 20.178,
    "p95_ms": 39.161,
    "p99_ms": 40.798,
    "max_rss_mb": 102.8984375,
    "max_cpu_percent": 96.4,
    "avg_cpu_percent": 56.5
  },
  {
    "ingest_mode": "tinybird",
    "requests": 2000,
    "successes": 2000,
    "failures": 0,
    "rows_sent": 20000,
    "rows_exported": 20000,
    "imports": 22,
    "duration_seconds": 0.645153394,
    "export_catchup_seconds": 0.026190869,
    "request_rps": 3100.037942294387,
    "row_rps": 31000.379422943868,
    "p50_ms": 20.247,
    "p95_ms": 39.917,
    "p99_ms": 43.757,
    "max_rss_mb": 100.2265625,
    "max_cpu_percent": 96.4,
    "avg_cpu_percent": 64.85
  }
]

WAL-acked microbench (cargo bench --bench ingest_bench)

   Compiling maple-ingest v0.1.0 (/home/runner/work/maple/maple/apps/ingest)
    Finished `bench` profile [optimized] target(s) in 39.58s
     Running benches/ingest_bench.rs (target/release/deps/ingest_bench-581d2100de893627)
Gnuplot not found, using plotters backend
test ingest_accept/logs_10_rows_wal_ack ... bench:      388591 ns/iter (+/- 8122)
test ingest_accept/traces_10_spans_wal_ack ... bench:      389582 ns/iter (+/- 2390)

cargo test

test telemetry::tests::logs_severity_text_falls_back_to_mapped_number ... ok
test telemetry::tests::metric_encoder_matches_all_tinybird_datasource_shapes ... ok
test telemetry::tests::logs_use_observed_time_when_time_unix_nano_is_zero ... ok
test telemetry::tests::metrics_summary_data_points_are_dropped ... ok
test telemetry::tests::metrics_emit_exactly_the_jsonpaths_declared_in_datasources_ts ... ok
test telemetry::tests::migrate_legacy_shard_relocates_frames_into_lanes ... ok
test telemetry::tests::pipeline_can_start_for_clickhouse_only_without_tinybird_credentials ... ok
test telemetry::tests::clickhouse_export_drops_passworded_non_https_endpoint_without_sending ... ok
test telemetry::tests::pipeline_e2e_exports_gzip_ndjson_to_fake_tinybird ... ok
test telemetry::tests::pipeline_e2e_exports_traces_to_fake_tinybird ... ok
test telemetry::tests::sampling_keeps_errors_even_when_ratio_low ... ok
test telemetry::tests::scraper_contract::scraper_otlp_json_decodes_with_gateway_serde_and_encodes_to_rows ... ok
test telemetry::tests::pipeline_e2e_exports_metrics_to_fake_tinybird ... ok
test telemetry::tests::timestamp_has_nano_precision ... ok
test telemetry::tests::timestamps_match_clickhouse_datetime64_nine_format ... ok
test telemetry::tests::trace_encoder_matches_tinybird_row_shape ... ok
test telemetry::tests::traces_emit_exactly_the_jsonpaths_declared_in_datasources_ts ... ok
test telemetry::tests::wal_partial_drain_advances_cursor_without_truncating ... ok
test telemetry::tests::wal_round_trips_frame ... ok
test telemetry::tests::wal_truncates_after_full_drain_allowing_further_appends ... ok
test telemetry::tests::pipeline_exports_ready_org_to_clickhouse_without_tinybird_calls ... ok
test telemetry::tests::slow_clickhouse_lane_does_not_block_cosharded_tinybird_org ... ok
test telemetry::tests::clickhouse_breaker_sheds_after_threshold_failures ... ok

test result: ok. 33 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.78s

     Running unittests src/bin/load_test.rs (target/debug/deps/load_test-661a0aa1eb3f6d6d)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running unittests src/main.rs (target/debug/deps/maple_ingest-c33bf80c577edb95)

running 37 tests
test autumn::tests::allowed_only_no_balance_field ... ok
test autumn::tests::flat_hardcap_with_remaining_allows ... ok
test autumn::tests::flat_hardcap_depleted_blocks ... ok
test autumn::tests::flat_overage_allows ... ok
test autumn::tests::flat_unlimited_allows ... ok
test autumn::tests::flat_sub_one_gb_remaining_still_allows ... ok
test autumn::tests::nested_balance_object_depleted_blocks ... ok
test autumn::tests::nested_balance_object_with_remaining_allows ... ok
test autumn::tests::nested_overage_allows ... ok
test autumn::tests::null_balance_no_subscription_blocks ... ok
test autumn::tests::unrecognized_shape_returns_none ... ok
test tests::api_error_kind_maps_status_to_stable_label ... ok
test tests::clickhouse_destination_is_terminal_in_dual_mode ... ok
test tests::clickhouse_destination_uses_native_pipeline_even_in_forward_mode ... ok
test tests::clickhouse_target_resolver_decrypts_current_schema_password ... ok
test tests::clickhouse_target_resolver_rejects_password_over_http ... ok
test tests::cloudflare_log_record_maps_body_severity_and_attributes ... ok
test tests::cloudflare_ndjson_payload_parses_multiple_records ... ok
test tests::clickhouse_target_resolver_requires_current_schema ... ok
test tests::cloudflare_timestamps_support_rfc3339_unix_and_unix_nano ... ok
test tests::cloudflare_validation_payload_is_detected ... ok
test tests::decrypt_aes256_gcm_matches_node_crypto_fixture ... ok
test tests::extract_ingest_key_returns_sentinel_literal_unchanged ... ok
test tests::enrichment_overwrites_tenant_fields ... ok
test tests::rejection_span_status_is_error_only_for_5xx ... ok
test tests::hash_is_deterministic ... ok
test tests::resolve_ingest_key_keeps_stale_schema_on_managed_native_path ... ok
test tests::resolve_connector_refreshes_routing_before_auth_cache_expires ... ok
test tests::resolve_ingest_key_refreshes_routing_before_auth_cache_expires ... ok
test tests::resolve_ingest_key_returns_none_when_hash_missing ... ok
test tests::resolve_ingest_key_returns_self_managed_false_when_no_settings_row ... ok
test tests::resolve_ingest_key_returns_self_managed_true_when_active_settings_row ... ok
test tests::sentinel_token_matches_only_exact_literal ... ok
test tests::tinybird_destination_keeps_forward_mode_on_forward_path ... ok
test tests::resolve_ingest_key_serves_last_known_routing_when_refresh_fails ... ok
test autumn::tests::fails_open_on_transport_error ... ok
test tests::forward_mode_switches_ready_org_to_clickhouse_without_forwarding_again ... ok

test result: ok. 37 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.24s

   Doc-tests maple_ingest

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

@Makisuo (Collaborator, Author) commented Jun 30, 2026

@pullfrog review

@pullfrog (Bot, Contributor) commented Jun 30, 2026

Your LLM provider API key was rejected. Rotate the key in your provider dashboard, then update the matching GitHub Actions secret.


@Makisuo (Collaborator, Author) commented Jun 30, 2026

@pullfrog review

@pullfrog (Bot, Contributor) commented Jun 30, 2026

Your LLM provider API key was rejected. Rotate the key in your provider dashboard, then update the matching GitHub Actions secret.


devin-ai-integration[bot]

This comment was marked as resolved.

…ush handlers

Extends the replay/session 5xx labeling to handle_signal (traces/logs/metrics —
the majority of ingest traffic) and handle_cloudflare_logpush, so genuine 5xx
failures there also carry a status message instead of bucketing under
"Unknown Error". Addresses Devin review feedback.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

devin-ai-integration[bot]

This comment was marked as resolved.

Revert the bot's floating @v0 to a pinned commit SHA. The action runs with
contents/PR/issues write + provider secrets, so a mutable tag is a supply-chain
risk (per Devin review). Pin to the commit v0 currently resolves to; bump
manually (or via Dependabot) on new releases.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Resolve apps/ingest/src/main.rs: main independently shipped the same
otel.status_description fix (kill "Unknown Error" on 5xx) across all handlers,
so take main's canonical version. My redundant message() accessor + conditional
variant are dropped; the rest of this PR (scraper flush, SDK anticipated-error
tracer, domain tags + wiring) is unaffected.

devin-ai-integration[bot]

This comment was marked as resolved.

Covers the Die-vs-Interrupt asymmetry in isFullyAnticipated (Devin review): an
interrupt co-occurring with an anticipated failure keeps the span Ok (interrupts
are normal fiber control flow), whereas a defect forces Error. Intentional.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@Makisuo merged commit 36539be into main on Jul 1, 2026 (6 of 8 checks passed)
@Makisuo deleted the fix/anticipated-error-spans-and-scraper-flush branch on July 1, 2026 10:30
@Makisuo deployed to pr-preview on July 1, 2026 10:30 with GitHub Actions (Active)