fix(ingest): reclassify replay backpressure 503s + label error spans#157
Conversation
Session-replay blob POSTs to a degraded BYO-ClickHouse org surfaced as 139 "Unknown Error" 503s. Root cause: while the ClickHouse export lane stalled (breaker circuit_open, ~7.8k rows dropped), the single-threaded export worker held the lane through its retry budget, so new frames filled the lane's bounded channel and the accept path shed them as PipelineError::Backpressure -> 503. The handler set otel.status_code="Error" but never otel.status_description, so the error MV (which reads StatusMessage, not the error.type attribute) mapped the empty message to the literal "Unknown Error" and collapsed all sheds into one fingerprint. - Reclassify transient Throttled/Backpressure as 429 (retryable, classified Ok so it leaves the error view) via a shared api_error_from_pipeline() used by both the OTLP signal path and the 3 session-replay handlers; QueueUnavailable/ Encode stay 503. Also fixes a latent bug where Throttled on a replay path wrongly returned 503. - Record otel.status_description = error.message on every ingest error span so remaining genuine 5xx surface with a real tag/message, not "Unknown Error". - Add ingest_backpressure_shed_total counter at the shed site (it emitted no metric, which is why this was hard to diagnose). - Bound the lane hold at the source: dedicated short ClickHouse export timeout (INGEST_CLICKHOUSE_EXPORT_TIMEOUT_MS, default 3s, was reusing the 10s forward_timeout) and lower breaker failure_threshold 4 -> 2. Note: Backpressure -> 429 also applies to the OTLP traces/logs/metrics path; both 429 and 503 are retryable per the OTel spec. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Your LLM provider API key was rejected. Rotate the key in your provider dashboard, then update the matching GitHub Actions secret. Update repo secret → · Model settings → · Setup docs → · Ask in Discord →
|
Ingest Rust Test + Benchmark ResultsCommit: Load Benchmark —
|
| Metric | main (median) | PR (median) | Delta |
|---|---|---|---|
| Requests/sec | 1876.96 | 1727.45 | -8.0% worse |
| Rows/sec | 18769.59 | 17274.48 | -8.0% worse |
| p50 latency | 33.13 ms | 36.13 ms | +9.1% worse |
| p95 latency | 41.46 ms | 48.80 ms | +17.7% worse |
| p99 latency | 51.36 ms | 65.94 ms | +28.4% worse |
| Export catch-up | 0.026 s | 0.026 s | +0.4% worse |
| Max RSS | 104.52 MiB | 103.36 MiB | -1.1% better |
| Failures | 0 | 0 | same |
Same code path on both sides (same LOAD_TEST_INGEST_MODE), so the delta column is meaningful. Numbers come from ubuntu-latest, which is noisy — treat single-digit-percent deltas as noise.
PR load benchmark JSON (per-iteration)
[
{
"ingest_mode": "tinybird",
"requests": 2000,
"successes": 2000,
"failures": 0,
"rows_sent": 20000,
"rows_exported": 20000,
"imports": 26,
"duration_seconds": 1.269508861,
"export_catchup_seconds": 0.026087098,
"request_rps": 1575.4123987953794,
"row_rps": 15754.123987953795,
"p50_ms": 38.428,
"p95_ms": 57.233,
"p99_ms": 65.935,
"max_rss_mb": 99.03125,
"max_cpu_percent": 50.0,
"avg_cpu_percent": 38.53333333333333
},
{
"ingest_mode": "tinybird",
"requests": 2000,
"successes": 2000,
"failures": 0,
"rows_sent": 20000,
"rows_exported": 20000,
"imports": 26,
"duration_seconds": 1.157777068,
"export_catchup_seconds": 0.025640517,
"request_rps": 1727.4482759059106,
"row_rps": 17274.482759059105,
"p50_ms": 34.951,
"p95_ms": 48.798,
"p99_ms": 67.379,
"max_rss_mb": 103.359375,
"max_cpu_percent": 56.4,
"avg_cpu_percent": 47.73333333333333
},
{
"ingest_mode": "tinybird",
"requests": 2000,
"successes": 2000,
"failures": 0,
"rows_sent": 20000,
"rows_exported": 20000,
"imports": 26,
"duration_seconds": 1.15215423,
"export_catchup_seconds": 0.026406878,
"request_rps": 1735.8787113075998,
"row_rps": 17358.787113075996,
"p50_ms": 36.133,
"p95_ms": 43.983,
"p99_ms": 49.128,
"max_rss_mb": 107.25,
"max_cpu_percent": 57.1,
"avg_cpu_percent": 43.06666666666666
}
]main load benchmark JSON (per-iteration)
[
{
"ingest_mode": "tinybird",
"requests": 2000,
"successes": 2000,
"failures": 0,
"rows_sent": 20000,
"rows_exported": 20000,
"imports": 25,
"duration_seconds": 1.212915826,
"export_catchup_seconds": 0.025989565,
"request_rps": 1648.9190404874807,
"row_rps": 16489.190404874807,
"p50_ms": 36.852,
"p95_ms": 49.731,
"p99_ms": 58.084,
"max_rss_mb": 104.51953125,
"max_cpu_percent": 55.3,
"avg_cpu_percent": 41.53333333333334
},
{
"ingest_mode": "tinybird",
"requests": 2000,
"successes": 2000,
"failures": 0,
"rows_sent": 20000,
"rows_exported": 20000,
"imports": 26,
"duration_seconds": 1.065553551,
"export_catchup_seconds": 0.026375949,
"request_rps": 1876.9586926185373,
"row_rps": 18769.586926185373,
"p50_ms": 33.133,
"p95_ms": 39.612,
"p99_ms": 40.575,
"max_rss_mb": 102.4296875,
"max_cpu_percent": 59.8,
"avg_cpu_percent": 50.06666666666666
},
{
"ingest_mode": "tinybird",
"requests": 2000,
"successes": 2000,
"failures": 0,
"rows_sent": 20000,
"rows_exported": 20000,
"imports": 24,
"duration_seconds": 1.046649844,
"export_catchup_seconds": 0.025305608,
"request_rps": 1910.8587379677667,
"row_rps": 19108.58737967767,
"p50_ms": 31.689,
"p95_ms": 41.462,
"p99_ms": 51.36,
"max_rss_mb": 107.59375,
"max_cpu_percent": 60.7,
"avg_cpu_percent": 45.4
}
]WAL-acked microbench (cargo bench --bench ingest_bench)
Compiling maple-ingest v0.1.0 (/home/runner/work/maple/maple/apps/ingest)
Finished `bench` profile [optimized] target(s) in 40.04s
Running benches/ingest_bench.rs (target/release/deps/ingest_bench-581d2100de893627)
Gnuplot not found, using plotters backend
test ingest_accept/logs_10_rows_wal_ack ... bench: 578108 ns/iter (+/- 37921)
test ingest_accept/traces_10_spans_wal_ack ... bench: 601082 ns/iter (+/- 51719)
cargo test
test telemetry::tests::metrics_emit_exactly_the_jsonpaths_declared_in_datasources_ts ... ok
test telemetry::tests::metrics_summary_data_points_are_dropped ... ok
test telemetry::tests::migrate_legacy_shard_relocates_frames_into_lanes ... ok
test telemetry::tests::pipeline_can_start_for_clickhouse_only_without_tinybird_credentials ... ok
test telemetry::tests::clickhouse_export_drops_passworded_non_https_endpoint_without_sending ... ok
test telemetry::tests::pipeline_e2e_exports_gzip_ndjson_to_fake_tinybird ... ok
test telemetry::tests::pipeline_e2e_exports_metrics_to_fake_tinybird ... ok
test telemetry::tests::sampling_keeps_errors_even_when_ratio_low ... ok
test telemetry::tests::scraper_contract::scraper_otlp_json_decodes_with_gateway_serde_and_encodes_to_rows ... ok
test telemetry::tests::signal_tag_round_trips_all_variants ... ok
test telemetry::tests::pipeline_e2e_exports_traces_to_fake_tinybird ... ok
test telemetry::tests::telemetry_signal_as_str_is_canonical_lowercase ... ok
test telemetry::tests::timestamp_has_nano_precision ... ok
test telemetry::tests::timestamps_match_clickhouse_datetime64_nine_format ... ok
test telemetry::tests::trace_encoder_matches_tinybird_row_shape ... ok
test telemetry::tests::traces_emit_exactly_the_jsonpaths_declared_in_datasources_ts ... ok
test telemetry::tests::wal_partial_drain_advances_cursor_without_truncating ... ok
test telemetry::tests::wal_round_trips_frame ... ok
test telemetry::tests::wal_truncates_after_full_drain_allowing_further_appends ... ok
test telemetry::tests::pipeline_exports_ready_org_to_clickhouse_without_tinybird_calls ... ok
test telemetry::tests::slow_clickhouse_lane_does_not_block_cosharded_tinybird_org ... ok
test telemetry::tests::clickhouse_breaker_sheds_after_threshold_failures ... ok
test result: ok. 36 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.78s
Running unittests src/bin/load_test.rs (target/debug/deps/load_test-661a0aa1eb3f6d6d)
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
Running unittests src/main.rs (target/debug/deps/maple_ingest-c33bf80c577edb95)
running 38 tests
test autumn::tests::allowed_only_no_balance_field ... ok
test autumn::tests::flat_hardcap_with_remaining_allows ... ok
test autumn::tests::flat_hardcap_depleted_blocks ... ok
test autumn::tests::flat_unlimited_allows ... ok
test autumn::tests::flat_overage_allows ... ok
test autumn::tests::flat_sub_one_gb_remaining_still_allows ... ok
test autumn::tests::nested_balance_object_depleted_blocks ... ok
test autumn::tests::nested_balance_object_with_remaining_allows ... ok
test autumn::tests::nested_overage_allows ... ok
test autumn::tests::null_balance_no_subscription_blocks ... ok
test autumn::tests::unrecognized_shape_returns_none ... ok
test tests::api_error_from_pipeline_maps_variants_to_status ... ok
test tests::api_error_kind_maps_status_to_stable_label ... ok
test tests::clickhouse_destination_uses_native_pipeline_even_in_forward_mode ... ok
test tests::clickhouse_destination_is_terminal_in_dual_mode ... ok
test tests::clickhouse_target_resolver_decrypts_current_schema_password ... ok
test tests::clickhouse_target_resolver_requires_current_schema ... ok
test tests::cloudflare_ndjson_payload_parses_multiple_records ... ok
test tests::cloudflare_log_record_maps_body_severity_and_attributes ... ok
test tests::clickhouse_target_resolver_rejects_password_over_http ... ok
test tests::cloudflare_timestamps_support_rfc3339_unix_and_unix_nano ... ok
test tests::decrypt_aes256_gcm_matches_node_crypto_fixture ... ok
test tests::enrichment_overwrites_tenant_fields ... ok
test tests::cloudflare_validation_payload_is_detected ... ok
test tests::extract_ingest_key_returns_sentinel_literal_unchanged ... ok
test tests::hash_is_deterministic ... ok
test tests::rejection_span_status_is_error_only_for_5xx ... ok
test tests::resolve_ingest_key_keeps_stale_schema_on_managed_native_path ... ok
test tests::resolve_connector_refreshes_routing_before_auth_cache_expires ... ok
test tests::resolve_ingest_key_returns_none_when_hash_missing ... ok
test tests::resolve_ingest_key_refreshes_routing_before_auth_cache_expires ... ok
test tests::resolve_ingest_key_returns_self_managed_false_when_no_settings_row ... ok
test tests::resolve_ingest_key_returns_self_managed_true_when_active_settings_row ... ok
test tests::sentinel_token_matches_only_exact_literal ... ok
test tests::tinybird_destination_keeps_forward_mode_on_forward_path ... ok
test autumn::tests::fails_open_on_transport_error ... ok
test tests::resolve_ingest_key_serves_last_known_routing_when_refresh_fails ... ok
test tests::forward_mode_switches_ready_org_to_clickhouse_without_forwarding_again ... ok
test result: ok. 38 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.23s
Doc-tests maple_ingest
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
…tric
Addresses review feedback: ingest_backpressure_shed_total used Debug format
(PascalCase "Traces"/"SessionReplays") for the signal label, which cannot
correlate with other per-signal metrics that use lowercase. Add
TelemetrySignal::as_str() returning the canonical vocabulary
("traces"/"logs"/"metrics"/"session_replays") — matching signal.path() and the
maple.signal span attribute — and use it at the shed site. Note "session_replays"
(snake_case) is correct here, whereas Debug + to_lowercase would give the
inconsistent "sessionreplays".
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The WAL-acked microbench (benches/ingest_bench.rs) constructs TinybirdConfig directly and was missed when the field was added — `cargo test` doesn't compile benches, so CI's `cargo bench` caught it. Set it to 5s to match the other test configs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
🚩 OTLP signal path labels all process_decoded_payload errors as error_kind="forward"
In handle_signal_inner (main.rs:2400-2401), errors from process_decoded_payload — including native pipeline rejections (throttle, backpressure, WAL I/O) — are all mapped to error_kind="forward". This means ingest_requests_total{error_kind="forward"} conflates actual collector forward failures with native pipeline rejections. The same pre-existing issue exists in the Cloudflare logpush path (main.rs:2593). The PR doesn't change this mapping, but the status code shift (backpressure 503→429) makes it more important to distinguish these in metrics since the error_kind label no longer correlates 1:1 with a status bucket.
(Refers to lines 2400-2401)
Was this helpful? React with 👍 or 👎 to provide feedback.
…on_replays Addresses review feedback: accept_rows_to() hardcoded TelemetrySignal:: SessionReplays for all row-based data, so session-events frames (and thus the new ingest_backpressure_shed_total signal label + the pre-existing export_batch_completed label) reported signal="session_replays" even though the span carries maple.signal="session_events". Add a SessionEvents variant (as_str="session_events", WAL tag 5) and thread a TelemetrySignal parameter through accept_rows_to/accept_rows so each handler passes its true signal (blob/meta -> SessionReplays, events -> SessionEvents). Adds as_str + signal_tag round-trip tests covering the new variant. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

What & why
Production error
fingerprint 2492876445615472305— 139 × "Unknown Error" (2026-06-30 → 07-01) — was a HTTP 503unavailableonPOST /v1/sessionReplays/blobin the ingest gateway, all for one BYO-ClickHouse org (destination=clickhouse).Root cause (confirmed against code + exported
ingest_*metrics):ingest_clickhouse_export_dropped_total{status="circuit_open"}= 29 + 7,793 rows).ExportWorker::runis single-threaded per lane: itrecv()s a batch thenawaits the full retry budget before recv()ing again. Each failing attempt reused the shared HTTP client's 10sforward_timeout, so the initial breaker trip (failure_threshold=4) held the lane ~40s+, plus ~10s per 30s half-open probe.commit_frames) shed them viatry_reserve_owned() → PipelineError::Backpressure. That shed path emitted no metric.PipelineErrortoservice_unavailable(503).otel.status_code="Error"but neverotel.status_description, soStatusMessagewas empty. Maple'serror_events_mvmaps emptyStatusMessage→ literal'Unknown Error'and hashes all such spans into one fingerprint. (Theerror.typeattribute the handler does set is not read by the label logic.)Metrics ruled out the other shed variants for this org:
org_throttled=0,org_queue_bytespeaked ~38 MB (« the 1 GiB cap),wal_shard_full{clickhouse}=0 — leavingBackpressureas the residual.Changes (all in
apps/ingest/src/)api_error_from_pipeline()used by both the OTLP signal path and the 3 session-replay handlers:Throttled/Backpressure→ 429 (retryable;otel_status_for_rejection→Ok, so it leaves the error view),QueueUnavailable/Encode→ 503. Also fixes a latent bug whereThrottledon a replay path wrongly returned 503.otel.status_descriptionon each handler span and recorderror.messageon theErrarm, so any remaining genuine 5xx (WAL failure, storage-not-configured, auth-resolver down, collector transport/5xx) surfaces with a real tag/message instead of "Unknown Error".ingest_backpressure_shed_totalcounter at the shed site (org_id/destination/signal) — the absence of this metric is what made the incident hard to diagnose.INGEST_CLICKHOUSE_EXPORT_TIMEOUT_MS, default 3s, was reusing the 10sforward_timeout) applied per-request, and lower breakerfailure_threshold4 → 2. Worst-case lane hold drops ~40s → ~6s, so the channel stays drained and sheds become rare.Reviewer notes
Backpressure → 429also changes the OTLP path (traces/logs/metrics), not just replays. Both 429 and 503 are retryable per the OTel spec; this keeps expected load-shedding out of the error dashboards."failed to enqueue …: {e}"prefix); the context is preserved in a serverwarn!log. No test asserts on those body strings.Tests
api_error_from_pipeline_maps_variants_to_status(429/503 mapping +otel_status_for_rejection),breaker_default_failure_threshold_is_two.clickhouse_breaker_sheds_after_threshold_failures/ co-shard drain tests cover Part 4.cargo test(72 pass, 0 fail),cargo check --tests,cargo clippy(only pre-existing warnings).Not yet verified (needs deploy): point a BYO-CH org at an unreachable endpoint on staging → confirm 429 responses + no new "Unknown Error" issue for
service=ingest, and thatingest_backpressure_shed_totalappears. The old fingerprint should stop receiving new events once deployed.🤖 Generated with Claude Code