fix(ingest): reclassify replay backpressure 503s + label error spans by Makisuo · Pull Request #157 · MapleTechLabs/maple

Makisuo · 2026-07-01T09:39:42Z

What & why

Production error fingerprint 2492876445615472305 — 139 × "Unknown Error" (2026-06-30 → 07-01) — was a HTTP 503 unavailable on POST /v1/sessionReplays/blob in the ingest gateway, all for one BYO-ClickHouse org (destination=clickhouse).

Root cause (confirmed against code + exported ingest_* metrics):

The org's BYO-ClickHouse target degraded; the per-org breaker tripped and shed export batches (ingest_clickhouse_export_dropped_total{status="circuit_open"} = 29 + 7,793 rows).
ExportWorker::run is single-threaded per lane: it recv()s a batch then awaits the full retry budget before recv()ing again. Each failing attempt reused the shared HTTP client's 10s forward_timeout, so the initial breaker trip (failure_threshold=4) held the lane ~40s+, plus ~10s per 30s half-open probe.
While the lane was held, new replay frames filled the lane's bounded channel; the accept path (commit_frames) shed them via try_reserve_owned() → PipelineError::Backpressure. That shed path emitted no metric.
The replay handlers blanket-mapped every PipelineError to service_unavailable (503).
The handler set otel.status_code="Error" but never otel.status_description, so StatusMessage was empty. Maple's error_events_mv maps empty StatusMessage → literal 'Unknown Error' and hashes all such spans into one fingerprint. (The error.type attribute the handler does set is not read by the label logic.)

Metrics ruled out the other shed variants for this org: org_throttled=0, org_queue_bytes peaked ~38 MB (« the 1 GiB cap), wal_shard_full{clickhouse}=0 — leaving Backpressure as the residual.

Changes (all in `apps/ingest/src/`)

Reclassify transient backpressure as 429. New shared api_error_from_pipeline() used by both the OTLP signal path and the 3 session-replay handlers: Throttled/Backpressure → 429 (retryable; otel_status_for_rejection → Ok, so it leaves the error view), QueueUnavailable/Encode → 503. Also fixes a latent bug where Throttled on a replay path wrongly returned 503.
Label every ingest error span. Declare otel.status_description on each handler span and record error.message on the Err arm, so any remaining genuine 5xx (WAL failure, storage-not-configured, auth-resolver down, collector transport/5xx) surfaces with a real tag/message instead of "Unknown Error".
ingest_backpressure_shed_total counter at the shed site (org_id/destination/signal) — the absence of this metric is what made the incident hard to diagnose.
Drain fix at the source. Dedicated short ClickHouse export timeout (INGEST_CLICKHOUSE_EXPORT_TIMEOUT_MS, default 3s, was reusing the 10s forward_timeout) applied per-request, and lower breaker failure_threshold 4 → 2. Worst-case lane hold drops ~40s → ~6s, so the channel stays drained and sheds become rare.

Reviewer notes

Backpressure → 429 also changes the OTLP path (traces/logs/metrics), not just replays. Both 429 and 503 are retryable per the OTel spec; this keeps expected load-shedding out of the error dashboards.
Client body messages on the replay paths changed (dropped the "failed to enqueue …: {e}" prefix); the context is preserved in a server warn! log. No test asserts on those body strings.
Per-destination co-shard isolation is preserved — the drain fix only shortens how long a ClickHouse lane holds itself, and keeps the single recovery probe intact.

Tests

New: api_error_from_pipeline_maps_variants_to_status (429/503 mapping + otel_status_for_rejection), breaker_default_failure_threshold_is_two.
Existing clickhouse_breaker_sheds_after_threshold_failures / co-shard drain tests cover Part 4.
cargo test (72 pass, 0 fail), cargo check --tests, cargo clippy (only pre-existing warnings).

Not yet verified (needs deploy): point a BYO-CH org at an unreachable endpoint on staging → confirm 429 responses + no new "Unknown Error" issue for service=ingest, and that ingest_backpressure_shed_total appears. The old fingerprint should stop receiving new events once deployed.

🤖 Generated with Claude Code

Session-replay blob POSTs to a degraded BYO-ClickHouse org surfaced as 139 "Unknown Error" 503s. Root cause: while the ClickHouse export lane stalled (breaker circuit_open, ~7.8k rows dropped), the single-threaded export worker held the lane through its retry budget, so new frames filled the lane's bounded channel and the accept path shed them as PipelineError::Backpressure -> 503. The handler set otel.status_code="Error" but never otel.status_description, so the error MV (which reads StatusMessage, not the error.type attribute) mapped the empty message to the literal "Unknown Error" and collapsed all sheds into one fingerprint. - Reclassify transient Throttled/Backpressure as 429 (retryable, classified Ok so it leaves the error view) via a shared api_error_from_pipeline() used by both the OTLP signal path and the 3 session-replay handlers; QueueUnavailable/ Encode stay 503. Also fixes a latent bug where Throttled on a replay path wrongly returned 503. - Record otel.status_description = error.message on every ingest error span so remaining genuine 5xx surface with a real tag/message, not "Unknown Error". - Add ingest_backpressure_shed_total counter at the shed site (it emitted no metric, which is why this was hard to diagnose). - Bound the lane hold at the source: dedicated short ClickHouse export timeout (INGEST_CLICKHOUSE_EXPORT_TIMEOUT_MS, default 3s, was reusing the 10s forward_timeout) and lower breaker failure_threshold 4 -> 2. Note: Backpressure -> 429 also applies to the OTLP traces/logs/metrics path; both 429 and 503 are retryable per the OTel spec. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

pullfrog · 2026-07-01T09:39:46Z

Your LLM provider API key was rejected. Rotate the key in your provider dashboard, then update the matching GitHub Actions secret.

Update repo secret → · Model settings → · Setup docs → · Ask in Discord →

^{｜ ⚠️ this action is pinned to a commit SHA, which freezes the cleanup step — switch to @v0 or keep the SHA fresh with Dependabot ｜ Rerun failed job ➔ ｜ View workflow run ｜ via Pullfrog ｜ Using Claude Opus ｜ 𝕏}

github-actions · 2026-07-01T09:44:47Z

Ingest Rust Test + Benchmark Results

Commit: ad3ebb895c39dfc6858b4321a681af235cef6db4

Load Benchmark — `tinybird` mode, median of 3 run(s) vs main

Metric	main (median)	PR (median)	Delta
Requests/sec	1876.96	1727.45	-8.0% worse
Rows/sec	18769.59	17274.48	-8.0% worse
p50 latency	33.13 ms	36.13 ms	+9.1% worse
p95 latency	41.46 ms	48.80 ms	+17.7% worse
p99 latency	51.36 ms	65.94 ms	+28.4% worse
Export catch-up	0.026 s	0.026 s	+0.4% worse
Max RSS	104.52 MiB	103.36 MiB	-1.1% better
Failures	0	0	same

Same code path on both sides (same LOAD_TEST_INGEST_MODE), so the delta column is meaningful. Numbers come from ubuntu-latest, which is noisy — treat single-digit-percent deltas as noise.

PR load benchmark JSON (per-iteration)

[
  {
    "ingest_mode": "tinybird",
    "requests": 2000,
    "successes": 2000,
    "failures": 0,
    "rows_sent": 20000,
    "rows_exported": 20000,
    "imports": 26,
    "duration_seconds": 1.269508861,
    "export_catchup_seconds": 0.026087098,
    "request_rps": 1575.4123987953794,
    "row_rps": 15754.123987953795,
    "p50_ms": 38.428,
    "p95_ms": 57.233,
    "p99_ms": 65.935,
    "max_rss_mb": 99.03125,
    "max_cpu_percent": 50.0,
    "avg_cpu_percent": 38.53333333333333
  },
  {
    "ingest_mode": "tinybird",
    "requests": 2000,
    "successes": 2000,
    "failures": 0,
    "rows_sent": 20000,
    "rows_exported": 20000,
    "imports": 26,
    "duration_seconds": 1.157777068,
    "export_catchup_seconds": 0.025640517,
    "request_rps": 1727.4482759059106,
    "row_rps": 17274.482759059105,
    "p50_ms": 34.951,
    "p95_ms": 48.798,
    "p99_ms": 67.379,
    "max_rss_mb": 103.359375,
    "max_cpu_percent": 56.4,
    "avg_cpu_percent": 47.73333333333333
  },
  {
    "ingest_mode": "tinybird",
    "requests": 2000,
    "successes": 2000,
    "failures": 0,
    "rows_sent": 20000,
    "rows_exported": 20000,
    "imports": 26,
    "duration_seconds": 1.15215423,
    "export_catchup_seconds": 0.026406878,
    "request_rps": 1735.8787113075998,
    "row_rps": 17358.787113075996,
    "p50_ms": 36.133,
    "p95_ms": 43.983,
    "p99_ms": 49.128,
    "max_rss_mb": 107.25,
    "max_cpu_percent": 57.1,
    "avg_cpu_percent": 43.06666666666666
  }
]

main load benchmark JSON (per-iteration)

[
  {
    "ingest_mode": "tinybird",
    "requests": 2000,
    "successes": 2000,
    "failures": 0,
    "rows_sent": 20000,
    "rows_exported": 20000,
    "imports": 25,
    "duration_seconds": 1.212915826,
    "export_catchup_seconds": 0.025989565,
    "request_rps": 1648.9190404874807,
    "row_rps": 16489.190404874807,
    "p50_ms": 36.852,
    "p95_ms": 49.731,
    "p99_ms": 58.084,
    "max_rss_mb": 104.51953125,
    "max_cpu_percent": 55.3,
    "avg_cpu_percent": 41.53333333333334
  },
  {
    "ingest_mode": "tinybird",
    "requests": 2000,
    "successes": 2000,
    "failures": 0,
    "rows_sent": 20000,
    "rows_exported": 20000,
    "imports": 26,
    "duration_seconds": 1.065553551,
    "export_catchup_seconds": 0.026375949,
    "request_rps": 1876.9586926185373,
    "row_rps": 18769.586926185373,
    "p50_ms": 33.133,
    "p95_ms": 39.612,
    "p99_ms": 40.575,
    "max_rss_mb": 102.4296875,
    "max_cpu_percent": 59.8,
    "avg_cpu_percent": 50.06666666666666
  },
  {
    "ingest_mode": "tinybird",
    "requests": 2000,
    "successes": 2000,
    "failures": 0,
    "rows_sent": 20000,
    "rows_exported": 20000,
    "imports": 24,
    "duration_seconds": 1.046649844,
    "export_catchup_seconds": 0.025305608,
    "request_rps": 1910.8587379677667,
    "row_rps": 19108.58737967767,
    "p50_ms": 31.689,
    "p95_ms": 41.462,
    "p99_ms": 51.36,
    "max_rss_mb": 107.59375,
    "max_cpu_percent": 60.7,
    "avg_cpu_percent": 45.4
  }
]

WAL-acked microbench (`cargo bench --bench ingest_bench`)

   Compiling maple-ingest v0.1.0 (/home/runner/work/maple/maple/apps/ingest)
    Finished `bench` profile [optimized] target(s) in 40.04s
     Running benches/ingest_bench.rs (target/release/deps/ingest_bench-581d2100de893627)
Gnuplot not found, using plotters backend
test ingest_accept/logs_10_rows_wal_ack ... bench:      578108 ns/iter (+/- 37921)
test ingest_accept/traces_10_spans_wal_ack ... bench:      601082 ns/iter (+/- 51719)

cargo test

test telemetry::tests::metrics_emit_exactly_the_jsonpaths_declared_in_datasources_ts ... ok
test telemetry::tests::metrics_summary_data_points_are_dropped ... ok
test telemetry::tests::migrate_legacy_shard_relocates_frames_into_lanes ... ok
test telemetry::tests::pipeline_can_start_for_clickhouse_only_without_tinybird_credentials ... ok
test telemetry::tests::clickhouse_export_drops_passworded_non_https_endpoint_without_sending ... ok
test telemetry::tests::pipeline_e2e_exports_gzip_ndjson_to_fake_tinybird ... ok
test telemetry::tests::pipeline_e2e_exports_metrics_to_fake_tinybird ... ok
test telemetry::tests::sampling_keeps_errors_even_when_ratio_low ... ok
test telemetry::tests::scraper_contract::scraper_otlp_json_decodes_with_gateway_serde_and_encodes_to_rows ... ok
test telemetry::tests::signal_tag_round_trips_all_variants ... ok
test telemetry::tests::pipeline_e2e_exports_traces_to_fake_tinybird ... ok
test telemetry::tests::telemetry_signal_as_str_is_canonical_lowercase ... ok
test telemetry::tests::timestamp_has_nano_precision ... ok
test telemetry::tests::timestamps_match_clickhouse_datetime64_nine_format ... ok
test telemetry::tests::trace_encoder_matches_tinybird_row_shape ... ok
test telemetry::tests::traces_emit_exactly_the_jsonpaths_declared_in_datasources_ts ... ok
test telemetry::tests::wal_partial_drain_advances_cursor_without_truncating ... ok
test telemetry::tests::wal_round_trips_frame ... ok
test telemetry::tests::wal_truncates_after_full_drain_allowing_further_appends ... ok
test telemetry::tests::pipeline_exports_ready_org_to_clickhouse_without_tinybird_calls ... ok
test telemetry::tests::slow_clickhouse_lane_does_not_block_cosharded_tinybird_org ... ok
test telemetry::tests::clickhouse_breaker_sheds_after_threshold_failures ... ok

test result: ok. 36 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.78s

     Running unittests src/bin/load_test.rs (target/debug/deps/load_test-661a0aa1eb3f6d6d)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running unittests src/main.rs (target/debug/deps/maple_ingest-c33bf80c577edb95)

running 38 tests
test autumn::tests::allowed_only_no_balance_field ... ok
test autumn::tests::flat_hardcap_with_remaining_allows ... ok
test autumn::tests::flat_hardcap_depleted_blocks ... ok
test autumn::tests::flat_unlimited_allows ... ok
test autumn::tests::flat_overage_allows ... ok
test autumn::tests::flat_sub_one_gb_remaining_still_allows ... ok
test autumn::tests::nested_balance_object_depleted_blocks ... ok
test autumn::tests::nested_balance_object_with_remaining_allows ... ok
test autumn::tests::nested_overage_allows ... ok
test autumn::tests::null_balance_no_subscription_blocks ... ok
test autumn::tests::unrecognized_shape_returns_none ... ok
test tests::api_error_from_pipeline_maps_variants_to_status ... ok
test tests::api_error_kind_maps_status_to_stable_label ... ok
test tests::clickhouse_destination_uses_native_pipeline_even_in_forward_mode ... ok
test tests::clickhouse_destination_is_terminal_in_dual_mode ... ok
test tests::clickhouse_target_resolver_decrypts_current_schema_password ... ok
test tests::clickhouse_target_resolver_requires_current_schema ... ok
test tests::cloudflare_ndjson_payload_parses_multiple_records ... ok
test tests::cloudflare_log_record_maps_body_severity_and_attributes ... ok
test tests::clickhouse_target_resolver_rejects_password_over_http ... ok
test tests::cloudflare_timestamps_support_rfc3339_unix_and_unix_nano ... ok
test tests::decrypt_aes256_gcm_matches_node_crypto_fixture ... ok
test tests::enrichment_overwrites_tenant_fields ... ok
test tests::cloudflare_validation_payload_is_detected ... ok
test tests::extract_ingest_key_returns_sentinel_literal_unchanged ... ok
test tests::hash_is_deterministic ... ok
test tests::rejection_span_status_is_error_only_for_5xx ... ok
test tests::resolve_ingest_key_keeps_stale_schema_on_managed_native_path ... ok
test tests::resolve_connector_refreshes_routing_before_auth_cache_expires ... ok
test tests::resolve_ingest_key_returns_none_when_hash_missing ... ok
test tests::resolve_ingest_key_refreshes_routing_before_auth_cache_expires ... ok
test tests::resolve_ingest_key_returns_self_managed_false_when_no_settings_row ... ok
test tests::resolve_ingest_key_returns_self_managed_true_when_active_settings_row ... ok
test tests::sentinel_token_matches_only_exact_literal ... ok
test tests::tinybird_destination_keeps_forward_mode_on_forward_path ... ok
test autumn::tests::fails_open_on_transport_error ... ok
test tests::resolve_ingest_key_serves_last_known_routing_when_refresh_fails ... ok
test tests::forward_mode_switches_ready_org_to_clickhouse_without_forwarding_again ... ok

test result: ok. 38 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.23s

   Doc-tests maple_ingest

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

…tric Addresses review feedback: ingest_backpressure_shed_total used Debug format (PascalCase "Traces"/"SessionReplays") for the signal label, which cannot correlate with other per-signal metrics that use lowercase. Add TelemetrySignal::as_str() returning the canonical vocabulary ("traces"/"logs"/"metrics"/"session_replays") — matching signal.path() and the maple.signal span attribute — and use it at the shed site. Note "session_replays" (snake_case) is correct here, whereas Debug + to_lowercase would give the inconsistent "sessionreplays". Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The WAL-acked microbench (benches/ingest_bench.rs) constructs TinybirdConfig directly and was missed when the field was added — `cargo test` doesn't compile benches, so CI's `cargo bench` caught it. Set it to 5s to match the other test configs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

devin-ai-integration

Devin Review found 2 new potential issues.

devin-ai-integration · 2026-07-01T09:54:48Z

🚩 OTLP signal path labels all process_decoded_payload errors as error_kind="forward"

In handle_signal_inner (main.rs:2400-2401), errors from process_decoded_payload — including native pipeline rejections (throttle, backpressure, WAL I/O) — are all mapped to error_kind="forward". This means ingest_requests_total{error_kind="forward"} conflates actual collector forward failures with native pipeline rejections. The same pre-existing issue exists in the Cloudflare logpush path (main.rs:2593). The PR doesn't change this mapping, but the status code shift (backpressure 503→429) makes it more important to distinguish these in metrics since the error_kind label no longer correlates 1:1 with a status bucket.

(Refers to lines 2400-2401)

Was this helpful? React with 👍 or 👎 to provide feedback.

…on_replays Addresses review feedback: accept_rows_to() hardcoded TelemetrySignal:: SessionReplays for all row-based data, so session-events frames (and thus the new ingest_backpressure_shed_total signal label + the pre-existing export_batch_completed label) reported signal="session_replays" even though the span carries maple.signal="session_events". Add a SessionEvents variant (as_str="session_events", WAL tag 5) and thread a TelemetrySignal parameter through accept_rows_to/accept_rows so each handler passes its true signal (blob/meta -> SessionReplays, events -> SessionEvents). Adds as_str + signal_tag round-trip tests covering the new variant. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Makisuo had a problem deploying to pr-preview July 1, 2026 09:39 — with GitHub Actions Error

This comment was marked as resolved.

Sign in to view

Makisuo had a problem deploying to pr-preview July 1, 2026 09:46 — with GitHub Actions Error

Makisuo temporarily deployed to pr-preview July 1, 2026 09:48 — with GitHub Actions Inactive

devin-ai-integration Bot reviewed Jul 1, 2026

View reviewed changes

Makisuo temporarily deployed to pr-preview July 1, 2026 10:00 — with GitHub Actions Inactive

Makisuo merged commit 591bd45 into main Jul 1, 2026
8 checks passed

Makisuo deleted the fix/ingest-replay-backpressure-unknown-error branch July 1, 2026 10:12

Makisuo temporarily deployed to pr-preview July 1, 2026 10:12 — with GitHub Actions Inactive

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(ingest): reclassify replay backpressure 503s + label error spans#157

fix(ingest): reclassify replay backpressure 503s + label error spans#157
Makisuo merged 4 commits into
mainfrom
fix/ingest-replay-backpressure-unknown-error

Makisuo commented Jul 1, 2026 •

edited by devin-ai-integration Bot

Loading

Uh oh!

pullfrog Bot commented Jul 1, 2026 •

edited

Loading

Uh oh!

This comment was marked as resolved.

Uh oh!

github-actions Bot commented Jul 1, 2026 •

edited

Loading

Uh oh!

devin-ai-integration Bot left a comment

Uh oh!

devin-ai-integration Bot Jul 1, 2026

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

Makisuo commented Jul 1, 2026 • edited by devin-ai-integration Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What & why

Changes (all in apps/ingest/src/)

Reviewer notes

Tests

Uh oh!

pullfrog Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

This comment was marked as resolved.

Uh oh!

github-actions Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Ingest Rust Test + Benchmark Results

Load Benchmark — tinybird mode, median of 3 run(s) vs main

WAL-acked microbench (cargo bench --bench ingest_bench)

cargo test

Uh oh!

devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Makisuo commented Jul 1, 2026 •

edited by devin-ai-integration Bot

Loading

Changes (all in `apps/ingest/src/`)

pullfrog Bot commented Jul 1, 2026 •

edited

Loading

github-actions Bot commented Jul 1, 2026 •

edited

Loading

Load Benchmark — `tinybird` mode, median of 3 run(s) vs main

WAL-acked microbench (`cargo bench --bench ingest_bench`)