Skip to content

fix(ingest): reclassify replay backpressure 503s + label error spans#157

Merged
Makisuo merged 4 commits into
mainfrom
fix/ingest-replay-backpressure-unknown-error
Jul 1, 2026
Merged

fix(ingest): reclassify replay backpressure 503s + label error spans#157
Makisuo merged 4 commits into
mainfrom
fix/ingest-replay-backpressure-unknown-error

Conversation

@Makisuo

@Makisuo Makisuo commented Jul 1, 2026

Copy link
Copy Markdown
Collaborator

What & why

Production error fingerprint 2492876445615472305139 × "Unknown Error" (2026-06-30 → 07-01) — was a HTTP 503 unavailable on POST /v1/sessionReplays/blob in the ingest gateway, all for one BYO-ClickHouse org (destination=clickhouse).

Root cause (confirmed against code + exported ingest_* metrics):

  1. The org's BYO-ClickHouse target degraded; the per-org breaker tripped and shed export batches (ingest_clickhouse_export_dropped_total{status="circuit_open"} = 29 + 7,793 rows).
  2. ExportWorker::run is single-threaded per lane: it recv()s a batch then awaits the full retry budget before recv()ing again. Each failing attempt reused the shared HTTP client's 10s forward_timeout, so the initial breaker trip (failure_threshold=4) held the lane ~40s+, plus ~10s per 30s half-open probe.
  3. While the lane was held, new replay frames filled the lane's bounded channel; the accept path (commit_frames) shed them via try_reserve_owned() → PipelineError::Backpressure. That shed path emitted no metric.
  4. The replay handlers blanket-mapped every PipelineError to service_unavailable (503).
  5. The handler set otel.status_code="Error" but never otel.status_description, so StatusMessage was empty. Maple's error_events_mv maps empty StatusMessage → literal 'Unknown Error' and hashes all such spans into one fingerprint. (The error.type attribute the handler does set is not read by the label logic.)

Metrics ruled out the other shed variants for this org: org_throttled=0, org_queue_bytes peaked ~38 MB (« the 1 GiB cap), wal_shard_full{clickhouse}=0 — leaving Backpressure as the residual.

Changes (all in apps/ingest/src/)

  1. Reclassify transient backpressure as 429. New shared api_error_from_pipeline() used by both the OTLP signal path and the 3 session-replay handlers: Throttled/Backpressure429 (retryable; otel_status_for_rejectionOk, so it leaves the error view), QueueUnavailable/Encode503. Also fixes a latent bug where Throttled on a replay path wrongly returned 503.
  2. Label every ingest error span. Declare otel.status_description on each handler span and record error.message on the Err arm, so any remaining genuine 5xx (WAL failure, storage-not-configured, auth-resolver down, collector transport/5xx) surfaces with a real tag/message instead of "Unknown Error".
  3. ingest_backpressure_shed_total counter at the shed site (org_id/destination/signal) — the absence of this metric is what made the incident hard to diagnose.
  4. Drain fix at the source. Dedicated short ClickHouse export timeout (INGEST_CLICKHOUSE_EXPORT_TIMEOUT_MS, default 3s, was reusing the 10s forward_timeout) applied per-request, and lower breaker failure_threshold 4 → 2. Worst-case lane hold drops ~40s → ~6s, so the channel stays drained and sheds become rare.

Reviewer notes

  • Backpressure → 429 also changes the OTLP path (traces/logs/metrics), not just replays. Both 429 and 503 are retryable per the OTel spec; this keeps expected load-shedding out of the error dashboards.
  • Client body messages on the replay paths changed (dropped the "failed to enqueue …: {e}" prefix); the context is preserved in a server warn! log. No test asserts on those body strings.
  • Per-destination co-shard isolation is preserved — the drain fix only shortens how long a ClickHouse lane holds itself, and keeps the single recovery probe intact.

Tests

  • New: api_error_from_pipeline_maps_variants_to_status (429/503 mapping + otel_status_for_rejection), breaker_default_failure_threshold_is_two.
  • Existing clickhouse_breaker_sheds_after_threshold_failures / co-shard drain tests cover Part 4.
  • cargo test (72 pass, 0 fail), cargo check --tests, cargo clippy (only pre-existing warnings).

Not yet verified (needs deploy): point a BYO-CH org at an unreachable endpoint on staging → confirm 429 responses + no new "Unknown Error" issue for service=ingest, and that ingest_backpressure_shed_total appears. The old fingerprint should stop receiving new events once deployed.

🤖 Generated with Claude Code


Open in Devin Review

Session-replay blob POSTs to a degraded BYO-ClickHouse org surfaced as 139
"Unknown Error" 503s. Root cause: while the ClickHouse export lane stalled
(breaker circuit_open, ~7.8k rows dropped), the single-threaded export worker
held the lane through its retry budget, so new frames filled the lane's bounded
channel and the accept path shed them as PipelineError::Backpressure -> 503. The
handler set otel.status_code="Error" but never otel.status_description, so the
error MV (which reads StatusMessage, not the error.type attribute) mapped the
empty message to the literal "Unknown Error" and collapsed all sheds into one
fingerprint.

- Reclassify transient Throttled/Backpressure as 429 (retryable, classified Ok
  so it leaves the error view) via a shared api_error_from_pipeline() used by
  both the OTLP signal path and the 3 session-replay handlers; QueueUnavailable/
  Encode stay 503. Also fixes a latent bug where Throttled on a replay path
  wrongly returned 503.
- Record otel.status_description = error.message on every ingest error span so
  remaining genuine 5xx surface with a real tag/message, not "Unknown Error".
- Add ingest_backpressure_shed_total counter at the shed site (it emitted no
  metric, which is why this was hard to diagnose).
- Bound the lane hold at the source: dedicated short ClickHouse export timeout
  (INGEST_CLICKHOUSE_EXPORT_TIMEOUT_MS, default 3s, was reusing the 10s
  forward_timeout) and lower breaker failure_threshold 4 -> 2.

Note: Backpressure -> 429 also applies to the OTLP traces/logs/metrics path;
both 429 and 503 are retryable per the OTel spec.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@pullfrog

pullfrog Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Your LLM provider API key was rejected. Rotate the key in your provider dashboard, then update the matching GitHub Actions secret.

Update repo secret → · Model settings → · Setup docs → · Ask in Discord →

Pullfrog  | ⚠️ this action is pinned to a commit SHA, which freezes the cleanup step — switch to @v0 or keep the SHA fresh with Dependabot | Rerun failed job ➔View workflow run | via Pullfrog | Using Claude Opus𝕏

devin-ai-integration[bot]

This comment was marked as resolved.

@github-actions

github-actions Bot commented Jul 1, 2026

Copy link
Copy Markdown

Ingest Rust Test + Benchmark Results

Commit: ad3ebb895c39dfc6858b4321a681af235cef6db4

Load Benchmark — tinybird mode, median of 3 run(s) vs main

Metric main (median) PR (median) Delta
Requests/sec 1876.96 1727.45 -8.0% worse
Rows/sec 18769.59 17274.48 -8.0% worse
p50 latency 33.13 ms 36.13 ms +9.1% worse
p95 latency 41.46 ms 48.80 ms +17.7% worse
p99 latency 51.36 ms 65.94 ms +28.4% worse
Export catch-up 0.026 s 0.026 s +0.4% worse
Max RSS 104.52 MiB 103.36 MiB -1.1% better
Failures 0 0 same

Same code path on both sides (same LOAD_TEST_INGEST_MODE), so the delta column is meaningful. Numbers come from ubuntu-latest, which is noisy — treat single-digit-percent deltas as noise.

PR load benchmark JSON (per-iteration)
[
  {
    "ingest_mode": "tinybird",
    "requests": 2000,
    "successes": 2000,
    "failures": 0,
    "rows_sent": 20000,
    "rows_exported": 20000,
    "imports": 26,
    "duration_seconds": 1.269508861,
    "export_catchup_seconds": 0.026087098,
    "request_rps": 1575.4123987953794,
    "row_rps": 15754.123987953795,
    "p50_ms": 38.428,
    "p95_ms": 57.233,
    "p99_ms": 65.935,
    "max_rss_mb": 99.03125,
    "max_cpu_percent": 50.0,
    "avg_cpu_percent": 38.53333333333333
  },
  {
    "ingest_mode": "tinybird",
    "requests": 2000,
    "successes": 2000,
    "failures": 0,
    "rows_sent": 20000,
    "rows_exported": 20000,
    "imports": 26,
    "duration_seconds": 1.157777068,
    "export_catchup_seconds": 0.025640517,
    "request_rps": 1727.4482759059106,
    "row_rps": 17274.482759059105,
    "p50_ms": 34.951,
    "p95_ms": 48.798,
    "p99_ms": 67.379,
    "max_rss_mb": 103.359375,
    "max_cpu_percent": 56.4,
    "avg_cpu_percent": 47.73333333333333
  },
  {
    "ingest_mode": "tinybird",
    "requests": 2000,
    "successes": 2000,
    "failures": 0,
    "rows_sent": 20000,
    "rows_exported": 20000,
    "imports": 26,
    "duration_seconds": 1.15215423,
    "export_catchup_seconds": 0.026406878,
    "request_rps": 1735.8787113075998,
    "row_rps": 17358.787113075996,
    "p50_ms": 36.133,
    "p95_ms": 43.983,
    "p99_ms": 49.128,
    "max_rss_mb": 107.25,
    "max_cpu_percent": 57.1,
    "avg_cpu_percent": 43.06666666666666
  }
]
main load benchmark JSON (per-iteration)
[
  {
    "ingest_mode": "tinybird",
    "requests": 2000,
    "successes": 2000,
    "failures": 0,
    "rows_sent": 20000,
    "rows_exported": 20000,
    "imports": 25,
    "duration_seconds": 1.212915826,
    "export_catchup_seconds": 0.025989565,
    "request_rps": 1648.9190404874807,
    "row_rps": 16489.190404874807,
    "p50_ms": 36.852,
    "p95_ms": 49.731,
    "p99_ms": 58.084,
    "max_rss_mb": 104.51953125,
    "max_cpu_percent": 55.3,
    "avg_cpu_percent": 41.53333333333334
  },
  {
    "ingest_mode": "tinybird",
    "requests": 2000,
    "successes": 2000,
    "failures": 0,
    "rows_sent": 20000,
    "rows_exported": 20000,
    "imports": 26,
    "duration_seconds": 1.065553551,
    "export_catchup_seconds": 0.026375949,
    "request_rps": 1876.9586926185373,
    "row_rps": 18769.586926185373,
    "p50_ms": 33.133,
    "p95_ms": 39.612,
    "p99_ms": 40.575,
    "max_rss_mb": 102.4296875,
    "max_cpu_percent": 59.8,
    "avg_cpu_percent": 50.06666666666666
  },
  {
    "ingest_mode": "tinybird",
    "requests": 2000,
    "successes": 2000,
    "failures": 0,
    "rows_sent": 20000,
    "rows_exported": 20000,
    "imports": 24,
    "duration_seconds": 1.046649844,
    "export_catchup_seconds": 0.025305608,
    "request_rps": 1910.8587379677667,
    "row_rps": 19108.58737967767,
    "p50_ms": 31.689,
    "p95_ms": 41.462,
    "p99_ms": 51.36,
    "max_rss_mb": 107.59375,
    "max_cpu_percent": 60.7,
    "avg_cpu_percent": 45.4
  }
]

WAL-acked microbench (cargo bench --bench ingest_bench)

   Compiling maple-ingest v0.1.0 (/home/runner/work/maple/maple/apps/ingest)
    Finished `bench` profile [optimized] target(s) in 40.04s
     Running benches/ingest_bench.rs (target/release/deps/ingest_bench-581d2100de893627)
Gnuplot not found, using plotters backend
test ingest_accept/logs_10_rows_wal_ack ... bench:      578108 ns/iter (+/- 37921)
test ingest_accept/traces_10_spans_wal_ack ... bench:      601082 ns/iter (+/- 51719)

cargo test

test telemetry::tests::metrics_emit_exactly_the_jsonpaths_declared_in_datasources_ts ... ok
test telemetry::tests::metrics_summary_data_points_are_dropped ... ok
test telemetry::tests::migrate_legacy_shard_relocates_frames_into_lanes ... ok
test telemetry::tests::pipeline_can_start_for_clickhouse_only_without_tinybird_credentials ... ok
test telemetry::tests::clickhouse_export_drops_passworded_non_https_endpoint_without_sending ... ok
test telemetry::tests::pipeline_e2e_exports_gzip_ndjson_to_fake_tinybird ... ok
test telemetry::tests::pipeline_e2e_exports_metrics_to_fake_tinybird ... ok
test telemetry::tests::sampling_keeps_errors_even_when_ratio_low ... ok
test telemetry::tests::scraper_contract::scraper_otlp_json_decodes_with_gateway_serde_and_encodes_to_rows ... ok
test telemetry::tests::signal_tag_round_trips_all_variants ... ok
test telemetry::tests::pipeline_e2e_exports_traces_to_fake_tinybird ... ok
test telemetry::tests::telemetry_signal_as_str_is_canonical_lowercase ... ok
test telemetry::tests::timestamp_has_nano_precision ... ok
test telemetry::tests::timestamps_match_clickhouse_datetime64_nine_format ... ok
test telemetry::tests::trace_encoder_matches_tinybird_row_shape ... ok
test telemetry::tests::traces_emit_exactly_the_jsonpaths_declared_in_datasources_ts ... ok
test telemetry::tests::wal_partial_drain_advances_cursor_without_truncating ... ok
test telemetry::tests::wal_round_trips_frame ... ok
test telemetry::tests::wal_truncates_after_full_drain_allowing_further_appends ... ok
test telemetry::tests::pipeline_exports_ready_org_to_clickhouse_without_tinybird_calls ... ok
test telemetry::tests::slow_clickhouse_lane_does_not_block_cosharded_tinybird_org ... ok
test telemetry::tests::clickhouse_breaker_sheds_after_threshold_failures ... ok

test result: ok. 36 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.78s

     Running unittests src/bin/load_test.rs (target/debug/deps/load_test-661a0aa1eb3f6d6d)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running unittests src/main.rs (target/debug/deps/maple_ingest-c33bf80c577edb95)

running 38 tests
test autumn::tests::allowed_only_no_balance_field ... ok
test autumn::tests::flat_hardcap_with_remaining_allows ... ok
test autumn::tests::flat_hardcap_depleted_blocks ... ok
test autumn::tests::flat_unlimited_allows ... ok
test autumn::tests::flat_overage_allows ... ok
test autumn::tests::flat_sub_one_gb_remaining_still_allows ... ok
test autumn::tests::nested_balance_object_depleted_blocks ... ok
test autumn::tests::nested_balance_object_with_remaining_allows ... ok
test autumn::tests::nested_overage_allows ... ok
test autumn::tests::null_balance_no_subscription_blocks ... ok
test autumn::tests::unrecognized_shape_returns_none ... ok
test tests::api_error_from_pipeline_maps_variants_to_status ... ok
test tests::api_error_kind_maps_status_to_stable_label ... ok
test tests::clickhouse_destination_uses_native_pipeline_even_in_forward_mode ... ok
test tests::clickhouse_destination_is_terminal_in_dual_mode ... ok
test tests::clickhouse_target_resolver_decrypts_current_schema_password ... ok
test tests::clickhouse_target_resolver_requires_current_schema ... ok
test tests::cloudflare_ndjson_payload_parses_multiple_records ... ok
test tests::cloudflare_log_record_maps_body_severity_and_attributes ... ok
test tests::clickhouse_target_resolver_rejects_password_over_http ... ok
test tests::cloudflare_timestamps_support_rfc3339_unix_and_unix_nano ... ok
test tests::decrypt_aes256_gcm_matches_node_crypto_fixture ... ok
test tests::enrichment_overwrites_tenant_fields ... ok
test tests::cloudflare_validation_payload_is_detected ... ok
test tests::extract_ingest_key_returns_sentinel_literal_unchanged ... ok
test tests::hash_is_deterministic ... ok
test tests::rejection_span_status_is_error_only_for_5xx ... ok
test tests::resolve_ingest_key_keeps_stale_schema_on_managed_native_path ... ok
test tests::resolve_connector_refreshes_routing_before_auth_cache_expires ... ok
test tests::resolve_ingest_key_returns_none_when_hash_missing ... ok
test tests::resolve_ingest_key_refreshes_routing_before_auth_cache_expires ... ok
test tests::resolve_ingest_key_returns_self_managed_false_when_no_settings_row ... ok
test tests::resolve_ingest_key_returns_self_managed_true_when_active_settings_row ... ok
test tests::sentinel_token_matches_only_exact_literal ... ok
test tests::tinybird_destination_keeps_forward_mode_on_forward_path ... ok
test autumn::tests::fails_open_on_transport_error ... ok
test tests::resolve_ingest_key_serves_last_known_routing_when_refresh_fails ... ok
test tests::forward_mode_switches_ready_org_to_clickhouse_without_forwarding_again ... ok

test result: ok. 38 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.23s

   Doc-tests maple_ingest

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

…tric

Addresses review feedback: ingest_backpressure_shed_total used Debug format
(PascalCase "Traces"/"SessionReplays") for the signal label, which cannot
correlate with other per-signal metrics that use lowercase. Add
TelemetrySignal::as_str() returning the canonical vocabulary
("traces"/"logs"/"metrics"/"session_replays") — matching signal.path() and the
maple.signal span attribute — and use it at the shed site. Note "session_replays"
(snake_case) is correct here, whereas Debug + to_lowercase would give the
inconsistent "sessionreplays".

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The WAL-acked microbench (benches/ingest_bench.rs) constructs TinybirdConfig
directly and was missed when the field was added — `cargo test` doesn't compile
benches, so CI's `cargo bench` caught it. Set it to 5s to match the other test
configs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@devin-ai-integration devin-ai-integration Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 2 new potential issues.

Open in Devin Review

Comment thread apps/ingest/src/main.rs

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚩 OTLP signal path labels all process_decoded_payload errors as error_kind="forward"

In handle_signal_inner (main.rs:2400-2401), errors from process_decoded_payload — including native pipeline rejections (throttle, backpressure, WAL I/O) — are all mapped to error_kind="forward". This means ingest_requests_total{error_kind="forward"} conflates actual collector forward failures with native pipeline rejections. The same pre-existing issue exists in the Cloudflare logpush path (main.rs:2593). The PR doesn't change this mapping, but the status code shift (backpressure 503→429) makes it more important to distinguish these in metrics since the error_kind label no longer correlates 1:1 with a status bucket.

(Refers to lines 2400-2401)

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Comment thread apps/ingest/src/telemetry.rs
…on_replays

Addresses review feedback: accept_rows_to() hardcoded TelemetrySignal::
SessionReplays for all row-based data, so session-events frames (and thus the
new ingest_backpressure_shed_total signal label + the pre-existing
export_batch_completed label) reported signal="session_replays" even though the
span carries maple.signal="session_events".

Add a SessionEvents variant (as_str="session_events", WAL tag 5) and thread a
TelemetrySignal parameter through accept_rows_to/accept_rows so each handler
passes its true signal (blob/meta -> SessionReplays, events -> SessionEvents).
Adds as_str + signal_tag round-trip tests covering the new variant.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@Makisuo Makisuo merged commit 591bd45 into main Jul 1, 2026
8 checks passed
@Makisuo Makisuo deleted the fix/ingest-replay-backpressure-unknown-error branch July 1, 2026 10:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant