feat(cubestore/raft): hardening — propose timeout + parse-fail drop + disk runbook#2

Merged
agriev merged 2 commits into ha-main from prod-hardening-s2 on May 7, 2026

agriev commented on May 7, 2026

PR-S2. propose() is now bounded by CUBESTORE_RAFT_PROPOSE_TIMEOUT_SECS (default 30s; 0 disables); the transport recv loop drops the connection once more than 16 consecutive protobuf decode failures occur; new docs/ha/RAFT-DISK-FULL-RUNBOOK.md. All 101 raft:: tests green.

agriev added 2 commits May 7, 2026 22:19
…CI to ha-main

Pin debian:bookworm-slim and cubejs/rust-builder:bookworm-llvm-18 by
sha256 digest so a tag rewrite upstream can't silently change what we
build. Add scripts/pin-base-images.sh as a tooled refresh path —
intentional roll-forward becomes a reviewable diff.

Also fire the Rust master workflow on ha-main pushes so the HA fork
catches Cargo.lock drift on every merge instead of only at the next
upstream sync.
…ail drop, disk-full runbook

PR-S2 of the production hardening series.

State machine:
- propose() now bounded by CUBESTORE_RAFT_PROPOSE_TIMEOUT_SECS
  (default 30s; 0 disables). A stalled apply path or a partition where
  the local replica isn't actually leader anymore used to block the
  caller — typically an HTTP handler — forever. On timeout the propose
  returns a CubeError so the caller can return 503, retry against the
  current leader, or fail the request.
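
A minimal sketch of the bounded wait described above, using only the standard library. It is not the code from this PR: `ProposeError` is a hypothetical stand-in for `CubeError`, `parse_propose_timeout` and `wait_for_apply` are illustrative names, and a blocking `mpsc` channel stands in for whatever async primitive the real apply path uses. Falling back to 30s on an unparsable value is also an assumption.

```rust
use std::env;
use std::sync::mpsc;
use std::time::Duration;

/// Hypothetical stand-in for the CubeError returned on timeout.
#[derive(Debug, PartialEq)]
pub enum ProposeError {
    /// The apply path did not acknowledge within the limit.
    Timeout(Duration),
    /// The apply side went away before acknowledging.
    Dropped,
}

/// Interpret the raw env value: missing or unparsable falls back to 30s
/// (the fallback on bad input is an assumption); 0 disables the bound.
pub fn parse_propose_timeout(raw: Option<&str>) -> Option<Duration> {
    let secs = raw
        .and_then(|s| s.trim().parse::<u64>().ok())
        .unwrap_or(30);
    if secs == 0 {
        None
    } else {
        Some(Duration::from_secs(secs))
    }
}

/// Read the bound from CUBESTORE_RAFT_PROPOSE_TIMEOUT_SECS.
pub fn propose_timeout_from_env() -> Option<Duration> {
    parse_propose_timeout(
        env::var("CUBESTORE_RAFT_PROPOSE_TIMEOUT_SECS").ok().as_deref(),
    )
}

/// Wait for the apply path to acknowledge a proposal.
/// With `None` this blocks indefinitely, matching the old behavior.
pub fn wait_for_apply<T>(
    rx: mpsc::Receiver<T>,
    timeout: Option<Duration>,
) -> Result<T, ProposeError> {
    match timeout {
        None => rx.recv().map_err(|_| ProposeError::Dropped),
        Some(limit) => rx.recv_timeout(limit).map_err(|e| match e {
            mpsc::RecvTimeoutError::Timeout => ProposeError::Timeout(limit),
            mpsc::RecvTimeoutError::Disconnected => ProposeError::Dropped,
        }),
    }
}
```

A caller such as an HTTP handler could map `ProposeError::Timeout` to a 503 or retry against the current leader, as the commit message suggests.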

Transport:
- The recv loop's per-frame protobuf decode used to `continue` forever
  on malformed input. A peer flooding garbage frames could pin the
  task indefinitely. New MAX_CONSECUTIVE_PARSE_FAILS=16 trips the
  loop into Err on the 17th consecutive failure, which closes the
  TCP connection and lets the next message reconnect from a clean
  state. Counter resets on a successful frame so transient
  corruption (one bad packet, no flood) is still tolerated.
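
The counter logic above can be sketched as follows. This is an illustration under stated assumptions, not the transport code from the PR: only the constant name `MAX_CONSECUTIVE_PARSE_FAILS` and its value come from the commit message, while `recv_loop`, `TooManyParseFailures`, and the iterator of pre-decoded frames are hypothetical stand-ins for the real socket read and protobuf decode.

```rust
/// Name and value taken from the commit message.
pub const MAX_CONSECUTIVE_PARSE_FAILS: u32 = 16;

/// Hypothetical error type; returning it is what would close the connection.
#[derive(Debug, PartialEq)]
pub struct TooManyParseFailures(pub u32);

/// Process decoded frames. Each item is the result of one decode attempt:
/// Ok(message) or Err(decode error).
pub fn recv_loop<I, T, E>(
    frames: I,
    mut on_msg: impl FnMut(T),
) -> Result<(), TooManyParseFailures>
where
    I: IntoIterator<Item = Result<T, E>>,
{
    let mut consecutive_fails = 0u32;
    for frame in frames {
        match frame {
            Ok(msg) => {
                // A good frame resets the counter, so isolated
                // corruption is tolerated.
                consecutive_fails = 0;
                on_msg(msg);
            }
            Err(_) => {
                consecutive_fails += 1;
                // The 17th consecutive failure exceeds the limit of 16.
                if consecutive_fails > MAX_CONSECUTIVE_PARSE_FAILS {
                    return Err(TooManyParseFailures(consecutive_fails));
                }
                // Below the limit: skip the frame and keep reading.
            }
        }
    }
    Ok(())
}
```

With this shape, 16 bad frames followed by one good frame and 16 more bad frames still return `Ok(())`, while 17 bad frames in a row return the error.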

Docs:
- New docs/ha/RAFT-DISK-FULL-RUNBOOK.md covers the three drive_ready
  panic sites — they're correct by design (panic > silent commit
  loss) but operators need a clear recovery script. Includes PVC
  sizing math, single-pod recovery, all-routers wedged recovery, and
  the slow-fsync (vs hard-fail) symptom guide.

Tests: full raft:: suite still 101/101 green.
Build: protobuf@21 toolchain on macOS, debian:bookworm-slim on CI.
agriev merged commit 406fef6 into ha-main on May 7, 2026
26 of 28 checks passed