feat(cubestore/raft): hardening - propose timeout + parse-fail drop + disk runbook #2
Merged
…CI to ha-main

Pin debian:bookworm-slim and cubejs/rust-builder:bookworm-llvm-18 by sha256 digest so that an upstream tag rewrite can't silently change what we build. Add scripts/pin-base-images.sh as a tooled refresh path, so an intentional roll-forward becomes a reviewable diff.

Also fire the Rust master workflow on ha-main pushes so the HA fork catches Cargo.lock drift on every merge instead of only at the next upstream sync.
…ail drop, disk-full runbook

PR-S2 of the production hardening series.

State machine:
- propose() is now bounded by CUBESTORE_RAFT_PROPOSE_TIMEOUT_SECS (default 30s; 0 disables). Previously, a stalled apply path, or a partition in which the local replica is no longer actually the leader, blocked the caller (typically an HTTP handler) forever. On timeout, propose() returns a CubeError so the caller can return 503, retry against the current leader, or fail the request.

Transport:
- The recv loop's per-frame protobuf decode used to `continue` forever on malformed input, so a peer flooding garbage frames could pin the task indefinitely. The new MAX_CONSECUTIVE_PARSE_FAILS=16 trips the loop into Err on the 17th consecutive failure, which closes the TCP connection and lets the next message trigger a reconnect from a clean state. The counter resets on a successful frame, so transient corruption (one bad packet, no flood) is still tolerated.

Docs:
- New docs/ha/RAFT-DISK-FULL-RUNBOOK.md covers the three drive_ready panic sites. The panics are correct by design (panic > silent commit loss), but operators need a clear recovery script. The runbook includes PVC sizing math, single-pod recovery, all-routers-wedged recovery, and a symptom guide for slow fsync (as opposed to hard failure).

Tests: full raft:: suite still 101/101 green.
Build: protobuf@21 toolchain on macOS, debian:bookworm-slim on CI.
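
The consecutive-failure counter described under "Transport" can be illustrated with a minimal, std-only sketch. This is not the code from the PR: `decode`, `recv_loop`, and `RecvError` are hypothetical names, the decoder is a stand-in that treats an empty frame as malformed, and the `>` comparison is an assumption chosen to match "Err on the 17th consecutive failure".

```rust
/// Threshold named in the commit message; the surrounding code is illustrative.
const MAX_CONSECUTIVE_PARSE_FAILS: u32 = 16;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum RecvError {
    TooManyParseFailures(u32),
}

/// Stand-in for the protobuf decode: an empty frame counts as malformed.
fn decode(frame: &[u8]) -> Result<Vec<u8>, ()> {
    if frame.is_empty() {
        Err(())
    } else {
        Ok(frame.to_vec())
    }
}

/// Consumes frames and returns how many decoded successfully, or an error
/// once more than MAX_CONSECUTIVE_PARSE_FAILS frames fail in a row.
/// Returning Err is where a real transport would close the connection.
fn recv_loop<'a, I>(frames: I) -> Result<usize, RecvError>
where
    I: IntoIterator<Item = &'a [u8]>,
{
    let mut consecutive_fails = 0u32;
    let mut decoded = 0usize;
    for frame in frames {
        match decode(frame) {
            Ok(_msg) => {
                // A good frame resets the counter, so isolated corruption is tolerated.
                consecutive_fails = 0;
                decoded += 1;
            }
            Err(()) => {
                consecutive_fails += 1;
                if consecutive_fails > MAX_CONSECUTIVE_PARSE_FAILS {
                    return Err(RecvError::TooManyParseFailures(consecutive_fails));
                }
            }
        }
    }
    Ok(decoded)
}
```

Under these assumptions, 16 bad frames followed by one good frame and 16 more bad frames never trip the limit, while 17 bad frames in a row do.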
PR-S2. propose() bounded by CUBESTORE_RAFT_PROPOSE_TIMEOUT_SECS (default 30s); transport recv loop drops connection after 16 consecutive protobuf decode failures; new docs/ha/RAFT-DISK-FULL-RUNBOOK.md. All 101 raft:: tests green.
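
The propose() bound summarised above can be sketched as follows. This is a std-only illustration under stated assumptions, not the PR's implementation (which presumably lives in async code): `timeout_from_setting`, `propose_bounded`, and `ProposeError` are hypothetical names, and falling back to 30 seconds when the setting is missing or unparsable is an assumption. Only the default of 30s and "0 disables" come from the description.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical error type; the PR itself returns a CubeError.
#[derive(Debug, PartialEq)]
enum ProposeError {
    Timeout(Duration),
    Dropped,
}

/// Interprets the raw value of a setting such as CUBESTORE_RAFT_PROPOSE_TIMEOUT_SECS.
/// Assumption: a missing or unparsable value falls back to the 30s default.
/// A value of 0 disables the bound, represented here as None.
fn timeout_from_setting(raw: Option<&str>) -> Option<Duration> {
    let secs = raw.and_then(|s| s.trim().parse::<u64>().ok()).unwrap_or(30);
    if secs == 0 {
        None
    } else {
        Some(Duration::from_secs(secs))
    }
}

/// Runs `apply` on a worker thread and waits for its result for at most `timeout`.
/// With `timeout == None` the wait is unbounded, matching the "0 disables" case.
fn propose_bounded<T, F>(apply: F, timeout: Option<Duration>) -> Result<T, ProposeError>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone after a timeout; ignore the send error.
        let _ = tx.send(apply());
    });
    match timeout {
        None => rx.recv().map_err(|_| ProposeError::Dropped),
        Some(limit) => rx.recv_timeout(limit).map_err(|e| match e {
            mpsc::RecvTimeoutError::Timeout => ProposeError::Timeout(limit),
            mpsc::RecvTimeoutError::Disconnected => ProposeError::Dropped,
        }),
    }
}
```

Note that in this sketch a timeout only unblocks the caller; the worker keeps running, so the proposed entry may still be applied later. A caller that retries after a timeout should therefore treat the outcome of the first attempt as unknown.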