feat(encryption): Stage 6C-1 — §9.1 startup-refusal guards (flag + sidecar state)#778
feat(encryption): Stage 6C-1 — §9.1 startup-refusal guards (flag + sidecar state)#778bootjp wants to merge 3 commits into
Conversation
… + sidecar state) Stage 6C-1 per the PR #762 plan, decomposed in this PR's design-doc update from the original monolithic Stage 6C into four sub-milestones (6C-1 through 6C-4) per dependency footprint: - 6C-1 (this PR): flag + sidecar-state guards - 6C-2: raftengine-integration guards (sidecar/raft index gap, fsync probe, local_epoch exhaustion) - 6C-3: membership-view guards (node_id collision, local_epoch rollback) — bundles with Stage 6D capability fan-out - 6C-4: Phase-2-specific guards (cutover divergence, enable-raft-envelope-before-bootstrap, RPC local_epoch range) — bundles with Stage 6E Phase-2 cutover ## What this PR ships ### internal/encryption/startup.go (new) CheckStartupGuards(StartupConfig) reports the first §9.1 refusal condition that fires: - ErrSidecarPresentWithoutFlag — data dir contains a sidecar (keys.json) but --encryption-enabled is NOT set. Continuing would silently downgrade new writes to cleartext while leaving the prior wrapped DEKs untouched on disk, inverting operator intent. Refuse to start rather than half-honor a prior bootstrap. - ErrKEKRequiredWithFlag — --encryption-enabled set but no --kekFile supplied. The Stage 6B-2 triple gate keeps mutator RPCs unreachable in this posture, but a flag-on / KEK-off node is misconfigured at the operator-intent level. Fail fast at startup rather than discover the mismatch via a halted apply loop later. - ErrKEKMismatch — flag on, KEK loaded, sidecar present, at least one wrapped DEK fails to unwrap under the configured KEK. The classic operator error: --kekFile points at a KEK from a different cluster / environment. Annotates the offending key_id and purpose so the operator's log line points at the right DEK for root cause. The helper reads the sidecar at most once and is safe to call before any other encryption package state is constructed; it does NOT mutate on-disk state. ### internal/encryption/errors.go Three new sentinels: - ErrSidecarPresentWithoutFlag - ErrKEKRequiredWithFlag - ErrKEKMismatch All carry a self-contained error message that names the relevant flag and the recovery path (consistent with §9.1's "Each refusal logs a single, unambiguous error pointing at the relevant flag and runbook section"). ### main.go loadKEKAndRunStartupGuards() pairs the KEK load with CheckStartupGuards in a single helper so run() stays under the cyclop budget. The guards run BEFORE buildShardGroups constructs any Raft engine or storage state, so a misconfigured node refuses to boot rather than halting mid-flight or — worse — silently downgrading encrypted data to cleartext. ### Stage 6C sub-decomposition rationale (design doc) The original monolithic Stage 6C bundled guards with three different dependency footprints (flags+sidecar, raftengine integration, cluster-wide membership view) plus Phase-2-specific guards. Shipping the load-bearing downgrade-prevention guard (6C-1) alone would have been blocked on Stage 7's writer-registry work (the local_epoch rollback peer of ErrLocalEpochExhausted), which is several PRs out. The sub-decomposition lets 6C-1 ship now and unblocks Stage 6D once 6C-1 + 6C-2 have landed. ## Caller audit (semantic changes — signatures) No public function signatures changed. The new CheckStartupGuards / StartupConfig types are net-additive. loadKEKAndRunStartupGuards is new; loadKEKWrapperFromFlag is unchanged and still callable on its own (the helper calls it internally). ## Stage 6B-2 RPC gate posture preserved The Stage 6B-2 triple gate (encryptionMutatorsEnabled) remains the RPC-boundary safety net. This PR adds a process-boundary safety net on top of it. Both layers exist by design: - The RPC gate keeps unreachable mutator paths unreachable when the operator has chosen NOT to participate in the rollout. - The startup gate keeps misconfigured nodes from booting at all when the operator HAS opted in but the on-disk state or KEK config does not match. A flag-on / KEK-off node: - pre-this-PR: would boot, mutator RPCs refused at the RPC gate - post-this-PR: refuses to boot (ErrKEKRequiredWithFlag) A flag-off / sidecar-present node: - pre-this-PR: would boot, silently downgrading new writes to cleartext - post-this-PR: refuses to boot (ErrSidecarPresentWithoutFlag) ## Five-lens self-review 1. Data loss — the guards are net-positive for data safety: they refuse to boot in postures that would otherwise silently downgrade encrypted data to cleartext or land a HaltApply. No write path touched. 2. Concurrency / distributed failures — the guards run at process startup before any goroutine fan-out, before the Raft engine starts. No new shared state; ReadSidecar is side-effect-free. 3. Performance — single os.Stat + one ReadSidecar (only when --encryption-enabled is set AND a sidecar exists on disk) per process start. Zero hot-path impact. 4. Data consistency — ErrKEKMismatch catches the operator error where the wrong KEK would route writes to a DEK that the prior wrapped material cannot be decrypted from. This is the most severe data-consistency footgun in the §9.1 guard set; catching it at startup prevents the run-and-corrupt pattern. 5. Test coverage — 7 sub-cases in TestCheckStartupGuards_* cover all three sentinels in both fire and pass states, plus the I/O error propagation path and the per-key error annotation (purpose=raft) that runbooks grep for. ## Verification - go test -race -timeout=120s ./internal/encryption/... . — PASS - go build ./... — PASS - golangci-lint run . ./internal/encryption/... — 0 issues - Full Jepsen suite — not run (no envelope cutover happens until Stage 6D; this PR only adds startup refusal logic that runs before any encryption-relevant Raft entry) ## Plan Next Stage 6C sub-milestones per the design-doc update in this PR: - 6C-2: raftengine integration guards (ErrSidecarBehindRaftLog, ErrUnsupportedFilesystem startup probe, ErrLocalEpochExhausted) - 6C-3: membership-view guards (ErrNodeIDCollision, ErrLocalEpochRollback) — bundles with Stage 6D - 6C-4: Phase-2-specific guards — bundles with Stage 6E
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughThis PR implements startup validation guards for encryption configuration. It adds error contracts, guard orchestration logic with three sequential checks (sidecar-without-flag, KEK-required, KEK-sidecar-mismatch), comprehensive test coverage across allowed and disallowed startup modes, and integrates the guard into main.go process startup before shard group construction. ChangesEncryption Startup Guard Policy
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@claude review Stage 6C-1 opening review. Scope: three §9.1 startup-refusal guards ( Key safety properties to validate:
The sub-decomposition is also up for review (design doc Stage 6 rationale). |
|
Claude finished @bootjp's task in 3m 20s —— View job Stage 6C-1 Review
SummaryThe three §9.1 guards are correctly implemented and the integration in [BUG] Duplicate 6D row in the design doc
The second row is a stale copy-paste artifact. Fix this → [MEDIUM] Non-deterministic error annotation when multiple DEKs fail
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/design/2026_04_29_partial_data_at_rest_encryption.md`:
- Around line 27-28: Duplicate Stage 6D rows (both referencing
`enable-storage-envelope`, §7.1 Phase-1 storage cutover and the Voters ∪
Learners capability gate) conflict because one lists dependencies as `6C-1+6C-2`
and the other as `6C`; remove the outdated/duplicate row that uses `6C-1+6C-2`
and keep the updated row that uses `6C` so only a single Stage 6D entry remains
(preserving the `enable-storage-envelope`, §7.1, and (§6.2 toggle ON) wording).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e0b79fa8-307c-4a8d-a260-1acd1b9bb35e
📒 Files selected for processing (5)
docs/design/2026_04_29_partial_data_at_rest_encryption.mdinternal/encryption/errors.gointernal/encryption/startup.gointernal/encryption/startup_test.gomain.go
There was a problem hiding this comment.
Code Review
This pull request implements the first phase of the encryption startup refusal guards (Stage 6C-1) as outlined in the updated design documentation. It introduces several safety checks to prevent silent downgrades to cleartext, ensure a KEK is provided when encryption is enabled, and verify that the configured KEK can successfully unwrap existing Data Encryption Keys (DEKs). These guards are integrated into the main startup sequence to fail fast before the Raft engine or storage state is initialized. Feedback suggests improving error reporting when parsing key IDs in the sidecar to provide better context during troubleshooting.
| id, parseErr := strconv.ParseUint(idStr, 10, 32) | ||
| if parseErr != nil { | ||
| // validateSidecar would have caught a non-decimal | ||
| // key_id, but be defensive in case ReadSidecar's | ||
| // invariant changes. | ||
| return pkgerrors.Wrapf(ErrKEKMismatch, | ||
| "sidecar=%q key_id=%q: %v", cfg.SidecarPath, idStr, err) | ||
| } |
There was a problem hiding this comment.
When strconv.ParseUint fails, the error returned should ideally include the parsing error (parseErr) to provide more context about why the key_id could not be processed, rather than just the Unwrap error (err). While validateSidecar is expected to catch malformed IDs, being explicit here aids in troubleshooting manual sidecar edits.
id, parseErr := strconv.ParseUint(idStr, 10, 32)
if parseErr != nil {
return pkgerrors.Wrapf(ErrKEKMismatch,
"sidecar=%q key_id=%q (parse error: %v): %v", cfg.SidecarPath, idStr, parseErr, err)
}…icate Stage 6D row)
Claude bot flagged "Required before merge" and coderabbit
flagged Minor on the same finding: the milestone table in
docs/design/2026_04_29_partial_data_at_rest_encryption.md
had two Stage 6D rows with conflicting dependency wording.
- Line 27 (kept): "depends on 6B AND 6C-1+6C-2 for §9.1
guards; bundles 6C-3 for membership-view collision check"
— the updated wording reflecting the 6C-1..6C-4 split.
- Line 28 (removed): "depends on 6B AND 6C for §9.1 guards"
— the older monolithic-6C wording, stale after this PR's
sub-decomposition.
The duplicate would cause rollout-order ambiguity for any
reader using the table as a checklist (Claude's "concern"
in the design-doc review). Deleting the stale row resolves
both findings; the surviving row is the correct one and the
narrative in the "Stage 6 sub-milestone rationale" section
matches it.
Docs-only change. No code paths touched, no caller audit
needed.
|
@claude review Round-1 (commit Finding (claude required-before-merge + coderabbit minor)
Resolution: deleted line 28 ( Docs-only change. Caller audit not applicable. Non-blocking items deferredThe following Claude r1 "Recommended (non-blocking)" items remain open:
Per the loop's P0/P1/high/critical/major scope these are below threshold. Happy to address in a follow-up if reviewers prefer them in this PR. |
|
Claude finished @bootjp's task in 1m 37s —— View job Stage 6C-1 Review — Round 2
Required fix (r1) — confirmed resolved ✓Commit The old "depends on 6B AND 6C" row (monolithic-6C wording) is gone. No structural issues with the rationale section or the sub-decomposition prose. ✓ Deferred non-blocking items — current stateAll four items are still open per bootjp's note above. Status:
The deferral reasoning holds. The Gemini's
|
… test polish PR #778 round-1 code-side fixes (the docs-side fix for the duplicate Stage 6D row was already addressed in f69f13f). ## claude r1 [MEDIUM] — Non-deterministic ErrKEKMismatch annotation guardKEKMatchesSidecar iterated sc.Keys (a Go map) without sorting, so when multiple wrapped DEKs failed to unwrap the reported key_id / purpose in the ErrKEKMismatch annotation was non-deterministic across restarts. The classic operator scenario (wrong --kekFile entirely) makes EVERY wrapped DEK fail, so the first failure picked by map iteration was effectively random. Fix: sortedSidecarKeyIDs() returns the keys in ascending numeric key_id order. Iteration now always picks the smallest key_id that fails, which gives reproducible runbook log correlation across restarts. The Stage 5.1 sidecar invariant (validateSidecar requires decimal-uint32 keys) lets us sort by ParseUint; a malformed entry is sorted to the end via the math.MaxUint32 sentinel so the iteration order stays defined even if invariant drift lets a bad entry through. TestCheckStartupGuards_KEKMismatch_DeterministicAnnotation pins the regression: a sidecar with storage=1 + raft=2 both wrapped under the wrong KEK MUST always annotate key_id=1 / purpose=storage, verified across 10 iterations to catch the randomized map order. ## claude r1 [Recommended] — Custom contains helper Replaced custom `contains` with `strings.Contains`. The original "avoid cluttering the import diff" reasoning was wrong-headed: strings is stdlib and the substring assertion is the only place in the file that needs anything from it. ## claude r1 [Recommended] — newTestKEK filename encoding Replaced `string(rune(seed))` with `fmt.Sprintf("kek-%02x.bin", seed)`. The original encoding produced 2-byte UTF-8 path components for non-ASCII seeds (0xAA → "kek-\xc2\xaa.bin"); the hex encoding keeps the on-disk test artifact names predictable and readable. ## claude r1 [Recommended] — Empty-Wrapped early-pass test TestCheckStartupGuards_EmptyWrappedSkipped locks down the godoc claim that the guard "Returns nil ... when ... the sidecar exists but has no wrapped DEKs". The test writes a sidecar with two SidecarKey entries that BOTH have empty Wrapped bytes (one nil, one empty slice for variation) and asserts the guard returns nil instead of calling KEK.Unwrap on zero bytes (which would fire a misleading ErrKEKMismatch). ## Caller audit (semantic changes) sortedSidecarKeyIDs is new, pure, no external callers yet. guardKEKMatchesSidecar internal signature unchanged. The ErrKEKMismatch error annotation gains determinism but the overall fail-closed semantic is unchanged — callers that only inspect errors.Is(err, ErrKEKMismatch) see no change in behaviour. ## Verification - go test -race -timeout=120s ./internal/encryption/... — PASS - go build ./... — PASS - golangci-lint run ./internal/encryption/... — 0 issues
|
@claude review Round-2 (commit claude r1 [MEDIUM] addressed
claude r1 [Recommended] addressed
Caller audit
|
|
Claude finished @bootjp's task in 1m 8s —— View job Stage 6C-1 Review — Round 3
r1 MEDIUM — confirmed resolved ✓
r1 Recommended — confirmed resolved ✓
Gemini
|
Summary
Stage 6C-1 per the PR #762 plan. Decomposes the original monolithic Stage 6C into four sub-milestones (6C-1 → 6C-4) per dependency footprint, then ships 6C-1: the flag + sidecar-state §9.1 startup-refusal guards.
The Stage 6B-2
encryptionMutatorsEnabledtriple gate is the RPC-boundary safety net; this PR adds the process-boundary safety net on top.Stage 6C sub-decomposition (design doc update)
The original Stage 6C bundled guards with three different dependency footprints. Shipping the load-bearing downgrade-prevention guard (
ErrSidecarPresentWithoutFlag) alone would have been blocked on Stage 7's writer-registry work (thelocal_epochrollback peer). The sub-decomposition lets 6C-1 ship now and unblocks Stage 6D once 6C-1 + 6C-2 land:local_epochexhaustion)node_idcollision,local_epochrollback) — bundles with Stage 6DWhat this PR ships
Three new §9.1 startup-refusal sentinels
ErrSidecarPresentWithoutFlag--encryption-enabledoff — would silently downgrade new writes to cleartextErrKEKRequiredWithFlag--encryption-enabledset but no--kekFile— flag-on / KEK-off is misconfigured at the operator-intent levelErrKEKMismatchErrKEKMismatchannotates the offendingkey_idandpurposeso runbooks can greppurpose="raft"/purpose="storage"to identify the failing DEK.internal/encryption/startup.go(new)Reads the sidecar at most once and is safe to call before any other encryption package state is constructed. Does NOT mutate on-disk state.
main.goloadKEKAndRunStartupGuards()pairs the KEK load withCheckStartupGuardsin a single helper (cyclop budget). Runs BEFOREbuildShardGroups.Posture matrix (before / after this PR)
ErrSidecarPresentWithoutFlagErrKEKRequiredWithFlagErrKEKMismatchCaller audit (semantic changes)
No public function signatures changed.
CheckStartupGuards/StartupConfigare net-additive.loadKEKAndRunStartupGuardsis new and pairs withloadKEKWrapperFromFlag(which stays callable on its own).Five-lens self-review
ReadSidecaris side-effect-free.os.Stat+ oneReadSidecar(only when both--encryption-enabledand a sidecar exist) per process start. Zero hot-path impact.ErrKEKMismatchcatches the most severe operator-error footgun in the §9.1 set: wrong KEK would route writes to a DEK the prior wrapped material cannot decrypt.TestCheckStartupGuards_*cover all three sentinels in fire + pass states, I/O error propagation, and the per-keypurpose=rafterror annotation that runbooks grep for.Test plan
go test -race -timeout=120s ./internal/encryption/... .— PASSgo build ./...— PASSgolangci-lint run . ./internal/encryption/...— 0 issuesPlan
Next Stage 6C sub-milestones per the design-doc update in this PR:
ErrSidecarBehindRaftLog,ErrUnsupportedFilesystemstartup probe,ErrLocalEpochExhausted)ErrNodeIDCollision,ErrLocalEpochRollback) — bundles with Stage 6DSummary by CodeRabbit
New Features
Documentation
Tests