docs(encryption): Stage 6 plan — decompose into 6A-6F + record 5D shipped#762
Conversation
…-6F) Stage 5D (PR #760) merged at 2026-05-16T16:35:03Z. Stage 5E (§5.6 step 1a capability fan-out helper) is marked deferred — its scope is narrowly the auto-batch CLI convenience and does not gate any Stage 6 work. The existing CLI accepts the writer batch explicitly today (PR #759), which is sufficient for hand-authored bootstraps. ## Stage 6 decomposition Stage 6 was originally one row in the milestone table ("3-phase rollout flags + applier wiring", §6.5 + §7.1). The chain of dependencies makes it impractical to land in a single PR. This commit decomposes it into 6A through 6F following the same shipping-order principle as Stage 5 (5A → 5D shipped, 5E deferred): each milestone is independently reviewable, leaves the cluster in a production-safe state, and unblocks the next. - 6A — concrete EncryptionApplier (writer-registry insert + sidecar mutate + keystore update) + FSM WithEncryption wiring + re-enable Proposer/LeaderView in registerEncryptionAdminServer (gated on --encryption-enabled). The load-bearing safety boundary: PR #760 r3 codex P1 established that the mutator wiring is unsafe to land before the applier exists. - 6B — KEK plumbing: --kekFile / --kekUri flags + internal/encryption/kek selection at startup + applier KEK-unwrap path. - 6C — --encryption-enabled + §9.1 startup refusal guards (sidecar present without flag, KEK mismatch, sidecar/raft index gap, fsync support, local_epoch rollback / exhaustion, node_id collision). - 6D — Phase-1 cutover: enable-storage-envelope admin RPC + §7.1 storage cutover + Voters ∪ Learners capability gate. - 6E — Phase-2 cutover: enable-raft-envelope admin RPC + §7.1 raft cutover + raft_envelope_cutover_index sidecar record + raftengine/etcd applyNormalEntry unwrap hook + ErrRaftUnwrapFailed HaltApply path. - 6F — optional --encryption-rotate-on-startup ergonomics flag. The narrative section after the milestone table explains the rationale for each sub-milestone — why 6A is load-bearing, why 6D and 6E must be separate cutovers (intermediate Phase-1-only state is a valid operator dwell), and why 6F is deferred to last. ## Doc-only PR No code changes. Reviewable on its own. Each of 6A–6F will follow as its own PR with its own design constraints + implementation + tests + Jepsen-relevant suites.
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughThe design document for "Data-at-rest encryption for elastickv" was updated to reflect current progress: stages 0–5 are now shipped with stage 5E deferred, and the milestone table was extended with detailed entries for stage 5D, 5E, and stage 6 sub-milestones 6A–6F. The "Stage 6 sub-milestone rationale" section was expanded to explain the dependency-ordered plan and per-sub-milestone scope. ChangesData-at-rest encryption design progress
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~3 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@claude review Stage 6 planning PR. Doc-only — decomposes the original Stage 6 row into 6A–6F sub-milestones and records 5D shipped. Each of 6A–6F is independently reviewable; 6A is the load-bearing safety boundary (lands the §6.3 applier whose absence was PR #760 r3 codex P1). Looking for review on:
No code paths touched. |
|
Claude finished @bootjp's task in 3m 52s —— View job Stage 6 planning doc review
Q1 — Sub-milestone scope coherenceThe decomposition is solid overall, but three rows have gaps worth tightening before 6A lands as the next PR. 6A: applier + flag, but no KEK unwrapper yet 6A wires The row should specify whether the 6A applier returns a typed error (e.g., 6C: flag description implies the flag is new The 6C row reads Suggested rewording: 6E: two load-bearing pieces from §6.3 and §7.1 are not listed The 6E row covers the
Q2 — Dependency orderingThe 6A → 6B → 6C → 6D → 6E → 6F chain is correct for the primary dependencies. One cross-cut worth documenting explicitly in the rationale: 6B → 6D is load-bearing, not just transitive The current rationale explains 6D as "the actual cutover" and notes it follows 6C, but doesn't call out why 6D must come after 6B specifically. The reason is that §7.1 Phase 1 step 4 ( One potential out-of-order observation: could 6F be merged after 6D?
Q3 — Phase-1 dwell invariantThis is the sharpest issue in the PR. The claim as written is not fully consistent with §7.2 The rationale section says:
Reading §7.2 ("Why we will not support encrypted → cleartext"):
§7.2 specifically addresses §4.2 (raft envelope). But Phase 1 has a parallel one-way characteristic that §7.2 does not call out explicitly: once The dwell is valid; the "rollback" framing is not The Phase-1-only state is genuinely safe as a pause point:
What operators cannot do is revert Phase 1: the Suggested fix Change the rationale text from "operators can dwell in, observe, and rollback from" to "operators can dwell in and observe indefinitely without committing to Phase 2" — and add a sentence noting that Phase 1 itself is also one-way, consistent with §7.2. This prevents an operator reading the design doc from expecting that "staying in Phase 1" gives them a true rollback option and planning their runbook around an escape hatch that doesn't exist. Status check on 5D/5E changesThe 5D status update (shipped, PR #760) and the status header change are accurate. The 5E deferral rationale ("does not gate any Stage 6 work") is correct — the existing explicit-batch CLI from PR #759 is sufficient for hand-authored bootstraps, and the auto-batch convenience can land after 6F without any on-disk or wire-format dependency. Summary
The 6A row and dependency chain are strong enough to start the 6A implementation. The two doc fixes (6C/6E row wording and Phase-1 dwell language) are low-risk edits that can land in this PR or as the first commit of 6A. |
There was a problem hiding this comment.
Code Review
This pull request updates the data-at-rest encryption design document, marking Stage 5 as partially shipped and decomposing Stage 6 into six granular sub-milestones (6A-6F) to ensure incremental, safe delivery. A detailed rationale section was added to explain the dependencies and safety boundaries for each sub-stage. Feedback was provided to include specific section references in the milestone table and to expand the document with zero-downtime cutover strategies and operational risk mitigations for the proposed migrations.
| | 6B | §6.5 KEK plumbing — `--kekFile` / `--kekUri` flags + `internal/encryption/kek` selection at startup + applier KEK-unwrap path | open | — | | ||
| | 6C | §6.5 `--encryption-enabled` + §9.1 startup refusal guards (sidecar present without flag, KEK mismatch, sidecar/raft index gap, filesystem fsync support, `local_epoch` rollback / exhaustion, `node_id` collision) | open | — | | ||
| | 6D | §6.6 `enable-storage-envelope` admin RPC + §7.1 Phase-1 storage cutover (§6.2 toggle ON) + Voters ∪ Learners capability gate | open | — | | ||
| | 6E | §6.6 `enable-raft-envelope` admin RPC + §7.1 Phase-2 raft cutover + `raft_envelope_cutover_index` sidecar record + `internal/raftengine/etcd/engine.go` `applyNormalEntry` unwrap hook activation + `ErrRaftUnwrapFailed` HaltApply path | open | — | |
There was a problem hiding this comment.
For consistency and clarity, consider adding a reference to §6.3 in the scope of this sub-milestone. Furthermore, as this design involves a 'Phase-2 raft cutover' (a production-grade system migration), please ensure the document details strategies for zero-downtime cutover and mitigations for operational risks like rolling upgrade compatibility, as required by repository guidelines.
| | 6E | §6.6 `enable-raft-envelope` admin RPC + §7.1 Phase-2 raft cutover + `raft_envelope_cutover_index` sidecar record + `internal/raftengine/etcd/engine.go` `applyNormalEntry` unwrap hook activation + `ErrRaftUnwrapFailed` HaltApply path | open | — | | |
| | 6E | §6.6 enable-raft-envelope admin RPC + §7.1 Phase-2 raft cutover + raft_envelope_cutover_index sidecar record + internal/raftengine/etcd/engine.go applyNormalEntry unwrap hook activation (§6.3) + ErrRaftUnwrapFailed HaltApply path | open | — | |
References
- When a design document identifies a significant operational risk, such as the inability to perform rolling upgrades, it must also detail potential mitigation strategies, like implementing a temporary 'bridge' or 'proxy' mode.
- When designing a production-grade system migration, the plan must consider and detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, logical replication) to avoid service interruption.
…hase-1 dwell wording claude[bot] round-1 review on cbaa364 surfaced three actionable doc improvements (no blocking issues, plan acceptable to start 6A implementation): 1. **6C row was misleading.** `--encryption-enabled` is introduced in 6A as the wire-mutators gate, not in 6C. The 6C row read as if the flag were new in 6C, which would confuse the 6A vs 6C reviewer split. Rewrote 6C as "Extend --encryption-enabled (introduced in 6A) with §9.1 startup refusal guards" so the boundary is explicit. 2. **6E row omitted two §7.1 Phase-2 pieces.** The original 6E row covered the admin RPC, sidecar record, engine unwrap hook, and ErrRaftUnwrapFailed — but not the coordinator-side wrap-on-propose switch nor the §7.1 steps 1–6 proposal quiescence barrier. Both are load-bearing safety mechanisms for Phase 2: without the barrier, cleartext entries appended at indexes > cutover would fail every apply with a GCM verification error on followers. Added both pieces to the 6E row so the 6E implementer cannot accidentally ship Phase 2 without them. 3. **Phase-1 dwell rationale claimed a non-existent rollback option.** The original text said operators could "dwell in, observe, and rollback from" Phase 1 before committing to Phase 2. §7.2 is unambiguous: encrypted → cleartext is not supported under any phase; the only path back is dump-and-reload. §9.1's sidecar-present guard activates cluster-wide on the first enable-storage-envelope commit, so Phase 1 is also one-way. Rewrote to "dwell in and observe indefinitely before committing to Phase 2" and added an explicit "Phase 1 is not a rollback point" note so operators planning their runbook cannot mistake the dwell for an escape hatch. ## Additional improvements (claude[bot] optional / preventive) - **6A nil-KEK contract**: added explicit "returns typed ErrKEKNotConfigured through the ErrEncryptionApply HaltApply pipeline" wording to the 6A row + rationale bullet, so 6B has a clean API contract to implement against (drop a KEKUnwrapper in, the existing path fills). Documents that ApplyRegistration is fully functional in 6A (no KEK needed); only ApplyBootstrap and ApplyRotation hit the stub. - **6D → 6B load-bearing dependency**: added a separate rationale bullet calling out that §7.1 Phase 1 step 4 implicitly triggers the §5.6 DEK bootstrap when active.storage == 0, which routes through ApplyBootstrap's KEK-unwrap. This is a non-obvious cross-cut: the nominal 6C → 6D ordering is insufficient; 6B must precede 6D regardless of 6C's status. Stating this in the rationale closes the "why not 6C, 6D, 6B?" question before it gets asked in PR review. ## Caller audit Documentation-only PR — no code paths touched, no callers to audit. ## Verification - git diff: only the design doc changed, 44 insertions / 8 deletions. - The local pre-commit hook (make lint) reported 0 issues (no code paths to lint). Plan is now ready to start 6A implementation as the next PR.
|
@claude review Round-2 (commit Round-1 findings addressed
Optional / preventive improvements also applied
VerificationDoc-only — |
|
Claude finished @bootjp's task in 1m 43s —— View job Round-2 review
Verdict: all five round-1 findings correctly addressed
One new issue introduced in round-2Counting error in the 6F rationale bullet (
6A, 6B, 6C, 6D, 6E is five PRs, not four. The range and the count contradict each other. Suggested fix: This is the only defect in the round-2 edits. Everything else — the milestone table, the rationale section, the Phase-1 dwell language — is internally consistent and accurate. Gemini-code-assist suggestion (6E row + zero-downtime)The automated review asked for a §6.3 reference in the 6E scope — that reference is now present inline as "(Phase-2 leader-side §6.3 proposal-payload wrap)". The zero-downtime concern is addressed by the quiescence barrier (§7.1 steps 1–6, now in the 6E row) and rolling-upgrade compatibility is gated by the Voters ∪ Learners capability check in 6D — both mechanisms are now named in the table. No further additions needed there. SummaryThe doc is ready to merge after the one-word fix ( |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1b3cabf772
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| fills the path; until that lands, any operator who | ||
| hand-triggers `BootstrapEncryption` or `RotateDEK` on a | ||
| 6A-only binary with `--encryption-enabled` set hits a clean, | ||
| named error in the gRPC error rather than a process crash. |
There was a problem hiding this comment.
Avoid treating 6A KEK-missing applies as harmless errors
This section says a 6A-only cluster will return a clean gRPC error if an operator triggers BootstrapEncryption/RotateDEK before 6B, but the current FSM contract is fail-closed: kv/fsm_encryption.go wraps ApplyBootstrap/ApplyRotation errors with ErrEncryptionApply, and internal/raftengine/etcd/engine.go halts apply without advancing setApplied when HaltApply() is non-nil. If such an entry is Raft-committed, replicas can repeatedly halt on that index after restart, so this is an availability-bricking path rather than a recoverable request error.
Useful? React with 👍 / 👎.
Two issues on round-2 (commit 1b3cabf): ## codex P1: 6A re-wiring mutators creates the same cluster-halt class as PR #760 Round-2 wrote: "any operator who hand-triggers BootstrapEncryption or RotateDEK on a 6A-only binary with --encryption-enabled set hits a clean, named error in the gRPC error rather than a process crash." Codex correctly flagged this as wrong: - The FSM contract is fail-closed: kv/fsm_encryption.go wraps ApplyBootstrap/ApplyRotation errors with ErrEncryptionApply. - raftengine/etcd/engine.go halts apply without advancing setApplied when HaltApply() is non-nil. - So a 6A-only mutator RPC: handler proposes, entry commits, every replica's FSM apply returns ErrKEKNotConfigured, HaltApply fires, apply loop halts. Restart replays the same entry → halts again. Cluster bricks on that index. This is the same class of failure as PR #760 round-3 codex P1. The fix for #760 was "don't wire mutators until applier exists"; the principled extension is "don't wire mutators until applier AND KEK are both present." ## Fix: revise 6A/6B split - **6A revised**: lands the §6.3 EncryptionApplier as INTERNAL scaffolding only. WithEncryption(applier) is wired at FSM construction so the dispatch path exists, but registerEncryptionAdminServer carries forward Stage 5D's posture — Proposer + LeaderView stay unwired, mutating RPCs continue to refuse with FailedPrecondition. No flag introduced. The applier is reachable only through direct FSM.Apply calls in unit tests. 6A's value is unblocking 6B/6D by providing a real applier to bind KEK and cutover work against. - **6B revised**: lands KEK plumbing AND introduces the --encryption-enabled flag AND re-enables the mutator wiring gated on (--encryption-enabled AND KEKConfigured()). Both conditions must hold, so a binary with the flag set but no KEK source loaded continues to refuse mutators at the RPC boundary — preventing the cluster-halt path codex identified. - **6C revised**: now extends the flag "introduced in 6B" (not 6A) since the introduction moved to 6B. - **6D rationale rewritten**: the previously-special "6D depends on 6B because §7.1 Phase 1 step 4 implicitly triggers ApplyBootstrap" dependency is now a special case of the general "6B gates all mutator wiring" rule. Both 6D and 6E depend on 6B unconditionally — 6B is what makes any EncryptionAdmin mutator reachable. The 6D row drops the long inline dependency note since the rationale section covers it. The defense-in-depth ErrKEKNotConfigured marker in the applier is retained — a future refactor that bypasses the 6B RPC gate should produce a named grep-able error in logs rather than a nil pointer dereference. The marker is no longer the production safety boundary (the RPC gate is); it is a belt-and-suspenders backstop. ## claude[bot] round-2 nit: "four" → "five" The 6F rationale bullet said "after the four load-bearing PRs (6A–6E)". The range 6A–6E is five PRs. Fixed. ## Caller audit Doc-only — no code paths touched, no callers. ## Verification - git diff scope: design doc only, 81 insertions / 53 deletions. - make lint: 0 issues (no code to lint).
|
@claude review Round-3 (commit codex round-2 P1Codex flagged round-2 wording "any operator who hand-triggers BootstrapEncryption or RotateDEK on a 6A-only binary with --encryption-enabled set hits a clean, named error in the gRPC error rather than a process crash" as wrong:
This is the same cluster-halt class as PR #760 r3 codex P1. Fix: revise 6A/6B split
The defense-in-depth claude[bot] round-2 nitfour → five (6A–6E is five PRs). Fixed. Caller auditDoc-only — no code paths touched. Verification
|
|
Claude finished @bootjp's task in 3m 1s —— View job Round-3 review
Verdict: all round-2 findings correctly addressed; one minor precision issue in the 6A rationale
One new precision issue: "reachable only via unit tests" is not quite right
This is accurate for a same-version 6A cluster, but not in a rolling 6A→6B upgrade. This is the correct behavior — a 6A node cannot safely apply an entry it cannot KEK-unwrap, and the fail-closed halt is what you want — but describing the applier as "only reachable via unit tests" elides the mixed-version case. Suggested wording:
Severity: low — the cluster's behavior is correct (fail-closed halt is right), and a 6A-only cluster has no exposure. This only matters for the rolling upgrade runbook. The 6A PR itself can carry the precise language. One clarity note: "6C orthogonal... can ship at any point after 6B"Line 150–151: "the actual shipping dependency is 6A → 6B → 6D → 6E with 6C orthogonal (it strictly extends This is structurally correct (6C has no code dependency on 6D/6E), but the phrase "can ship at any point after 6B" implies 6C could trail 6D. Two of 6C's guards — Suggested addendum: "In practice, 6C should ship before 6D — the KEK-mismatch and sidecar/raft-index-gap guards are most load-bearing during the Phase-1 cutover window." Severity: low — the table order already puts 6C before 6D, and this is a doc-clarity issue rather than a correctness issue. SummaryThe codex P1 fix is correct and complete. The 6A/6B split cleanly separates the FSM-apply scaffolding from the RPC-gate boolean, the double gate ( |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a5754f6027
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| them), but the actual shipping dependency is 6A → 6B → 6D | ||
| → 6E with 6C orthogonal (it strictly extends `--encryption-enabled` |
There was a problem hiding this comment.
Make 6C a prerequisite for encryption cutovers
Treating 6C as optional ordering (6A → 6B → 6D → 6E with 6C “orthogonal”) is unsafe because §9.1’s startup refusal checks are what prevent post-cutover downgrade/misconfiguration states (for example starting with a sidecar but --encryption-enabled off, or restarting from stale local_epoch state that can reissue nonces). If 6D/6E ship before those guards, operators can enable envelopes and then restart into states the design elsewhere assumes are blocked, violating the security and correctness invariants this rollout is meant to enforce.
Useful? React with 👍 / 👎.
Codex r4 P1 (line 150): the r3 rationale described 6C (§9.1
startup refusal guards) as orthogonal/optional with the shipping
order 6A → 6B → 6D → 6E, but the §9.1 invariants are what
prevent post-cutover restart misconfigurations. If 6D/6E ship
before 6C is in the binary:
- An operator who restarts a node with sidecar present but
--encryption-enabled off silently reads cleartext past an
envelope cutover (§9.1 sidecar-present-without-flag guard
not yet wired).
- A binary without local_epoch rollback / exhaustion guards
can reissue a stale local_epoch and recycle a previously-
used GCM nonce (§4.1 invariant violation).
- KEK mismatch / sidecar-raft index gap / fsync-unsupported /
node_id collision states all bypass the documented refusal.
All four are security-class regressions the §9.1 guards exist to
prevent, so 6C is a hard predecessor for the cutover milestones,
not a separable cleanup.
Update the milestone-rationale paragraph to "6A → 6B → 6C → 6D
→ 6E with 6C as a hard prerequisite". Add the dependency text
to the 6D and 6E rows of the milestone table. The
6D-implicit-bootstrap rationale (§7.1 Phase 1 step 4 routing
through ApplyBootstrap's KEK-unwrap) stays correct, and 6C is
added as a second hard predecessor alongside 6B rather than
replacing it.
Proposal-only — no code in this PR.
|
@claude review Round 4 — Codex P1 (line 150): r3 described 6C (§9.1 startup refusal guards) as orthogonal/optional with order 6A → 6B → 6D → 6E, but §9.1 invariants prevent post-cutover restart misconfigurations. Without 6C in the binary:
All four are security-class regressions. 6C is now a hard prerequisite for 6D/6E, not orthogonal. Updated rationale paragraph: "6A → 6B → 6C → 6D → 6E with 6C as a hard prerequisite". Milestone table rows for 6D / 6E both call out the 6C dependency explicitly. Proposal-only. |
|
Claude finished @bootjp's task in 1m 21s —— View job Round-4 review
Verdict: Codex P1 correctly addressed; no new issues introduced
The Round 4 diff is clean: the "6D and 6E both depend on 6B unconditionally" paragraph is replaced with a full security-class explanation of why 6C is a hard prerequisite — sidecar-present-without-flag silently reads cleartext past an envelope cutover, and a stale One carry-over note from R3 (pre-existing, low severity)The 6A table row and rationale still read "reachable only through direct SummaryThe Codex P1 fix is correct and complete. The dependency chain |
…n-entry caveat
claude[bot] round-3 low-severity (line 75–76): the r3 6A bullet
described the applier as 'reachable only through direct FSM.Apply
calls in unit tests', which is accurate for a same-version 6A
cluster but elides the rolling 6A→6B upgrade case.
Because WithEncryption(applier) is wired in 6A, a 6A node
co-resident with a 6B+ node will dispatch any 0x03/0x04 entry
the 6B leader commits to its (KEK-less) applier and halt with
ErrKEKNotConfigured. The halt itself is fail-closed by design —
a 6A replica cannot safely apply an entry it cannot KEK-unwrap —
but operators must not set --encryption-enabled on a 6B+ node
until every Raft member (voters AND learners, same Voters ∪
Learners gate as §7.1) is on 6B+.
This commit:
- Reworded the 6A row to acknowledge the caveat and point at
the rationale section for the full story.
- Reworked the 6A rationale bullet to:
- Replace 'reachable only through... unit tests' with
'operator-inert in a same-version 6A cluster (exercised
only via direct FSM.Apply unit tests)' — preserves the
same-version meaning without claiming the cross-version
case is unreachable.
- Add a new 'Rolling 6A→6B upgrade caveat' paragraph naming
the Voters ∪ Learners gate and pointing forward at 6B's
runbook.
Doc-only — no code paths touched, no callers.
|
@claude review Round-4: 2 commits — user's parallel codex r4 P1 (user's commit
|
|
Claude finished @bootjp's task in 2m 36s —— View job Round-5 review
Verdict: all round-4/r4 findings correctly addressed; no new defects
One pre-existing nuance in the 6C rationale bullet (not introduced in round-5)The 6C rationale bullet reads: "They are independent of the FSM apply path and the cutover RPCs, so they land in their own PR." This sentence correctly explains why 6C is a separate PR (code-level independence). But read alongside the subsequent bullet ("6C as a hard prerequisite, NOT optional/orthogonal"), it could leave a reader uncertain whether "independent of the cutover RPCs" means "no ordering constraint relative to 6D/6E." The code is independent; the safety ordering is not. This pre-dates round-5 and was the root cause of the "orthogonal" reading that codex r4 P1 corrected. The subsequent rationale bullet already contains the authoritative statement. Whether to patch the 6C bullet itself or leave the adjacent bullet as the authoritative fix is judgment call — I'd lean toward a one-sentence addendum ("independent in implementation; ordered before 6D/6E for safety — see the next bullet"), but it is not blocking. Severity: very low / cosmetic — the hard-prerequisite language in the next bullet is unambiguous, and the confusion it risks would have to survive reading two adjacent paragraphs. No new defects in the round-5 commitsBoth commits are precise and narrow:
The five-count (6A–6E = five PRs) remains correct. The 6C flag attribution ("introduced in 6B") remains correct. All round-1 through round-4 findings remain resolved. The doc is ready to merge. 6A implementation can begin against this milestone table. |
… of cutovers' claude[bot] r5 cosmetic note (judgment call, not blocking): the 6C rationale bullet read 'They are independent of the FSM apply path and the cutover RPCs, so they land in their own PR.' That 'independent of the cutover RPCs' phrase, read on its own, could imply 'no ordering constraint relative to 6D/6E' — which is exactly the wrong reading codex r4 P1 corrected. The adjacent bullet contains the authoritative hard-prerequisite statement, but a one-sentence addendum here removes the possible confusion before the reader gets to it. Fix: clarify that the independence is at the implementation level only — the safety ordering is 6C before 6D/6E — and point forward at the next bullet for the security rationale. Doc-only — no code paths touched, no callers.
|
@claude review Round-5 (commit r5 cosmetic noteYour r5 verdict said the doc was ready to merge but flagged this 6C rationale sentence as a judgment-call clarification:
The phrase "independent of the cutover RPCs" — read on its own — could imply "no ordering constraint relative to 6D/6E", which is precisely the wrong reading codex r4 P1 corrected. FixReworded to clarify the independence is implementation-level only, and explicitly point forward at the next bullet for the safety ordering: "The guard code is independent of the FSM apply path and the Removes the possible misreading before it reaches the reader. Caller auditDoc-only — no code paths touched. Verification
|
|
Claude finished @bootjp's task in 51s —— View job Round-6 review
Verdict: r5 cosmetic fix correctly applied; no new issues
The diff is exactly 6 insertions / 3 deletions in the 6C rationale bullet, as stated. The new text is strictly better than my r5 suggestion — the parenthetical "(it runs at startup, before either has a chance to fire)" I did not specify but is a concrete mechanistic explanation for why 6C's code is decoupled from the cutover RPCs. A reader who wonders "what does independent mean here?" gets an immediate answer without having to infer it from the FSM/RPC structure. The two-sentence structure now reads cleanly:
The adjacent bullet ("6C as a hard prerequisite, NOT optional/orthogonal") remains the authoritative statement of the safety ordering. The r5 fix now makes the 6C bullet self-consistent rather than relying on the reader to synthesize across two paragraphs. No carry-over issues remain. The cumulative state of all six rounds:
The doc is ready to merge. 6A implementation can begin against this milestone table. |
… + FSM WithEncryption wiring Second slice of Stage 6A per the PR #762 plan. Wires the §6.3 Applier (landed in part 1) into the FSM construction path with a Pebble-backed WriterRegistryStore adapter so the 0x03 RegisterEncryptionWriter dispatch path is fully functional end-to-end. Mutator-RPC wiring in registerEncryptionAdminServer remains OFF per Stage 5D's posture — re-enabled in Stage 6B alongside KEK plumbing gated on (--encryption-enabled AND KEKConfigured()). ## What this commit ships ### store/encryption_glue.go (additions) - pebbleWriterRegistry — internal adapter satisfying encryption.WriterRegistryStore. Both Get and Set hold dbMu for read only (the apply path's applyMu in kv/fsm.go is the authoritative serialiser for the §4.1 case-dispatch read-modify-write). Set uses pebble.Sync so the row is durable before FSM apply returns — a crash between FSM dispatch returning success and Pebble flushing the WAL would otherwise leave the Raft log claiming the entry applied but the registry row missing on restart, violating §4.1 case 2's monotonic-epoch invariant. - WriterRegistryFor(MVCCStore) factory — type-asserts against *pebbleStore and returns the adapter, or ErrUnsupportedStoreForWriterRegistry if the store is not the Pebble-backed implementation. Construction-time refusal so the failure surface is "binary refuses to start" rather than "FSM apply panics inside Raft loop". - ErrUnsupportedStoreForWriterRegistry — new sentinel. ### store/writer_registry_test.go (new) Three table-driven unit tests: - TestWriterRegistryFor_RoundTrip — fresh Pebble store, missing row reads as (nil, false, nil), Set writes, Get returns byte-identical value. - TestWriterRegistryFor_Overwrite — Set on existing key replaces the value; no MVCC versioning, no append. - TestWriterRegistryFor_RejectsNonPebble — non-Pebble MVCCStore stub returns ErrUnsupportedStoreForWriterRegistry at construction. ### main.go (FSM construction site, line 694) Replaces `sm := kv.NewKvFSMWithHLC(st, clock)` with the applier-wired form: reg, err := store.WriterRegistryFor(st) if err != nil { _ = st.Close() return nil, nil, errors.Wrapf(err, ...) } applier, err := encryption.NewApplier(reg) if err != nil { _ = st.Close() return nil, nil, errors.Wrapf(err, ...) } sm := kv.NewKvFSMWithHLC(st, clock, kv.WithEncryption(applier)) Both error paths Close the store before returning so a construction-time failure doesn't leak the Pebble handle. Adds `internal/encryption` to the import list (the only new import in main.go). ## What this commit does NOT do - registerEncryptionAdminServer is unchanged. Mutating EncryptionAdmin RPCs continue to refuse with FailedPrecondition exactly as Stage 5D left them. The existing TestEncryptionAdmin_MutatingRPCRefusedUntilStage6 test (both `sidecar_empty` and `sidecar_set` subcases) passes unchanged. - The applier's ApplyBootstrap / ApplyRotation paths still return ErrKEKNotConfigured. Stage 6B re-enables mutator wiring AND threads in the KEK unwrapper at the same time so a re-enabled mutator never hits the ErrKEKNotConfigured halt path. ## Stage 5D safety boundary The two PR760-merged regression tests pass on this branch: - TestEncryptionAdminFullNodeID_DistinctPerRaftId — PASS - TestEncryptionAdmin_MutatingRPCRefusedUntilStage6/sidecar_empty — PASS - TestEncryptionAdmin_MutatingRPCRefusedUntilStage6/sidecar_set — PASS - TestRegisterEncryptionAdminServer_Registers — PASS The new FSM wiring path does not weaken any of these. The applier exists internally; the gRPC layer continues to refuse the operator-facing mutators that would route into it. ## Five-lens self-review 1. Data loss — pebbleWriterRegistry uses pebble.Sync; FSM apply only returns success after the row is durable. Construction-time error paths Close the Pebble store on failure (no handle leak). 2. Concurrency / distributed failures — dbMu.RLock is sufficient because the apply path's applyMu in kv/fsm.go serialises all 0x03 dispatches; the read- modify-write pattern in Applier.ApplyRegistration is safe under that lock. 3. Performance — one Get + at most one Set per ApplyRegistration. pebble.Sync adds wal-flush cost, bounded. 4. Data consistency — writer-registry rows live outside the MVCC namespace (no commit timestamps, no MVCC visibility). Pebble Sync semantics match the §4.1 contract that a registered writer's first nonce-issuing write under that DEK is durably preceded by its registry row. 5. Test coverage — 3 new store-package unit tests for the adapter; the part-1 applier tests already cover the applier-side dispatch; Stage 5D regression tests confirm no operator-surface change. ## Verification - go test -race -timeout=60s -run 'TestWriterRegistryFor' ./store/... — PASS - go test -race -timeout=60s -run 'TestApply|TestNewApplier' ./internal/encryption/... — PASS - go test -race -timeout=60s -run 'TestEncryptionAdmin|TestRegisterEncryptionAdminServer' . — PASS - go test -race -timeout=180s -run 'TestEncrypt|TestApplier|TestFSM|TestKvFSM|TestWriterRegistry' ./... — PASS - go build ./... — PASS - golangci-lint run ./... — 0 issues on touched files The full adapter test suite has a pre-existing 120s-timeout flake on race-mode (SQS reaper / HLC lease renewal long- running goroutines visible in the dump). The encryption- specific adapter tests pass cleanly when run in isolation; the flake is unrelated to this PR's changes (no adapter code is touched). Will run the suite again before marking the PR ready for review.
…ption wiring (#765) ## Summary First slice of **Stage 6A** per the [PR #762 plan](#762): the load-bearing safety boundary that lands the §6.3 `EncryptionApplier` so future PRs can re-enable mutating-RPC wiring without bricking the cluster. This PR is doc-driven by the merged Stage 6 plan; specifically the 6A row: > §6.3 `EncryptionApplier` concrete implementation (writer-registry insert + sidecar mutate + keystore update; nil-KEK code paths in `ApplyBootstrap` / `ApplyRotation` return a typed `ErrKEKNotConfigured` defense-in-depth marker until 6B fills it) + `kv.NewKvFSMWithHLC(... WithEncryption(applier))` wiring. **Mutator wiring in `registerEncryptionAdminServer` stays OFF, identical to Stage 5D**. ## What this PR ships ### Part 1 (this commit, `e2475b84`) — applier package - **`internal/encryption/applier.go`** (new) — concrete `Applier` satisfying `kv.EncryptionApplier`: - `WriterRegistryStore` interface (abstract Get/Set over the `!encryption|writers|<dek_id>|<uint16(node_id)>` Pebble namespace; kept separate from `store.MVCCStore` because writer-registry rows live outside MVCC). - `ApplyRegistration` implements the §4.1 four-case dispatch (insert / monotonic advance / rollback halt / uint16-collision halt). - `ApplyBootstrap` and `ApplyRotation` return the typed `ErrKEKNotConfigured` defense-in-depth marker — Stage 6B will replace them with the real KEK-unwrap + sidecar + keystore path. - **`internal/encryption/errors.go`** — add `ErrKEKNotConfigured` sentinel with godoc referencing the PR #762 safety-boundary decomposition. - **`internal/encryption/applier_test.go`** (new) — 7 unit tests covering all four §4.1 cases, construction-time nil-registry refusal, store-error propagation, and the 6B-deferred stub contracts. Uses a map-backed `WriterRegistryStore` fake. ### Part 2 (follow-up commit in this PR) — FSM wiring - Concrete `WriterRegistryStore` implementation backed by the FSM's Pebble store (`store/encryption_glue.go` or new file). - `kv.NewKvFSMWithHLC(... kv.WithEncryption(applier))` wiring at `main.go:694`. The Part 1 commit compiles and tests pass on its own; Part 2 plumbs it into production. ## What this PR does NOT do `registerEncryptionAdminServer` is **untouched**. Mutating EncryptionAdmin RPCs continue to refuse with `FailedPrecondition` exactly as Stage 5D left them. The Stage 6 plan's safety invariant — do not re-enable mutator wiring until both the applier AND a KEK unwrapper are present — holds: this PR lands the applier (and the FSM dispatch is no longer a halt path for `ApplyRegistration`), but the KEK plumbing (6B) is still missing. 6B will re-enable the mutator wiring gated on (`--encryption-enabled` AND `KEKConfigured()`), both conditions required. ## Caller audit - `kv.EncryptionApplier` interface — implemented by `*encryption.Applier`. No other callers; the FSM dispatch is the sole consumer. - `internal/encryption/errors.go` exported sentinels — `ErrKEKNotConfigured` is new; no callers yet outside `applier.go` (the kv/fsm dispatch will wrap any non-nil return with `ErrEncryptionApply` regardless of the inner marker). ## Five-lens self-review 1. **Data loss** — applier's writes go through `WriterRegistryStore.SetRegistryRow`, which the production wiring (Part 2) routes through `pebble.Sync`. No silent skip path: every error return is propagated; the `kv/fsm` dispatch marks them with `ErrEncryptionApply` for HaltApply. 2. **Concurrency / distributed failures** — Applier carries no in-memory state between calls; the `WriterRegistryStore` is the single source of truth. FSM apply already serialises `0x03` entries via `applyMu`, so the §4.1 case-dispatch read-modify-write is safe under that lock. 3. **Performance** — `ApplyRegistration` does one Get + at most one Set per call. Production Pebble will add the sync cost; bounded. 4. **Data consistency** — §4.1 case 1/2/3/4 dispatch matches the design doc. Case 3 (rollback) halts apply; the 6C startup guard will prevent it from being reachable in production. Case 4 (uint16 collision) similarly halts and is the per-node backstop for §6.1's cluster-wide uniqueness check. 5. **Test coverage** — 7 unit tests covering all four §4.1 cases, construction refusal, store-error propagation, and the 6B-deferred stub contracts. Map-backed fake mirrors the production semantics. ## Test plan - [x] `go test -race -timeout=60s ./internal/encryption/...` — PASS - [x] `go build ./...` — PASS - [x] `golangci-lint run ./internal/encryption/...` — 0 issues - [ ] After Part 2 commit: `go test -race ./...` (full suite) — Will run before requesting review - [ ] After Part 2 commit: Re-run jepsen-relevant suite (none of 6A's behaviour is operator-visible, so no envelope-cutover Jepsen workload applies; documented in PR description) ## Plan: 6A → 6B → 6C → 6D → 6E → 6F This is the first of six PRs per [PR #762](#762). 6B will follow with the KEK plumbing + mutator gate. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Integrated encryption into core system operations with enhanced writer registration and persistence. * Added encryption configuration validation and detailed error reporting for setup issues. * **Tests** * Comprehensive test coverage for encryption functionality, data storage, and integration scenarios. * Verification tests for encryption configuration and startup validation. * **Chores** * Enhanced application startup sequence with encryption support integration. <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/bootjp/elastickv/pull/765?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai -->
…on with KEK plumbing (#768) ## Summary Stage 6B part 1 per the [PR #762 plan](#762). Lands the real `ApplyBootstrap` and `ApplyRotation` paths in `internal/encryption.Applier` — KEK-unwrap + Keystore install + crash-durable sidecar mutate + writer-registry batch insert. Replaces the Stage 6A `ErrKEKNotConfigured` stubs. Mutator-RPC wiring in `registerEncryptionAdminServer` stays OFF (Stage 5D posture) so the operator surface is unchanged. The follow-up **6B-2** PR threads `--kekFile` / `--encryption-enabled` flags + startup KEK loader + keystore re-populate + re-enables mutator wiring gated on `(--encryption-enabled AND KEKConfigured())`. ## API additions - **`KEKUnwrapper` interface** — local declaration so the encryption package does not pick up a transitive dependency on the kek package. `kek.Wrapper` satisfies it structurally. - **`ApplierOption` functional-options**: `WithKEK`, `WithKeystore`, `WithSidecarPath`, `WithNowFunc`. The Stage 6A caller `NewApplier(reg)` continues to work byte-for-byte. ## Real ApplyBootstrap (§5.6 step 1a) 1. KEK-unwrap storage + raft wrapped DEKs. 2. `Keystore.Set` both (idempotent on matching bytes; `ErrKeyConflict` on mismatch → halts). 3. `WriteSidecar` with `Active.{Storage,Raft}` set + `keys[]` map updated; crash-durable via §5.1 atomic write. 4. Batch-insert every `BatchRegistry` row via `ApplyRegistration` (§4.1 case-2-idempotent makes replays no-op). Returns `ErrKEKNotConfigured` if any of `WithKEK` / `WithKeystore` / `WithSidecarPath` is missing — the three are an indivisible trio. **Crash-recovery ordering**: `Keystore.Set` (cheap, in-memory) → `WriteSidecar` (durable IO) → batch insert. Recovery is Raft replaying the same entry; every step is idempotent for matching inputs. ## Real ApplyRotation (§5.2 RotateDEK) Handles `RotateSubRotateDEK` only; other sub-tags (rewrap-deks, retire-dek, enable-storage-envelope, enable-raft-envelope) halt with `ErrEncryptionApply` so 6B-1 cannot silently mis-apply later sub-tags. For `RotateSubRotateDEK`: 1. KEK-unwrap the proposed wrapped DEK. 2. `Keystore.Set` at the new DEK ID. 3. `WriteSidecar`: Active slot updated for `Purpose` (Storage or Raft only — unknown Purpose halts); keys[] map updated; old DEKs preserved (retire is a separate Stage 6E sub-tag). 4. `ApplyRegistration` on the proposer's row. ## Tests 8 new applier tests covering happy paths + fail-closed boundaries: | Test | What it pins | |---|---| | `TestApplyBootstrap_ReturnsKEKNotConfigured` | No-options default returns `ErrKEKNotConfigured` | | `TestApplyBootstrap_FullyWired` | Happy path; factor'd per-aspect assertions | | `TestApplyBootstrap_PartialWiring` (4 sub-cases) | All-or-nothing trio invariant | | `TestApplyBootstrap_KEKUnwrapFailure` | Synthetic Unwrap error propagated; no partial install | | `TestApplyBootstrap_Replay_Idempotent` | Same payload twice succeeds | | `TestApplyRotation_ReturnsKEKNotConfigured` | No-options default | | `TestApplyRotation_FullyWired` | Happy path with pre-bootstrap | | `TestApplyRotation_UnknownSubTag` | Halts on `0xFE` sub-tag | | `TestApplyRotation_UnknownPurpose` | Halts on unrecognised Purpose enum | Tests use a `fakeKEK` that pads/truncates wrapped input to `encryption.KeySize` so `Keystore.Set`'s size check passes; same wrapped input always produces the same DEK bytes (deterministic), exercising idempotency. ## Caller audit (semantic change — `NewApplier` signature) `NewApplier(registry)` → `NewApplier(registry, opts...)`. Variadic; the no-options call (Stage 6A test caller) still compiles and behaves identically. Audited: ``` grep -rn 'NewApplier(' --include='*.go' ``` - `main.go:711` — production caller; unchanged in this PR (still passes no opts). 6B-2 will add `WithKEK + WithKeystore + WithSidecarPath`. - `internal/encryption/applier_test.go` — test callers; existing `NewApplier(newMapRegistryStore())` calls still work; new tests pass opts. No other callers. ## Five-lens self-review 1. **Data loss** — `WriteSidecar` uses the existing §5.1 crash-durable write protocol. `Keystore.Set` is in-memory only but the FSM apply path is replayed from Raft on restart. No silent skip: every error propagates through the HaltApply seam. 2. **Concurrency / distributed failures** — Applier carries no per-call mutable state. FSM apply already serialises `0x03/0x04/0x05` dispatches under `applyMu`. Keystore is internally locked. Sidecar IO is single-writer per §5.1. 3. **Performance** — `ApplyBootstrap`: 2 KEK unwraps + 2 `Keystore.Set` + 1 `WriteSidecar` + N writer-registry inserts (N = bootstrap-time cluster size). `ApplyRotation`: 1+1+1+1. Bounded. 4. **Data consistency** — sidecar mutations crash-durable; install order deterministic on replay; case-2-idempotent makes registry replay safe. 5. **Test coverage** — 8 unit tests; both happy paths and fail-closed boundaries exercised. The fakeKEK is deterministic so replay tests are reproducible. ## Test plan - [x] `go test -race -timeout=60s ./internal/encryption/...` — PASS - [x] `go build ./...` — PASS - [x] `golangci-lint run ./internal/encryption/...` — 0 issues - [x] Existing kv + store + main encryption tests — PASS - [ ] Re-run full suite (kv + store + adapter + main + Jepsen) once 6B-2 wires it in production — Will follow with 6B-2 ## Plan This is **Stage 6B part 1**. **Stage 6B-2** will: - Add `--kekFile` / `--kekUri` flags - Add `--encryption-enabled` flag (default off) - Load KEK at startup, populate Keystore from sidecar - Re-enable Proposer + LeaderView in `registerEncryptionAdminServer` gated on (`--encryption-enabled` AND `KEKConfigured()`) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Encryption key management bootstrap and rotation operations are now fully operational, including proper key unwrapping, secure keystore installation, and configurable options for encryption setup. * **Tests** * Comprehensive test coverage added for bootstrap and rotation scenarios, including idempotency validation, failure handling, and configuration verification. <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/bootjp/elastickv/pull/768?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai -->
…e scope limitation codex r2 P1 on round-2 (commit 0f946c1): the triple-gate encryptionMutatorsEnabled() is purely node-local; during a staged rolling upgrade where some peers still run the pre-6B-2 binary or have the flags unset, a mutator proposal committed by a fully-configured leader would reach those replicas' appliers and HaltApply on the under-configured side. ## Resolution This is a real concern but the cluster-wide readiness check belongs at the §7.1 Phase-1 cutover RPC layer (enable-storage-envelope) in Stage 6D, not in the node-local flag gate. PR #762's milestone breakdown explicitly assigns the Voters ∪ Learners capability fan-out gate to Stage 6D. Adding a partial cluster-wide check in Stage 6B-2 would duplicate the §7.1 capability gate and ship before the operator-facing GetCapability fan-out is wired into the admin CLI. The right resolution is to document the scope limitation explicitly so: 1. Operators reading the code understand the gate's scope and the rolling-upgrade discipline expected at this layer (same constraint as PR #765's 6A→6B caveat). 2. The Stage 6D PR has a clear handoff point — the cluster-wide gate is its scope, and this docstring points forward to that. ## Doc-only No code paths changed; the gate's three-flag readback is unchanged. Single 30-line docstring expansion explaining: - the gate is THIS NODE local - what cluster-wide readiness means at §7.1 - why Stage 6B-2 deliberately defers it to Stage 6D - the operator-discipline constraint matching PR #765 ## Caller audit Pure doc edit. No callers to audit. ## Verification - go build ./... — PASS - go test -race -timeout=60s -run 'TestEncryptionAdmin|TestRegisterEncryptionAdminServer' . — PASS - golangci-lint run ./... — 0 issues on touched files
Summary
3-phase rollout flags + applier wiring) into six sub-milestones 6A–6F, each independently reviewable, leaving the cluster in a production-safe state, unblocking the next.Stage 6 breakdown
EncryptionApplierconcrete implementation (writer-registry insert + sidecar mutate + keystore update) +kv.NewKvFSMWithHLC(... WithEncryption(applier))wiring + re-enable mutating-RPC wiring inregisterEncryptionAdminServer(gated on--encryption-enabled)--kekFile/--kekUriflags +internal/encryption/kekselection at startup + applier KEK-unwrap path--encryption-enabled+ §9.1 startup refusal guards (sidecar present without flag, KEK mismatch, sidecar/raft index gap, fsync support,local_epochrollback / exhaustion,node_idcollision)enable-storage-envelopeadmin RPC + §7.1 Phase-1 storage cutover (§6.2 toggle ON) + Voters ∪ Learners capability gateenable-raft-envelopeadmin RPC + §7.1 Phase-2 raft cutover +raft_envelope_cutover_indexsidecar record +internal/raftengine/etcd/engine.goapplyNormalEntryunwrap hook +ErrRaftUnwrapFailedHaltApply path--encryption-rotate-on-startuprequest flag + leader-elected rotation proposalWhy doc-first
CLAUDE.md design-doc-first workflow: "For any change that goes beyond a single-file edit ... write a
*_proposed_*.mddesign doc first and land it before the implementation." The existing partial doc already covers Stage 6 at the high level — this PR refines the milestone table so each of 6A–6F has a clear, independently-reviewable scope before implementation begins. 6A will land as the next PR.Five-lens self-review
--encryption-enabledgate).Test plan
go build ./...not required.go test ./...not required.make lintnot required (the local pre-commit hook ran and reported 0 issues).Summary by CodeRabbit
Release Notes