docs(encryption): Stage 6 plan — decompose into 6A-6F + record 5D shipped by bootjp · Pull Request #762 · bootjp/elastickv

bootjp · 2026-05-16T16:40:53Z

Summary

Record Stage 5D (PR feat(encryption): Stage 5 PR-D — main.go gRPC wiring for EncryptionAdmin #760, merged 2026-05-16T16:35:03Z) as shipped in the milestone table.
Mark Stage 5E (§5.6 step 1a capability fan-out helper) as deferred — its scope is narrowly the auto-batch CLI convenience and does not gate any Stage 6 work.
Decompose the original Stage 6 row (3-phase rollout flags + applier wiring) into six sub-milestones 6A–6F, each independently reviewable, leaving the cluster in a production-safe state, unblocking the next.
Add a Stage 6 sub-milestone rationale section explaining why the decomposition follows this shape — why 6A is the load-bearing safety boundary (it lands the §6.3 applier whose absence was PR feat(encryption): Stage 5 PR-D — main.go gRPC wiring for EncryptionAdmin #760 round-3 codex P1), why 6D/6E must be separate cutovers (intermediate Phase-1 storage-only state is a valid operator dwell), and why 6F is the last ergonomics PR.

Stage 6 breakdown

Sub-milestone	Scope
6A	§6.3 `EncryptionApplier` concrete implementation (writer-registry insert + sidecar mutate + keystore update) + `kv.NewKvFSMWithHLC(... WithEncryption(applier))` wiring + re-enable mutating-RPC wiring in `registerEncryptionAdminServer` (gated on `--encryption-enabled`)
6B	§6.5 KEK plumbing — `--kekFile` / `--kekUri` flags + `internal/encryption/kek` selection at startup + applier KEK-unwrap path
6C	§6.5 `--encryption-enabled` + §9.1 startup refusal guards (sidecar present without flag, KEK mismatch, sidecar/raft index gap, fsync support, `local_epoch` rollback / exhaustion, `node_id` collision)
6D	§6.6 `enable-storage-envelope` admin RPC + §7.1 Phase-1 storage cutover (§6.2 toggle ON) + Voters ∪ Learners capability gate
6E	§6.6 `enable-raft-envelope` admin RPC + §7.1 Phase-2 raft cutover + `raft_envelope_cutover_index` sidecar record + `internal/raftengine/etcd/engine.go` `applyNormalEntry` unwrap hook + `ErrRaftUnwrapFailed` HaltApply path
6F	§6.5 `--encryption-rotate-on-startup` request flag + leader-elected rotation proposal

Why doc-first

CLAUDE.md design-doc-first workflow: "For any change that goes beyond a single-file edit ... write a *_proposed_*.md design doc first and land it before the implementation." The existing partial doc already covers Stage 6 at the high level — this PR refines the milestone table so each of 6A–6F has a clear, independently-reviewable scope before implementation begins. 6A will land as the next PR.

Five-lens self-review

Data loss — n/a (doc-only).
Concurrency / distributed failures — n/a (doc-only). The breakdown itself is concurrency-safe: 6A leaves the cluster in the same production-inert state PR feat(encryption): Stage 5 PR-D — main.go gRPC wiring for EncryptionAdmin #760 left it in (mutators OFF by default via the new --encryption-enabled gate).
Performance — n/a (doc-only).
Data consistency — n/a (doc-only). The narrative explicitly notes that the Phase-1-only intermediate state (storage encrypted, raft cleartext) is a valid operator dwell, which is the consistency invariant operators rely on for rollback.
Test coverage — n/a (doc-only). Each of 6A–6F will land its own unit + property + Jepsen coverage as appropriate.

Test plan

No code touched — go build ./... not required.
No code touched — go test ./... not required.
No code touched — make lint not required (the local pre-commit hook ran and reported 0 issues).
Manually re-read the updated milestone table for internal consistency.

Summary by CodeRabbit

Release Notes

Documentation
- Updated milestone progress tracking for data-at-rest encryption feature development
- Extended stage breakdown with implementation details for stages 5D through 6F
- Clarified stage 6 sub-milestone planning and dependencies

…-6F) Stage 5D (PR #760) merged at 2026-05-16T16:35:03Z. Stage 5E (§5.6 step 1a capability fan-out helper) is marked deferred — its scope is narrowly the auto-batch CLI convenience and does not gate any Stage 6 work. The existing CLI accepts the writer batch explicitly today (PR #759), which is sufficient for hand-authored bootstraps. ## Stage 6 decomposition Stage 6 was originally one row in the milestone table ("3-phase rollout flags + applier wiring", §6.5 + §7.1). The chain of dependencies makes it impractical to land in a single PR. This commit decomposes it into 6A through 6F following the same shipping-order principle as Stage 5 (5A → 5D shipped, 5E deferred): each milestone is independently reviewable, leaves the cluster in a production-safe state, and unblocks the next. - 6A — concrete EncryptionApplier (writer-registry insert + sidecar mutate + keystore update) + FSM WithEncryption wiring + re-enable Proposer/LeaderView in registerEncryptionAdminServer (gated on --encryption-enabled). The load-bearing safety boundary: PR #760 r3 codex P1 established that the mutator wiring is unsafe to land before the applier exists. - 6B — KEK plumbing: --kekFile / --kekUri flags + internal/encryption/kek selection at startup + applier KEK-unwrap path. - 6C — --encryption-enabled + §9.1 startup refusal guards (sidecar present without flag, KEK mismatch, sidecar/raft index gap, fsync support, local_epoch rollback / exhaustion, node_id collision). - 6D — Phase-1 cutover: enable-storage-envelope admin RPC + §7.1 storage cutover + Voters ∪ Learners capability gate. - 6E — Phase-2 cutover: enable-raft-envelope admin RPC + §7.1 raft cutover + raft_envelope_cutover_index sidecar record + raftengine/etcd applyNormalEntry unwrap hook + ErrRaftUnwrapFailed HaltApply path. - 6F — optional --encryption-rotate-on-startup ergonomics flag. The narrative section after the milestone table explains the rationale for each sub-milestone — why 6A is load-bearing, why 6D and 6E must be separate cutovers (intermediate Phase-1-only state is a valid operator dwell), and why 6F is deferred to last. ## Doc-only PR No code changes. Reviewable on its own. Each of 6A–6F will follow as its own PR with its own design constraints + implementation + tests + Jepsen-relevant suites.

coderabbitai · 2026-05-16T16:41:05Z

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 31 minutes and 23 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3e51b71e-4249-4315-8e6e-038ce89c9526

📥 Commits

Reviewing files that changed from the base of the PR and between cbaa364 and 9c3dcea.

📒 Files selected for processing (1)

docs/design/2026_04_29_partial_data_at_rest_encryption.md

📝 Walkthrough

Walkthrough

The design document for "Data-at-rest encryption for elastickv" was updated to reflect current progress: stages 0–5 are now shipped with stage 5E deferred, and the milestone table was extended with detailed entries for stage 5D, 5E, and stage 6 sub-milestones 6A–6F. The "Stage 6 sub-milestone rationale" section was expanded to explain the dependency-ordered plan and per-sub-milestone scope.

Changes

Data-at-rest encryption design progress

Layer / File(s)	Summary
Milestone progress and stage 6 breakdown `docs/design/2026_04_29_partial_data_at_rest_encryption.md`	Top-level status updated to reflect stages 0–5 shipped and stage 5E deferred. Milestone table extended with stage 5D (gRPC wiring, shipped), stage 5E (deferred), and stages 6A–6F (open) with detailed implementation scope. Expanded rationale section provides dependency ordering and explains what each sub-milestone enables.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Possibly related PRs

bootjp/elastickv#707: Introduces the foundational encryption-at-rest design specification that this PR updates with current milestone progress.

Poem

🐰 Milestones march along the path,
Five stages shipped with steady math,
Stage six now breaks to A through F,
Each one planned with careful heft!
Progress tracked, the way ahead so clear. ✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically summarizes the main changes: recording Stage 5D as shipped and decomposing Stage 6 into six sub-milestones (6A-6F).
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/encryption-stage6

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

bootjp · 2026-05-16T16:41:35Z

@claude review

Stage 6 planning PR. Doc-only — decomposes the original Stage 6 row into 6A–6F sub-milestones and records 5D shipped. Each of 6A–6F is independently reviewable; 6A is the load-bearing safety boundary (lands the §6.3 applier whose absence was PR #760 r3 codex P1).

Looking for review on:

Sub-milestone scope coherence — does each row in the new table describe an independently-mergeable PR?
Dependency ordering — is the 6A → 6B → 6C → 6D → 6E → 6F sequence correct, or are there cross-cuts I missed?
Phase-1 dwell invariant — the rationale section claims storage-encrypted/raft-cleartext is a valid intermediate state operators can rollback from. Is that consistent with the rest of the design doc (§7.1, §9.1)?

No code paths touched.

claude · 2026-05-16T16:41:48Z

Claude finished @bootjp's task in 3m 52s —— View job

Stage 6 planning doc review

Read PR context and extract review questions
Read the full design doc (§6.2, §6.3, §6.5, §6.6, §7.1, §7.2, §9.1)
Review the diff
Assess sub-milestone scope coherence (Q1)
Assess dependency ordering (Q2)
Assess Phase-1 dwell invariant consistency (Q3)

Q1 — Sub-milestone scope coherence

The decomposition is solid overall, but three rows have gaps worth tightening before 6A lands as the next PR.

6A: applier + flag, but no KEK unwrapper yet

6A wires WithEncryption(applier) into the FSM and re-enables mutating RPCs behind --encryption-enabled. But the applier's ApplyBootstrap and ApplyRotation paths will need to KEK-unwrap the incoming wrapped DEKs, and that unwrapper doesn't exist until 6B. The rationale correctly notes that the flag defaults off, so this can't be triggered accidentally in production — but the 6A applier's KEK path will be called with a nil unwrapper if anyone sets the flag on a 6A-only binary.

The row should specify whether the 6A applier returns a typed error (e.g., ErrKEKNotConfigured) or panics on a nil KEK. A typed error is the right answer (consistent with the rest of the design's fail-closed stance), and stating it in the row sets the API contract for 6B to implement against. Without this, 6B's "thread a KEKUnwrapper into the applier" could add it as a post-hoc patch rather than a clean interface fill.

6C: flag description implies the flag is new

The 6C row reads §6.5 --encryption-enabled + §9.1 startup refusal guards. A reader scanning the table sees --encryption-enabled first appearing in 6C, but it was already introduced in 6A as a wire-mutators gate. This is a real source of confusion when filing the 6A and 6C PRs: the 6C PR reviewer won't know whether the flag is being introduced or extended, and the 6A PR reviewer may not know that the flag's full semantics are deferred to 6C.

Suggested rewording: extend --encryption-enabled with §9.1 startup refusal guards (KEK mismatch, sidecar/raft index gap, fsync support, local_epoch rollback/exhaustion, node_id collision). The --encryption-enabled already exists in 6A; 6C adds the §9.1 guards around it.

6E: two load-bearing pieces from §6.3 and §7.1 are not listed

The 6E row covers the enable-raft-envelope RPC, the cutover-index sidecar record, the engine hook activation, and ErrRaftUnwrapFailed. But it omits two other pieces that Phase 2 requires:

Coordinator-side wrap-on-propose (kv/coordinator.go / kv/sharded_coordinator.go) — §6.3 explicitly calls this out as a Phase-2 change: "wrap the proposal payload with the raft DEK before handing it to raftengine's submit path." Without the coordinator flipping its wrap-on-propose switch (step 5 of the Phase-2 quiescence sequence in §7.1), the engine-side unwrap hook fires on cleartext entries and breaks every apply after cutover. This is not in 6A (6A wires the FSM apply side, not the propose side) and is not in any other row. It belongs in 6E.
Proposal quiescence barrier (§7.1 Phase 2, steps 1–6 of leader_enable_raft_envelope) — the "block new USER proposal intake, drain in-flight queue, source-tag exemption" sequence is the safety mechanism that prevents cleartext entries from being appended at indexes > raft_envelope_cutover_index while the leader is still cleartext-proposing. The §7.1 text is unambiguous: "Without the barrier ... followers would ... attempt raftDEK.Unwrap on cleartext, and fail every apply with a GCM verification error." This is load-bearing correctness, not ergonomics, and it lives in the 6E PR. The row should name it.

Fix 6C/6E row wording →

Q2 — Dependency ordering

The 6A → 6B → 6C → 6D → 6E → 6F chain is correct for the primary dependencies. One cross-cut worth documenting explicitly in the rationale:

6B → 6D is load-bearing, not just transitive

The current rationale explains 6D as "the actual cutover" and notes it follows 6C, but doesn't call out why 6D must come after 6B specifically. The reason is that §7.1 Phase 1 step 4 (enable-storage-envelope) implicitly triggers the DEK bootstrap if active.storage == 0 — and that bootstrap (ApplyBootstrap) requires the KEK-unwrap path from 6B to create and wrap the initial DEK pair. If 6D were somehow attempted after 6C but without 6B, enable-storage-envelope would panic or return a KEK-not-configured error during the bootstrap sub-step. Stating this in the rationale closes the "why not 6C, 6D, 6B?" question before it gets asked in PR review.

One potential out-of-order observation: could 6F be merged after 6D?

--encryption-rotate-on-startup only requires a working rotation proposal path, which exists after 6A (the applier and mutating RPCs are wired). The flag doesn't require Phase 1 or Phase 2 to be active — it's just a convenience trigger for rotate-dek, which could be called manually from 6A onward. Placing 6F after 6E is the safest choice (no open questions about interaction with the Phase-2 quiescence barrier), and it's the right call for a doc-first plan, but the rationale could acknowledge this flexibility if operators ask to land the ergonomics earlier.

Q3 — Phase-1 dwell invariant

This is the sharpest issue in the PR.

The claim as written is not fully consistent with §7.2

The rationale section says:

"a valid intermediate state that operators can dwell in, observe, and rollback from if needed before committing to Phase 2"

Reading §7.2 ("Why we will not support encrypted → cleartext"):

"Disabling §4.2 would require rewriting WAL entries, which etcd raft does not support. The flag is therefore one-way; the only way back is dump-and-reload."

§7.2 specifically addresses §4.2 (raft envelope). But Phase 1 has a parallel one-way characteristic that §7.2 does not call out explicitly: once enable-storage-envelope has been Raft-committed on any node, the §9.1 guard activates — every node's data dir now contains a sidecar, and §9.1 refuses to start without --encryption-enabled. The elastickv-admin encryption disable command "refuses; documents the dump-and-reload path." So Phase 1 is also one-way under normal operation.

The dwell is valid; the "rollback" framing is not

The Phase-1-only state is genuinely safe as a pause point:

All new writes are storage-encrypted, all Raft proposals are still cleartext.
A node joining in Phase 1 sees a snapshot with encryption_state = 0b01 values and cleartext Raft entries; both are handled correctly by any encryption-capable binary.
Phase 2 has not started, so there is no raft_envelope_cutover_index and no WAL interleaving to worry about.
The operator can observe encryption status, run rewrite, and verify zero cleartext versions before deciding whether to proceed to Phase 2.

What operators cannot do is revert Phase 1: the enable-storage-envelope cluster flag is a committed Raft entry, sidecar presence triggers §9.1 startup guards on every node, and encryption disable refuses. The only path back to cleartext is dump-and-reload, same as Phase 2.

Suggested fix

Change the rationale text from "operators can dwell in, observe, and rollback from" to "operators can dwell in and observe indefinitely without committing to Phase 2" — and add a sentence noting that Phase 1 itself is also one-way, consistent with §7.2. This prevents an operator reading the design doc from expecting that "staying in Phase 1" gives them a true rollback option and planning their runbook around an escape hatch that doesn't exist.

Fix Phase-1 dwell wording →

Status check on 5D/5E changes

The 5D status update (shipped, PR #760) and the status header change are accurate. The 5E deferral rationale ("does not gate any Stage 6 work") is correct — the existing explicit-batch CLI from PR #759 is sufficient for hand-authored bootstraps, and the auto-batch convenience can land after 6F without any on-disk or wire-format dependency.

Summary

Area	Finding
5D/5E changes	Correct, no issues
6A scope	Clear, but should specify nil-KEK behavior (typed error, not panic)
6C row	Misleading — flag is introduced in 6A, not 6C; reword to "extend"
6E row	Missing coordinator wrap-on-propose + proposal quiescence barrier
6A→6F ordering	Correct; document 6B→6D as explicit load-bearing link
Phase-1 dwell claim	Valid as a pause point; "rollback from" is incorrect — Phase 1 is one-way per §7.2/§9.1

The 6A row and dependency chain are strong enough to start the 6A implementation. The two doc fixes (6C/6E row wording and Phase-1 dwell language) are low-risk edits that can land in this PR or as the first commit of 6A.

gemini-code-assist

Code Review

This pull request updates the data-at-rest encryption design document, marking Stage 5 as partially shipped and decomposing Stage 6 into six granular sub-milestones (6A-6F) to ensure incremental, safe delivery. A detailed rationale section was added to explain the dependencies and safety boundaries for each sub-stage. Feedback was provided to include specific section references in the milestone table and to expand the document with zero-downtime cutover strategies and operational risk mitigations for the proposed migrations.

gemini-code-assist · 2026-05-16T16:46:12Z

+| 6B | §6.5 KEK plumbing — `--kekFile` / `--kekUri` flags + `internal/encryption/kek` selection at startup + applier KEK-unwrap path | open | — |
+| 6C | §6.5 `--encryption-enabled` + §9.1 startup refusal guards (sidecar present without flag, KEK mismatch, sidecar/raft index gap, filesystem fsync support, `local_epoch` rollback / exhaustion, `node_id` collision) | open | — |
+| 6D | §6.6 `enable-storage-envelope` admin RPC + §7.1 Phase-1 storage cutover (§6.2 toggle ON) + Voters ∪ Learners capability gate | open | — |
+| 6E | §6.6 `enable-raft-envelope` admin RPC + §7.1 Phase-2 raft cutover + `raft_envelope_cutover_index` sidecar record + `internal/raftengine/etcd/engine.go` `applyNormalEntry` unwrap hook activation + `ErrRaftUnwrapFailed` HaltApply path | open | — |


For consistency and clarity, consider adding a reference to §6.3 in the scope of this sub-milestone. Furthermore, as this design involves a 'Phase-2 raft cutover' (a production-grade system migration), please ensure the document details strategies for zero-downtime cutover and mitigations for operational risks like rolling upgrade compatibility, as required by repository guidelines.

Suggested change

| 6E | §6.6 `enable-raft-envelope` admin RPC + §7.1 Phase-2 raft cutover + `raft_envelope_cutover_index` sidecar record + `internal/raftengine/etcd/engine.go` `applyNormalEntry` unwrap hook activation + `ErrRaftUnwrapFailed` HaltApply path | open | — |

| 6E | §6.6 enable-raft-envelope admin RPC + §7.1 Phase-2 raft cutover + raft_envelope_cutover_index sidecar record + internal/raftengine/etcd/engine.go applyNormalEntry unwrap hook activation (§6.3) + ErrRaftUnwrapFailed HaltApply path | open | — |

References

When a design document identifies a significant operational risk, such as the inability to perform rolling upgrades, it must also detail potential mitigation strategies, like implementing a temporary 'bridge' or 'proxy' mode.

When designing a production-grade system migration, the plan must consider and detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, logical replication) to avoid service interruption.

…hase-1 dwell wording claude[bot] round-1 review on cbaa364 surfaced three actionable doc improvements (no blocking issues, plan acceptable to start 6A implementation): 1. **6C row was misleading.** `--encryption-enabled` is introduced in 6A as the wire-mutators gate, not in 6C. The 6C row read as if the flag were new in 6C, which would confuse the 6A vs 6C reviewer split. Rewrote 6C as "Extend --encryption-enabled (introduced in 6A) with §9.1 startup refusal guards" so the boundary is explicit. 2. **6E row omitted two §7.1 Phase-2 pieces.** The original 6E row covered the admin RPC, sidecar record, engine unwrap hook, and ErrRaftUnwrapFailed — but not the coordinator-side wrap-on-propose switch nor the §7.1 steps 1–6 proposal quiescence barrier. Both are load-bearing safety mechanisms for Phase 2: without the barrier, cleartext entries appended at indexes > cutover would fail every apply with a GCM verification error on followers. Added both pieces to the 6E row so the 6E implementer cannot accidentally ship Phase 2 without them. 3. **Phase-1 dwell rationale claimed a non-existent rollback option.** The original text said operators could "dwell in, observe, and rollback from" Phase 1 before committing to Phase 2. §7.2 is unambiguous: encrypted → cleartext is not supported under any phase; the only path back is dump-and-reload. §9.1's sidecar-present guard activates cluster-wide on the first enable-storage-envelope commit, so Phase 1 is also one-way. Rewrote to "dwell in and observe indefinitely before committing to Phase 2" and added an explicit "Phase 1 is not a rollback point" note so operators planning their runbook cannot mistake the dwell for an escape hatch. ## Additional improvements (claude[bot] optional / preventive) - **6A nil-KEK contract**: added explicit "returns typed ErrKEKNotConfigured through the ErrEncryptionApply HaltApply pipeline" wording to the 6A row + rationale bullet, so 6B has a clean API contract to implement against (drop a KEKUnwrapper in, the existing path fills). Documents that ApplyRegistration is fully functional in 6A (no KEK needed); only ApplyBootstrap and ApplyRotation hit the stub. - **6D → 6B load-bearing dependency**: added a separate rationale bullet calling out that §7.1 Phase 1 step 4 implicitly triggers the §5.6 DEK bootstrap when active.storage == 0, which routes through ApplyBootstrap's KEK-unwrap. This is a non-obvious cross-cut: the nominal 6C → 6D ordering is insufficient; 6B must precede 6D regardless of 6C's status. Stating this in the rationale closes the "why not 6C, 6D, 6B?" question before it gets asked in PR review. ## Caller audit Documentation-only PR — no code paths touched, no callers to audit. ## Verification - git diff: only the design doc changed, 44 insertions / 8 deletions. - The local pre-commit hook (make lint) reported 0 issues (no code paths to lint). Plan is now ready to start 6A implementation as the next PR.

bootjp · 2026-05-16T16:48:08Z

@claude review

Round-2 (commit 1b3cabf7): addressing your round-1 verdict on cbaa364d. All three actionable findings + both optional improvements applied.

Round-1 findings addressed

#	Finding	Resolution
1	6C row misleading — `--encryption-enabled` introduced in 6A, not 6C	6C row reworded: "Extend `--encryption-enabled` (introduced in 6A) with §9.1 startup refusal guards"
2	6E row omits coordinator wrap-on-propose + §7.1 quiescence barrier	6E row extended with both pieces — `kv/coordinator.go` / `kv/sharded_coordinator.go` wrap-on-propose switch + §7.1 steps 1–6 quiescence barrier (block new intake / drain / source-tag exemption)
3	Phase-1 dwell "rollback from" wrong — Phase 1 is also one-way per §7.2/§9.1	Reworded to "dwell in and observe indefinitely" + added explicit "Phase 1 is not a rollback point" note citing §9.1 sidecar-present guard and §7.2

Optional / preventive improvements also applied

6A nil-KEK API contract — explicitly states ApplyBootstrap / ApplyRotation return typed ErrKEKNotConfigured through the existing ErrEncryptionApply HaltApply pipeline; ApplyRegistration is fully functional in 6A (no KEK needed). 6B's contract is now "drop in a KEKUnwrapper, the path fills."
6D → 6B load-bearing dependency — added rationale bullet explaining §7.1 Phase 1 step 4 implicitly triggers the §5.6 DEK bootstrap which routes through ApplyBootstrap's KEK-unwrap. The nominal 6C → 6D ordering is insufficient; 6B must precede 6D.

Verification

Doc-only — git diff shows 44 insertions / 8 deletions in the design doc, no code paths touched. make lint 0 issues (no code to lint).

claude · 2026-05-16T16:48:22Z

Claude finished @bootjp's task in 1m 43s —— View job

Round-2 review

Read trigger comment and understand round-2 scope
Read current state of design doc (commit 1b3cabf7)
Verify finding Configure Renovate #1: 6C row reworded to "extend"
Verify finding Update dependency go to v1.21.4 #2: 6E row extended with coordinator wrap-on-propose + §7.1 quiescence barrier
Verify finding Update docker/setup-qemu-action action to v3 #3: Phase-1 dwell language corrected
Verify optional: 6A nil-KEK API contract documented
Verify optional: 6D → 6B load-bearing dependency documented
Check for any new issues introduced in round-2 edits

Verdict: all five round-1 findings correctly addressed

#	Round-1 finding	Round-2 resolution	Status
1	6C row misleading — `--encryption-enabled` introduced in 6A	"Extend `--encryption-enabled` (introduced in 6A) with §9.1 startup refusal guards"	✅
2	6E row missing coordinator wrap-on-propose + §7.1 quiescence barrier	Both added: `kv/coordinator.go` / `kv/sharded_coordinator.go` wrap-on-propose switch + §7.1 steps 1–6 barrier	✅
3	Phase-1 dwell "rollback from" inconsistent with §7.2/§9.1	"dwell in and observe indefinitely" + explicit "Phase 1 is not a rollback point" paragraph citing §9.1 and §7.2	✅
4 (opt)	6A nil-KEK API contract unspecified	`ErrKEKNotConfigured` through `ErrEncryptionApply` HaltApply pipeline stated in both table row and rationale bullet; `ApplyRegistration` carve-out noted	✅
5 (opt)	6D → 6B load-bearing link not documented	6D table row + dedicated rationale bullet: §7.1 Phase 1 step 4 → `ApplyBootstrap` KEK-unwrap → `ErrKEKNotConfigured` halt without 6B	✅

One new issue introduced in round-2

Counting error in the 6F rationale bullet (docs/design/2026_04_29_partial_data_at_rest_encryption.md, line ~133 in the current file):

"6F lands the convenience flag after the four load-bearing PRs (6A–6E) are in."

6A, 6B, 6C, 6D, 6E is five PRs, not four. The range and the count contradict each other. Suggested fix:

6F lands the convenience flag after the five load-bearing PRs (6A–6E) are in.

This is the only defect in the round-2 edits. Everything else — the milestone table, the rationale section, the Phase-1 dwell language — is internally consistent and accurate.

Fix counting error →

Gemini-code-assist suggestion (6E row + zero-downtime)

The automated review asked for a §6.3 reference in the 6E scope — that reference is now present inline as "(Phase-2 leader-side §6.3 proposal-payload wrap)". The zero-downtime concern is addressed by the quiescence barrier (§7.1 steps 1–6, now in the 6E row) and rolling-upgrade compatibility is gated by the Voters ∪ Learners capability check in 6D — both mechanisms are now named in the table. No further additions needed there.

Summary

The doc is ready to merge after the one-word fix (four → five). All load-bearing scope descriptions are accurate, the dependency chain is correctly documented, and the Phase-1 dwell invariant is now precisely stated. 6A can begin implementation against this milestone table.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1b3cabf772

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-16T16:50:09Z

+  fills the path; until that lands, any operator who
+  hand-triggers `BootstrapEncryption` or `RotateDEK` on a
+  6A-only binary with `--encryption-enabled` set hits a clean,
+  named error in the gRPC error rather than a process crash.


Avoid treating 6A KEK-missing applies as harmless errors

This section says a 6A-only cluster will return a clean gRPC error if an operator triggers BootstrapEncryption/RotateDEK before 6B, but the current FSM contract is fail-closed: kv/fsm_encryption.go wraps ApplyBootstrap/ApplyRotation errors with ErrEncryptionApply, and internal/raftengine/etcd/engine.go halts apply without advancing setApplied when HaltApply() is non-nil. If such an entry is Raft-committed, replicas can repeatedly halt on that index after restart, so this is an availability-bricking path rather than a recoverable request error.

Useful? React with 👍 / 👎.

Two issues on round-2 (commit 1b3cabf): ## codex P1: 6A re-wiring mutators creates the same cluster-halt class as PR #760 Round-2 wrote: "any operator who hand-triggers BootstrapEncryption or RotateDEK on a 6A-only binary with --encryption-enabled set hits a clean, named error in the gRPC error rather than a process crash." Codex correctly flagged this as wrong: - The FSM contract is fail-closed: kv/fsm_encryption.go wraps ApplyBootstrap/ApplyRotation errors with ErrEncryptionApply. - raftengine/etcd/engine.go halts apply without advancing setApplied when HaltApply() is non-nil. - So a 6A-only mutator RPC: handler proposes, entry commits, every replica's FSM apply returns ErrKEKNotConfigured, HaltApply fires, apply loop halts. Restart replays the same entry → halts again. Cluster bricks on that index. This is the same class of failure as PR #760 round-3 codex P1. The fix for #760 was "don't wire mutators until applier exists"; the principled extension is "don't wire mutators until applier AND KEK are both present." ## Fix: revise 6A/6B split - **6A revised**: lands the §6.3 EncryptionApplier as INTERNAL scaffolding only. WithEncryption(applier) is wired at FSM construction so the dispatch path exists, but registerEncryptionAdminServer carries forward Stage 5D's posture — Proposer + LeaderView stay unwired, mutating RPCs continue to refuse with FailedPrecondition. No flag introduced. The applier is reachable only through direct FSM.Apply calls in unit tests. 6A's value is unblocking 6B/6D by providing a real applier to bind KEK and cutover work against. - **6B revised**: lands KEK plumbing AND introduces the --encryption-enabled flag AND re-enables the mutator wiring gated on (--encryption-enabled AND KEKConfigured()). Both conditions must hold, so a binary with the flag set but no KEK source loaded continues to refuse mutators at the RPC boundary — preventing the cluster-halt path codex identified. - **6C revised**: now extends the flag "introduced in 6B" (not 6A) since the introduction moved to 6B. - **6D rationale rewritten**: the previously-special "6D depends on 6B because §7.1 Phase 1 step 4 implicitly triggers ApplyBootstrap" dependency is now a special case of the general "6B gates all mutator wiring" rule. Both 6D and 6E depend on 6B unconditionally — 6B is what makes any EncryptionAdmin mutator reachable. The 6D row drops the long inline dependency note since the rationale section covers it. The defense-in-depth ErrKEKNotConfigured marker in the applier is retained — a future refactor that bypasses the 6B RPC gate should produce a named grep-able error in logs rather than a nil pointer dereference. The marker is no longer the production safety boundary (the RPC gate is); it is a belt-and-suspenders backstop. ## claude[bot] round-2 nit: "four" → "five" The 6F rationale bullet said "after the four load-bearing PRs (6A–6E)". The range 6A–6E is five PRs. Fixed. ## Caller audit Doc-only — no code paths touched, no callers. ## Verification - git diff scope: design doc only, 81 insertions / 53 deletions. - make lint: 0 issues (no code to lint).

bootjp · 2026-05-16T16:54:33Z

@claude review

Round-3 (commit a5754f60): addressing codex round-2 P1 (cluster-halt class regression in 6A's plan) + claude[bot] round-2 four→five count.

codex round-2 P1

Codex flagged round-2 wording "any operator who hand-triggers BootstrapEncryption or RotateDEK on a 6A-only binary with --encryption-enabled set hits a clean, named error in the gRPC error rather than a process crash" as wrong:

the current FSM contract is fail-closed: kv/fsm_encryption.go wraps ApplyBootstrap/ApplyRotation errors with ErrEncryptionApply, and internal/raftengine/etcd/engine.go halts apply without advancing setApplied when HaltApply() is non-nil. If such an entry is Raft-committed, replicas can repeatedly halt on that index after restart, so this is an availability-bricking path rather than a recoverable request error.

This is the same cluster-halt class as PR #760 r3 codex P1.

Fix: revise 6A/6B split

6A lands the §6.3 EncryptionApplier as internal scaffolding only. WithEncryption(applier) is wired at FSM construction; registerEncryptionAdminServer carries forward Stage 5D's posture — Proposer + LeaderView stay unwired, mutators refuse with FailedPrecondition. No flag introduced. Reachable only via direct FSM.Apply unit tests.
6B lands KEK plumbing AND introduces --encryption-enabled AND re-enables mutator wiring gated on (--encryption-enabled AND KEKConfigured()). Both conditions must hold, so a binary with the flag set but no KEK source loaded continues to refuse mutators at the RPC boundary.
6C now extends the flag "introduced in 6B" (not 6A).
6D rationale rewritten — the special-case 6D→6B dependency is now a case of the general "6B gates all mutator wiring" rule. Both 6D and 6E depend on 6B unconditionally.

The defense-in-depth ErrKEKNotConfigured marker in the applier is retained as a belt-and-suspenders backstop in case a future refactor bypasses the 6B RPC gate.

claude[bot] round-2 nit

four → five (6A–6E is five PRs). Fixed.

Caller audit

Doc-only — no code paths touched.

Verification

git diff: 81 ins / 53 del, design doc only.
make lint: 0 issues.

claude · 2026-05-16T16:54:48Z

Claude finished @bootjp's task in 3m 1s —— View job

Round-3 review

Read trigger comment and understand round-3 scope
Read current state of design doc (commit a5754f60)
Verify codex P1 fix: 6A/6B split revision (cluster-halt prevention)
Verify round-2 nit: four→five count
Verify 6C now references "introduced in 6B" not "6A"
Verify 6D rationale rewrite
Check for any new issues introduced in round-3 edits

Verdict: all round-2 findings correctly addressed; one minor precision issue in the 6A rationale

#	Round finding	Round-3 resolution	Status
Codex P1	cluster-halt class if KEK absent when mutators fire	6A: scaffolding only, mutators OFF; 6B: flag + KEK + mutator gate, both conditions required	✅
Claude r2 nit	"four load-bearing PRs (6A–6E)"	"five load-bearing PRs (6A–6E)" (line 161)	✅
6C flag attribution	"introduced in 6A"	"(introduced in 6B)"	✅
6D → 6B dependency	special-cased to the ApplyBootstrap KEK path	generalized: 6B gates all mutator wiring; 6D/6E both depend on 6B unconditionally	✅

One new precision issue: "reachable only via unit tests" is not quite right

docs/design/2026_04_29_partial_data_at_rest_encryption.md, line 75–76 — the 6A rationale bullet reads:

"The applier is reachable only through direct FSM.Apply calls in unit tests; no operator-facing surface changes in 6A."

This is accurate for a same-version 6A cluster, but not in a rolling 6A→6B upgrade. WithEncryption(applier) is wired at FSM construction — the dispatch path is live. A 6A node that receives a committed 0x03/0x04 Raft entry (originating from a 6B+ sibling with --encryption-enabled set and mutators re-enabled) will call ApplyBootstrap/ApplyRotation, return ErrKEKNotConfigured, and halt at that index.

This is the correct behavior — a 6A node cannot safely apply an entry it cannot KEK-unwrap, and the fail-closed halt is what you want — but describing the applier as "only reachable via unit tests" elides the mixed-version case.

Suggested wording:

"The applier is operator-inert in a same-version 6A cluster: no 6A binary can propose 0x03/0x04 entries (mutators are gated OFF), so the FSM dispatch path is exercised only in direct FSM.Apply unit tests. During a rolling 6A→6B upgrade, operators must not set --encryption-enabled until all nodes are on 6B+; a 0x03/0x04 entry committed by a 6B node while a 6A replica is still in the ring will correctly halt the 6A replica with ErrKEKNotConfigured."

Fix wording →

Severity: low — the cluster's behavior is correct (fail-closed halt is right), and a 6A-only cluster has no exposure. This only matters for the rolling upgrade runbook. The 6A PR itself can carry the precise language.

One clarity note: "6C orthogonal... can ship at any point after 6B"

Line 150–151: "the actual shipping dependency is 6A → 6B → 6D → 6E with 6C orthogonal (it strictly extends --encryption-enabled guards and can ship at any point after 6B)"

This is structurally correct (6C has no code dependency on 6D/6E), but the phrase "can ship at any point after 6B" implies 6C could trail 6D. Two of 6C's guards — KEK mismatch and sidecar/raft index gap — become operationally load-bearing precisely during and after the 6D Phase-1 cutover. Running enable-storage-envelope without the KEK-mismatch guard in place means a KEK rotation between Phase 1 and Phase 2 could leave nodes that refuse to start on the next restart with no diagnosable guard. The nominal table order (6C before 6D) is correct; the "orthogonal" note in the rationale could be tightened to avoid implying the trailing-order is safe in practice.

Suggested addendum: "In practice, 6C should ship before 6D — the KEK-mismatch and sidecar/raft-index-gap guards are most load-bearing during the Phase-1 cutover window."

Severity: low — the table order already puts 6C before 6D, and this is a doc-clarity issue rather than a correctness issue.

Summary

The codex P1 fix is correct and complete. The 6A/6B split cleanly separates the FSM-apply scaffolding from the RPC-gate boolean, the double gate (--encryption-enabled AND KEKConfigured()) correctly prevents the cluster-halt class, and the ErrKEKNotConfigured defense-in-depth framing is accurate. The five-count fix and the 6C flag attribution are in. The two items above are low-severity documentation precision issues; neither blocks 6A implementation. The doc is ready to merge as-is, with the precision fixes optionally folded into the 6A implementation PR.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a5754f6027

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-16T16:56:49Z

+  them), but the actual shipping dependency is 6A → 6B → 6D
+  → 6E with 6C orthogonal (it strictly extends `--encryption-enabled`


Make 6C a prerequisite for encryption cutovers

Treating 6C as optional ordering (6A → 6B → 6D → 6E with 6C “orthogonal”) is unsafe because §9.1’s startup refusal checks are what prevent post-cutover downgrade/misconfiguration states (for example starting with a sidecar but --encryption-enabled off, or restarting from stale local_epoch state that can reissue nonces). If 6D/6E ship before those guards, operators can enable envelopes and then restart into states the design elsewhere assumes are blocked, violating the security and correctness invariants this rollout is meant to enforce.

Useful? React with 👍 / 👎.

Codex r4 P1 (line 150): the r3 rationale described 6C (§9.1 startup refusal guards) as orthogonal/optional with the shipping order 6A → 6B → 6D → 6E, but the §9.1 invariants are what prevent post-cutover restart misconfigurations. If 6D/6E ship before 6C is in the binary: - An operator who restarts a node with sidecar present but --encryption-enabled off silently reads cleartext past an envelope cutover (§9.1 sidecar-present-without-flag guard not yet wired). - A binary without local_epoch rollback / exhaustion guards can reissue a stale local_epoch and recycle a previously- used GCM nonce (§4.1 invariant violation). - KEK mismatch / sidecar-raft index gap / fsync-unsupported / node_id collision states all bypass the documented refusal. All four are security-class regressions the §9.1 guards exist to prevent, so 6C is a hard predecessor for the cutover milestones, not a separable cleanup. Update the milestone-rationale paragraph to "6A → 6B → 6C → 6D → 6E with 6C as a hard prerequisite". Add the dependency text to the 6D and 6E rows of the milestone table. The 6D-implicit-bootstrap rationale (§7.1 Phase 1 step 4 routing through ApplyBootstrap's KEK-unwrap) stays correct, and 6C is added as a second hard predecessor alongside 6B rather than replacing it. Proposal-only — no code in this PR.

bootjp · 2026-05-16T17:01:22Z

@claude review

Round 4 — Codex P1 (line 150): r3 described 6C (§9.1 startup refusal guards) as orthogonal/optional with order 6A → 6B → 6D → 6E, but §9.1 invariants prevent post-cutover restart misconfigurations. Without 6C in the binary:

sidecar-present-without-flag → silently read cleartext past envelope cutover
stale local_epoch → recycle GCM nonce (§4.1 invariant violation)
KEK mismatch / index gap / fsync-unsupported / node_id collision bypass refusal

All four are security-class regressions. 6C is now a hard prerequisite for 6D/6E, not orthogonal.

Updated rationale paragraph: "6A → 6B → 6C → 6D → 6E with 6C as a hard prerequisite". Milestone table rows for 6D / 6E both call out the 6C dependency explicitly.

Proposal-only.

claude · 2026-05-16T17:01:34Z

Claude finished @bootjp's task in 1m 21s —— View job

Round-4 review

Read trigger comment and understand round-4 scope
Read current state of design doc (commit abd68cb1)
Verify Codex P1 fix: 6C elevated to hard prerequisite for 6D/6E
Verify 6D/6E rows explicitly call out 6C dependency
Verify updated rationale paragraph ordering
Check for any new issues introduced in round-4 edits

Verdict: Codex P1 correctly addressed; no new issues introduced

#	Finding	Round-4 resolution	Status
Codex r4 P1	6C treated as "orthogonal" — §9.1 invariants violated if 6D/6E ship without it	Ordering rewritten: "6A → 6B → 6C → 6D → 6E with 6C as a hard prerequisite, NOT optional/orthogonal"	✅
6D row	Only listed 6B dependency	"depends on 6B for mutator wiring AND 6C for §9.1 startup-refusal guards"	✅
6E row	Only listed quiescence barrier; no 6C mention	"Depends on 6B for mutator wiring AND 6C for §9.1 startup-refusal guards" appended	✅
Rationale bullet	"6C orthogonal, can ship at any point after 6B"	Removed; replaced with hard-prerequisite explanation including the two specific security regressions	✅
Old "special case" framing	6D-implicit-bootstrap rationale treated as the dependency reason	Correctly preserved as a second hard predecessor ("6C-gate adds a second hard predecessor rather than replacing the 6B one")	✅

The Round 4 diff is clean: the "6D and 6E both depend on 6B unconditionally" paragraph is replaced with a full security-class explanation of why 6C is a hard prerequisite — sidecar-present-without-flag silently reads cleartext past an envelope cutover, and a stale local_epoch recycles a GCM nonce (§4.1 invariant violation). Both regressions are exactly the violations Codex r4 P1 named.

One carry-over note from R3 (pre-existing, low severity)

The 6A table row and rationale still read "reachable only through direct FSM.Apply calls in unit tests; no operator-facing surface changes in 6A." As R3 flagged, this is accurate for a same-version 6A cluster but elides the mixed-version case: a 6A replica that receives a Raft-committed 0x03/0x04 entry from a 6B+ sibling will correctly halt with ErrKEKNotConfigured — fail-closed behavior is correct, but the description understates the surface. This was already noted in R3 as "the 6A PR itself can carry the precise language." Still the right call; no change needed in this doc PR.

Summary

The Codex P1 fix is correct and complete. The dependency chain 6A → 6B → 6C → 6D → 6E is now accurately documented in both the milestone table (6D and 6E rows) and the rationale section, with the security rationale for the 6C constraint spelled out. No new defects introduced in Round 4. The doc is ready to merge.

…n-entry caveat claude[bot] round-3 low-severity (line 75–76): the r3 6A bullet described the applier as 'reachable only through direct FSM.Apply calls in unit tests', which is accurate for a same-version 6A cluster but elides the rolling 6A→6B upgrade case. Because WithEncryption(applier) is wired in 6A, a 6A node co-resident with a 6B+ node will dispatch any 0x03/0x04 entry the 6B leader commits to its (KEK-less) applier and halt with ErrKEKNotConfigured. The halt itself is fail-closed by design — a 6A replica cannot safely apply an entry it cannot KEK-unwrap — but operators must not set --encryption-enabled on a 6B+ node until every Raft member (voters AND learners, same Voters ∪ Learners gate as §7.1) is on 6B+. This commit: - Reworded the 6A row to acknowledge the caveat and point at the rationale section for the full story. - Reworked the 6A rationale bullet to: - Replace 'reachable only through... unit tests' with 'operator-inert in a same-version 6A cluster (exercised only via direct FSM.Apply unit tests)' — preserves the same-version meaning without claiming the cross-version case is unreachable. - Add a new 'Rolling 6A→6B upgrade caveat' paragraph naming the Voters ∪ Learners gate and pointing forward at 6B's runbook. Doc-only — no code paths touched, no callers.

bootjp · 2026-05-16T17:03:32Z

@claude review

Round-4: 2 commits — user's parallel abd68cb1 addressing codex r4 P1 (6C as hard prereq) + my 4b9a186e addressing your r3 low-severity rolling-upgrade note.

codex r4 P1 (user's commit `abd68cb1`)

User addressed the same finding I had locally — "6C orthogonal" was unsafe per §9.1 invariants. Their commit:

Updated 6D + 6E rows to add "AND 6C for §9.1 startup-refusal guards"
Rewrote the 6D/6E rationale bullet to make 6C unconditional with two concrete examples (sidecar-present-without-flag silent downgrade + local_epoch rollback nonce-reuse).

I reset my local divergent commit and accepted theirs as canonical, since their wording is more direct.

claude[bot] r3 low-severity (my commit `4b9a186e`)

Your r3 verdict flagged that the 6A bullet's "reachable only through direct FSM.Apply calls in unit tests" wording elides the rolling 6A→6B upgrade case — a 6A node will dispatch a 6B-committed 0x03/0x04 entry to its KEK-less applier and halt with ErrKEKNotConfigured.

Applied your suggested rewording verbatim:

6A row: now reads "operator-inert in a same-version 6A cluster (exercised only via direct FSM.Apply unit tests). Rolling 6A→6B upgrade has a halt-on-foreign-entry caveat — see rationale"
6A rationale: added a new "Rolling 6A→6B upgrade caveat" paragraph naming the Voters ∪ Learners gate and pointing forward at 6B's runbook for the operator constraint.

Caller audit

Doc-only — no code paths touched.

Verification

git diff (vs prior round-3 HEAD a5754f6): 76 ins / 25 del across both commits, design doc only.
make lint: 0 issues (no code to lint).

claude · 2026-05-16T17:03:43Z

Claude finished @bootjp's task in 2m 36s —— View job

Round-5 review

Read trigger comment and understand round-5 scope
Read current state of design doc (commits abd68cb1 + 4b9a186e)
Verify codex r4 P1 fix: 6C as hard prerequisite, 6D/6E rows updated
Verify r3 low-severity fix: 6A rolling upgrade caveat added
Check for any new issues introduced in round-5 edits

Verdict: all round-4/r4 findings correctly addressed; no new defects

#	Finding	Round-5 resolution	Status
Codex r4 P1	"6C orthogonal" — §9.1 startup guards bypass after 6D/6E is a security-class regression	6D row: "depends on 6B... AND 6C for §9.1 startup-refusal guards"; 6E row: same; rationale rewritten as "6C as a hard prerequisite, NOT optional/orthogonal" with two concrete examples	✅
Codex r4 P1 examples	Two specific failure modes should be named	sidecar-present-without-flag → silent cleartext read past envelope cutover; stale `local_epoch` → GCM nonce recycle	✅
Claude r3 low-severity	6A "only reachable via unit tests" elides rolling upgrade case	6A row: "Rolling 6A→6B upgrade has a halt-on-foreign-entry caveat — see rationale"; 6A rationale: new paragraph with Voters ∪ Learners gate, operator constraint, 6B runbook forward-pointer	✅
Operator constraint precision	Must hold: every node in every Raft group on 6B+ before enabling flag	"must not set `--encryption-enabled` on a 6B+ node until every member of every Raft group (voters AND learners)" stated	✅

One pre-existing nuance in the 6C rationale bullet (not introduced in round-5)

The 6C rationale bullet reads: "They are independent of the FSM apply path and the cutover RPCs, so they land in their own PR."

This sentence correctly explains why 6C is a separate PR (code-level independence). But read alongside the subsequent bullet ("6C as a hard prerequisite, NOT optional/orthogonal"), it could leave a reader uncertain whether "independent of the cutover RPCs" means "no ordering constraint relative to 6D/6E." The code is independent; the safety ordering is not.

This pre-dates round-5 and was the root cause of the "orthogonal" reading that codex r4 P1 corrected. The subsequent rationale bullet already contains the authoritative statement. Whether to patch the 6C bullet itself or leave the adjacent bullet as the authoritative fix is judgment call — I'd lean toward a one-sentence addendum ("independent in implementation; ordered before 6D/6E for safety — see the next bullet"), but it is not blocking.

Severity: very low / cosmetic — the hard-prerequisite language in the next bullet is unambiguous, and the confusion it risks would have to survive reading two adjacent paragraphs.

No new defects in the round-5 commits

Both commits are precise and narrow:

abd68cb1: 6D/6E table rows extended with the 6C dependency clause; rationale bullet rewrites the "6C orthogonal" paragraph as a named prerequisite with security-class failure examples. The two examples (sidecar-present-without-flag, stale local_epoch) are correctly drawn from §9.1 and §4.1 respectively.
4b9a186e: 6A table row shortened to a caveat pointer; 6A rationale acquires a new "Rolling 6A→6B upgrade caveat" paragraph. The constraint is stated with the correct scope (voters AND learners, per §7.1 principle), the fail-closed rationale is stated, and the 6B runbook forward-pointer closes the loop.

The five-count (6A–6E = five PRs) remains correct. The 6C flag attribution ("introduced in 6B") remains correct. All round-1 through round-4 findings remain resolved.

The doc is ready to merge. 6A implementation can begin against this milestone table.

… of cutovers' claude[bot] r5 cosmetic note (judgment call, not blocking): the 6C rationale bullet read 'They are independent of the FSM apply path and the cutover RPCs, so they land in their own PR.' That 'independent of the cutover RPCs' phrase, read on its own, could imply 'no ordering constraint relative to 6D/6E' — which is exactly the wrong reading codex r4 P1 corrected. The adjacent bullet contains the authoritative hard-prerequisite statement, but a one-sentence addendum here removes the possible confusion before the reader gets to it. Fix: clarify that the independence is at the implementation level only — the safety ordering is 6C before 6D/6E — and point forward at the next bullet for the security rationale. Doc-only — no code paths touched, no callers.

bootjp · 2026-05-16T17:09:40Z

@claude review

Round-5 (commit 9c3dcea8): addressing the one cosmetic nuance from your r5 verdict.

r5 cosmetic note

Your r5 verdict said the doc was ready to merge but flagged this 6C rationale sentence as a judgment-call clarification:

"They are independent of the FSM apply path and the cutover RPCs, so they land in their own PR."

The phrase "independent of the cutover RPCs" — read on its own — could imply "no ordering constraint relative to 6D/6E", which is precisely the wrong reading codex r4 P1 corrected.

Fix

Reworded to clarify the independence is implementation-level only, and explicitly point forward at the next bullet for the safety ordering:

"The guard code is independent of the FSM apply path and the
cutover RPCs (it runs at startup, before either has a chance
to fire), so 6C lands in its own PR. The independence is at
the implementation level only — the safety ordering is 6C
before 6D/6E, see the next bullet for why."

Removes the possible misreading before it reaches the reader.

Caller audit

Doc-only — no code paths touched.

Verification

git diff: 6 insertions / 3 deletions, design doc only.
make lint: 0 issues.

claude · 2026-05-16T17:09:54Z

Claude finished @bootjp's task in 51s —— View job

Round-6 review

Read trigger comment and understand round-6 scope
Read current state of design doc (commit 9c3dcea8)
Verify r5 cosmetic fix: 6C rationale sentence reworded
Check for any new issues introduced in round-6 edits

Verdict: r5 cosmetic fix correctly applied; no new issues

#	Finding	Round-6 resolution	Status
Claude r5 cosmetic	"independent of the cutover RPCs" implied no ordering constraint vs 6D/6E	Reworded with implementation-vs-safety-ordering distinction + forward pointer to the next bullet	✅

The diff is exactly 6 insertions / 3 deletions in the 6C rationale bullet, as stated. The new text is strictly better than my r5 suggestion — the parenthetical "(it runs at startup, before either has a chance to fire)" I did not specify but is a concrete mechanistic explanation for why 6C's code is decoupled from the cutover RPCs. A reader who wonders "what does independent mean here?" gets an immediate answer without having to infer it from the FSM/RPC structure.

The two-sentence structure now reads cleanly:

Why 6C is its own PR — implementation independence (startup-path code, not FSM-apply or cutover-RPC code).
Why that independence does not relax the ordering — bolded constraint pointing at the next bullet for the security rationale.

The adjacent bullet ("6C as a hard prerequisite, NOT optional/orthogonal") remains the authoritative statement of the safety ordering. The r5 fix now makes the 6C bullet self-consistent rather than relying on the reader to synthesize across two paragraphs.

No carry-over issues remain. The cumulative state of all six rounds:

Round	Key finding	Status
r1	6C row misleading; 6E missing coordinator wrap-on-propose + quiescence barrier; Phase-1 "rollback" wrong	All resolved in r2
r2	Counting error: four→five (6A–6E)	Resolved in r3
r3	6A "only via unit tests" elides rolling upgrade case	Resolved in r4
r4	6C orthogonal/optional is a security-class safety violation; 6C must be a hard prereq for 6D/6E	Resolved in r4 (codex P1) + r4 (claude caveat)
r5	6C independence sentence could be misread as "no ordering constraint"	Resolved in r6 (this commit)

The doc is ready to merge. 6A implementation can begin against this milestone table.

… + FSM WithEncryption wiring Second slice of Stage 6A per the PR #762 plan. Wires the §6.3 Applier (landed in part 1) into the FSM construction path with a Pebble-backed WriterRegistryStore adapter so the 0x03 RegisterEncryptionWriter dispatch path is fully functional end-to-end. Mutator-RPC wiring in registerEncryptionAdminServer remains OFF per Stage 5D's posture — re-enabled in Stage 6B alongside KEK plumbing gated on (--encryption-enabled AND KEKConfigured()). ## What this commit ships ### store/encryption_glue.go (additions) - pebbleWriterRegistry — internal adapter satisfying encryption.WriterRegistryStore. Both Get and Set hold dbMu for read only (the apply path's applyMu in kv/fsm.go is the authoritative serialiser for the §4.1 case-dispatch read-modify-write). Set uses pebble.Sync so the row is durable before FSM apply returns — a crash between FSM dispatch returning success and Pebble flushing the WAL would otherwise leave the Raft log claiming the entry applied but the registry row missing on restart, violating §4.1 case 2's monotonic-epoch invariant. - WriterRegistryFor(MVCCStore) factory — type-asserts against *pebbleStore and returns the adapter, or ErrUnsupportedStoreForWriterRegistry if the store is not the Pebble-backed implementation. Construction-time refusal so the failure surface is "binary refuses to start" rather than "FSM apply panics inside Raft loop". - ErrUnsupportedStoreForWriterRegistry — new sentinel. ### store/writer_registry_test.go (new) Three table-driven unit tests: - TestWriterRegistryFor_RoundTrip — fresh Pebble store, missing row reads as (nil, false, nil), Set writes, Get returns byte-identical value. - TestWriterRegistryFor_Overwrite — Set on existing key replaces the value; no MVCC versioning, no append. - TestWriterRegistryFor_RejectsNonPebble — non-Pebble MVCCStore stub returns ErrUnsupportedStoreForWriterRegistry at construction. ### main.go (FSM construction site, line 694) Replaces `sm := kv.NewKvFSMWithHLC(st, clock)` with the applier-wired form: reg, err := store.WriterRegistryFor(st) if err != nil { _ = st.Close() return nil, nil, errors.Wrapf(err, ...) } applier, err := encryption.NewApplier(reg) if err != nil { _ = st.Close() return nil, nil, errors.Wrapf(err, ...) } sm := kv.NewKvFSMWithHLC(st, clock, kv.WithEncryption(applier)) Both error paths Close the store before returning so a construction-time failure doesn't leak the Pebble handle. Adds `internal/encryption` to the import list (the only new import in main.go). ## What this commit does NOT do - registerEncryptionAdminServer is unchanged. Mutating EncryptionAdmin RPCs continue to refuse with FailedPrecondition exactly as Stage 5D left them. The existing TestEncryptionAdmin_MutatingRPCRefusedUntilStage6 test (both `sidecar_empty` and `sidecar_set` subcases) passes unchanged. - The applier's ApplyBootstrap / ApplyRotation paths still return ErrKEKNotConfigured. Stage 6B re-enables mutator wiring AND threads in the KEK unwrapper at the same time so a re-enabled mutator never hits the ErrKEKNotConfigured halt path. ## Stage 5D safety boundary The two PR760-merged regression tests pass on this branch: - TestEncryptionAdminFullNodeID_DistinctPerRaftId — PASS - TestEncryptionAdmin_MutatingRPCRefusedUntilStage6/sidecar_empty — PASS - TestEncryptionAdmin_MutatingRPCRefusedUntilStage6/sidecar_set — PASS - TestRegisterEncryptionAdminServer_Registers — PASS The new FSM wiring path does not weaken any of these. The applier exists internally; the gRPC layer continues to refuse the operator-facing mutators that would route into it. ## Five-lens self-review 1. Data loss — pebbleWriterRegistry uses pebble.Sync; FSM apply only returns success after the row is durable. Construction-time error paths Close the Pebble store on failure (no handle leak). 2. Concurrency / distributed failures — dbMu.RLock is sufficient because the apply path's applyMu in kv/fsm.go serialises all 0x03 dispatches; the read- modify-write pattern in Applier.ApplyRegistration is safe under that lock. 3. Performance — one Get + at most one Set per ApplyRegistration. pebble.Sync adds wal-flush cost, bounded. 4. Data consistency — writer-registry rows live outside the MVCC namespace (no commit timestamps, no MVCC visibility). Pebble Sync semantics match the §4.1 contract that a registered writer's first nonce-issuing write under that DEK is durably preceded by its registry row. 5. Test coverage — 3 new store-package unit tests for the adapter; the part-1 applier tests already cover the applier-side dispatch; Stage 5D regression tests confirm no operator-surface change. ## Verification - go test -race -timeout=60s -run 'TestWriterRegistryFor' ./store/... — PASS - go test -race -timeout=60s -run 'TestApply|TestNewApplier' ./internal/encryption/... — PASS - go test -race -timeout=60s -run 'TestEncryptionAdmin|TestRegisterEncryptionAdminServer' . — PASS - go test -race -timeout=180s -run 'TestEncrypt|TestApplier|TestFSM|TestKvFSM|TestWriterRegistry' ./... — PASS - go build ./... — PASS - golangci-lint run ./... — 0 issues on touched files The full adapter test suite has a pre-existing 120s-timeout flake on race-mode (SQS reaper / HLC lease renewal long- running goroutines visible in the dump). The encryption- specific adapter tests pass cleanly when run in isolation; the flake is unrelated to this PR's changes (no adapter code is touched). Will run the suite again before marking the PR ready for review.

…ption wiring (#765) ## Summary First slice of **Stage 6A** per the [PR #762 plan](#762): the load-bearing safety boundary that lands the §6.3 `EncryptionApplier` so future PRs can re-enable mutating-RPC wiring without bricking the cluster. This PR is doc-driven by the merged Stage 6 plan; specifically the 6A row: > §6.3 `EncryptionApplier` concrete implementation (writer-registry insert + sidecar mutate + keystore update; nil-KEK code paths in `ApplyBootstrap` / `ApplyRotation` return a typed `ErrKEKNotConfigured` defense-in-depth marker until 6B fills it) + `kv.NewKvFSMWithHLC(... WithEncryption(applier))` wiring. **Mutator wiring in `registerEncryptionAdminServer` stays OFF, identical to Stage 5D**. ## What this PR ships ### Part 1 (this commit, `e2475b84`) — applier package - **`internal/encryption/applier.go`** (new) — concrete `Applier` satisfying `kv.EncryptionApplier`: - `WriterRegistryStore` interface (abstract Get/Set over the `!encryption|writers|<dek_id>|<uint16(node_id)>` Pebble namespace; kept separate from `store.MVCCStore` because writer-registry rows live outside MVCC). - `ApplyRegistration` implements the §4.1 four-case dispatch (insert / monotonic advance / rollback halt / uint16-collision halt). - `ApplyBootstrap` and `ApplyRotation` return the typed `ErrKEKNotConfigured` defense-in-depth marker — Stage 6B will replace them with the real KEK-unwrap + sidecar + keystore path. - **`internal/encryption/errors.go`** — add `ErrKEKNotConfigured` sentinel with godoc referencing the PR #762 safety-boundary decomposition. - **`internal/encryption/applier_test.go`** (new) — 7 unit tests covering all four §4.1 cases, construction-time nil-registry refusal, store-error propagation, and the 6B-deferred stub contracts. Uses a map-backed `WriterRegistryStore` fake. ### Part 2 (follow-up commit in this PR) — FSM wiring - Concrete `WriterRegistryStore` implementation backed by the FSM's Pebble store (`store/encryption_glue.go` or new file). - `kv.NewKvFSMWithHLC(... kv.WithEncryption(applier))` wiring at `main.go:694`. The Part 1 commit compiles and tests pass on its own; Part 2 plumbs it into production. ## What this PR does NOT do `registerEncryptionAdminServer` is **untouched**. Mutating EncryptionAdmin RPCs continue to refuse with `FailedPrecondition` exactly as Stage 5D left them. The Stage 6 plan's safety invariant — do not re-enable mutator wiring until both the applier AND a KEK unwrapper are present — holds: this PR lands the applier (and the FSM dispatch is no longer a halt path for `ApplyRegistration`), but the KEK plumbing (6B) is still missing. 6B will re-enable the mutator wiring gated on (`--encryption-enabled` AND `KEKConfigured()`), both conditions required. ## Caller audit - `kv.EncryptionApplier` interface — implemented by `*encryption.Applier`. No other callers; the FSM dispatch is the sole consumer. - `internal/encryption/errors.go` exported sentinels — `ErrKEKNotConfigured` is new; no callers yet outside `applier.go` (the kv/fsm dispatch will wrap any non-nil return with `ErrEncryptionApply` regardless of the inner marker). ## Five-lens self-review 1. **Data loss** — applier's writes go through `WriterRegistryStore.SetRegistryRow`, which the production wiring (Part 2) routes through `pebble.Sync`. No silent skip path: every error return is propagated; the `kv/fsm` dispatch marks them with `ErrEncryptionApply` for HaltApply. 2. **Concurrency / distributed failures** — Applier carries no in-memory state between calls; the `WriterRegistryStore` is the single source of truth. FSM apply already serialises `0x03` entries via `applyMu`, so the §4.1 case-dispatch read-modify-write is safe under that lock. 3. **Performance** — `ApplyRegistration` does one Get + at most one Set per call. Production Pebble will add the sync cost; bounded. 4. **Data consistency** — §4.1 case 1/2/3/4 dispatch matches the design doc. Case 3 (rollback) halts apply; the 6C startup guard will prevent it from being reachable in production. Case 4 (uint16 collision) similarly halts and is the per-node backstop for §6.1's cluster-wide uniqueness check. 5. **Test coverage** — 7 unit tests covering all four §4.1 cases, construction refusal, store-error propagation, and the 6B-deferred stub contracts. Map-backed fake mirrors the production semantics. ## Test plan - [x] `go test -race -timeout=60s ./internal/encryption/...` — PASS - [x] `go build ./...` — PASS - [x] `golangci-lint run ./internal/encryption/...` — 0 issues - [ ] After Part 2 commit: `go test -race ./...` (full suite) — Will run before requesting review - [ ] After Part 2 commit: Re-run jepsen-relevant suite (none of 6A's behaviour is operator-visible, so no envelope-cutover Jepsen workload applies; documented in PR description) ## Plan: 6A → 6B → 6C → 6D → 6E → 6F This is the first of six PRs per [PR #762](#762). 6B will follow with the KEK plumbing + mutator gate.  ## Summary by CodeRabbit * **New Features** * Integrated encryption into core system operations with enhanced writer registration and persistence. * Added encryption configuration validation and detailed error reporting for setup issues. * **Tests** * Comprehensive test coverage for encryption functionality, data storage, and integration scenarios. * Verification tests for encryption configuration and startup validation. * **Chores** * Enhanced application startup sequence with encryption support integration.  [![Review Change Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/bootjp/elastickv/pull/765?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack)

…on with KEK plumbing (#768) ## Summary Stage 6B part 1 per the [PR #762 plan](#762). Lands the real `ApplyBootstrap` and `ApplyRotation` paths in `internal/encryption.Applier` — KEK-unwrap + Keystore install + crash-durable sidecar mutate + writer-registry batch insert. Replaces the Stage 6A `ErrKEKNotConfigured` stubs. Mutator-RPC wiring in `registerEncryptionAdminServer` stays OFF (Stage 5D posture) so the operator surface is unchanged. The follow-up **6B-2** PR threads `--kekFile` / `--encryption-enabled` flags + startup KEK loader + keystore re-populate + re-enables mutator wiring gated on `(--encryption-enabled AND KEKConfigured())`. ## API additions - **`KEKUnwrapper` interface** — local declaration so the encryption package does not pick up a transitive dependency on the kek package. `kek.Wrapper` satisfies it structurally. - **`ApplierOption` functional-options**: `WithKEK`, `WithKeystore`, `WithSidecarPath`, `WithNowFunc`. The Stage 6A caller `NewApplier(reg)` continues to work byte-for-byte. ## Real ApplyBootstrap (§5.6 step 1a) 1. KEK-unwrap storage + raft wrapped DEKs. 2. `Keystore.Set` both (idempotent on matching bytes; `ErrKeyConflict` on mismatch → halts). 3. `WriteSidecar` with `Active.{Storage,Raft}` set + `keys[]` map updated; crash-durable via §5.1 atomic write. 4. Batch-insert every `BatchRegistry` row via `ApplyRegistration` (§4.1 case-2-idempotent makes replays no-op). Returns `ErrKEKNotConfigured` if any of `WithKEK` / `WithKeystore` / `WithSidecarPath` is missing — the three are an indivisible trio. **Crash-recovery ordering**: `Keystore.Set` (cheap, in-memory) → `WriteSidecar` (durable IO) → batch insert. Recovery is Raft replaying the same entry; every step is idempotent for matching inputs. ## Real ApplyRotation (§5.2 RotateDEK) Handles `RotateSubRotateDEK` only; other sub-tags (rewrap-deks, retire-dek, enable-storage-envelope, enable-raft-envelope) halt with `ErrEncryptionApply` so 6B-1 cannot silently mis-apply later sub-tags. For `RotateSubRotateDEK`: 1. KEK-unwrap the proposed wrapped DEK. 2. `Keystore.Set` at the new DEK ID. 3. `WriteSidecar`: Active slot updated for `Purpose` (Storage or Raft only — unknown Purpose halts); keys[] map updated; old DEKs preserved (retire is a separate Stage 6E sub-tag). 4. `ApplyRegistration` on the proposer's row. ## Tests 8 new applier tests covering happy paths + fail-closed boundaries: | Test | What it pins | |---|---| | `TestApplyBootstrap_ReturnsKEKNotConfigured` | No-options default returns `ErrKEKNotConfigured` | | `TestApplyBootstrap_FullyWired` | Happy path; factor'd per-aspect assertions | | `TestApplyBootstrap_PartialWiring` (4 sub-cases) | All-or-nothing trio invariant | | `TestApplyBootstrap_KEKUnwrapFailure` | Synthetic Unwrap error propagated; no partial install | | `TestApplyBootstrap_Replay_Idempotent` | Same payload twice succeeds | | `TestApplyRotation_ReturnsKEKNotConfigured` | No-options default | | `TestApplyRotation_FullyWired` | Happy path with pre-bootstrap | | `TestApplyRotation_UnknownSubTag` | Halts on `0xFE` sub-tag | | `TestApplyRotation_UnknownPurpose` | Halts on unrecognised Purpose enum | Tests use a `fakeKEK` that pads/truncates wrapped input to `encryption.KeySize` so `Keystore.Set`'s size check passes; same wrapped input always produces the same DEK bytes (deterministic), exercising idempotency. ## Caller audit (semantic change — `NewApplier` signature) `NewApplier(registry)` → `NewApplier(registry, opts...)`. Variadic; the no-options call (Stage 6A test caller) still compiles and behaves identically. Audited: ``` grep -rn 'NewApplier(' --include='*.go' ``` - `main.go:711` — production caller; unchanged in this PR (still passes no opts). 6B-2 will add `WithKEK + WithKeystore + WithSidecarPath`. - `internal/encryption/applier_test.go` — test callers; existing `NewApplier(newMapRegistryStore())` calls still work; new tests pass opts. No other callers. ## Five-lens self-review 1. **Data loss** — `WriteSidecar` uses the existing §5.1 crash-durable write protocol. `Keystore.Set` is in-memory only but the FSM apply path is replayed from Raft on restart. No silent skip: every error propagates through the HaltApply seam. 2. **Concurrency / distributed failures** — Applier carries no per-call mutable state. FSM apply already serialises `0x03/0x04/0x05` dispatches under `applyMu`. Keystore is internally locked. Sidecar IO is single-writer per §5.1. 3. **Performance** — `ApplyBootstrap`: 2 KEK unwraps + 2 `Keystore.Set` + 1 `WriteSidecar` + N writer-registry inserts (N = bootstrap-time cluster size). `ApplyRotation`: 1+1+1+1. Bounded. 4. **Data consistency** — sidecar mutations crash-durable; install order deterministic on replay; case-2-idempotent makes registry replay safe. 5. **Test coverage** — 8 unit tests; both happy paths and fail-closed boundaries exercised. The fakeKEK is deterministic so replay tests are reproducible. ## Test plan - [x] `go test -race -timeout=60s ./internal/encryption/...` — PASS - [x] `go build ./...` — PASS - [x] `golangci-lint run ./internal/encryption/...` — 0 issues - [x] Existing kv + store + main encryption tests — PASS - [ ] Re-run full suite (kv + store + adapter + main + Jepsen) once 6B-2 wires it in production — Will follow with 6B-2 ## Plan This is **Stage 6B part 1**. **Stage 6B-2** will: - Add `--kekFile` / `--kekUri` flags - Add `--encryption-enabled` flag (default off) - Load KEK at startup, populate Keystore from sidecar - Re-enable Proposer + LeaderView in `registerEncryptionAdminServer` gated on (`--encryption-enabled` AND `KEKConfigured()`)  ## Summary by CodeRabbit * **New Features** * Encryption key management bootstrap and rotation operations are now fully operational, including proper key unwrapping, secure keystore installation, and configurable options for encryption setup. * **Tests** * Comprehensive test coverage added for bootstrap and rotation scenarios, including idempotency validation, failure handling, and configuration verification.  [![Review Change Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/bootjp/elastickv/pull/768?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack)

…e scope limitation codex r2 P1 on round-2 (commit 0f946c1): the triple-gate encryptionMutatorsEnabled() is purely node-local; during a staged rolling upgrade where some peers still run the pre-6B-2 binary or have the flags unset, a mutator proposal committed by a fully-configured leader would reach those replicas' appliers and HaltApply on the under-configured side. ## Resolution This is a real concern but the cluster-wide readiness check belongs at the §7.1 Phase-1 cutover RPC layer (enable-storage-envelope) in Stage 6D, not in the node-local flag gate. PR #762's milestone breakdown explicitly assigns the Voters ∪ Learners capability fan-out gate to Stage 6D. Adding a partial cluster-wide check in Stage 6B-2 would duplicate the §7.1 capability gate and ship before the operator-facing GetCapability fan-out is wired into the admin CLI. The right resolution is to document the scope limitation explicitly so: 1. Operators reading the code understand the gate's scope and the rolling-upgrade discipline expected at this layer (same constraint as PR #765's 6A→6B caveat). 2. The Stage 6D PR has a clear handoff point — the cluster-wide gate is its scope, and this docstring points forward to that. ## Doc-only No code paths changed; the gate's three-flag readback is unchanged. Single 30-line docstring expansion explaining: - the gate is THIS NODE local - what cluster-wide readiness means at §7.1 - why Stage 6B-2 deliberately defers it to Stage 6D - the operator-discipline constraint matching PR #765 ## Caller audit Pure doc edit. No callers to audit. ## Verification - go build ./... — PASS - go test -race -timeout=60s -run 'TestEncryptionAdmin|TestRegisterEncryptionAdminServer' . — PASS - golangci-lint run ./... — 0 issues on touched files

gemini-code-assist Bot reviewed May 16, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed May 16, 2026

View reviewed changes

bootjp merged commit 6748aca into main May 16, 2026
9 of 10 checks passed

bootjp deleted the feat/encryption-stage6 branch May 16, 2026 17:32

bootjp mentioned this pull request May 16, 2026

feat(encryption): Stage 6A — §6.3 Applier scaffolding + FSM WithEncryption wiring #765

Merged

5 tasks

bootjp mentioned this pull request May 16, 2026

feat(encryption): Stage 6B part 1 — real ApplyBootstrap + ApplyRotation with KEK plumbing #768

Merged

5 tasks

bootjp mentioned this pull request May 16, 2026

feat(encryption): Stage 6B-2 — KEK plumbing + --encryption-enabled flag + mutator-RPC gate #776

Open

4 tasks

	\| 6E \| §6.6 `enable-raft-envelope` admin RPC + §7.1 Phase-2 raft cutover + `raft_envelope_cutover_index` sidecar record + `internal/raftengine/etcd/engine.go` `applyNormalEntry` unwrap hook activation + `ErrRaftUnwrapFailed` HaltApply path \| open \| — \|
	\| 6E \| §6.6 enable-raft-envelope admin RPC + §7.1 Phase-2 raft cutover + raft_envelope_cutover_index sidecar record + internal/raftengine/etcd/engine.go applyNormalEntry unwrap hook activation (§6.3) + ErrRaftUnwrapFailed HaltApply path \| open \| — \|

		them), but the actual shipping dependency is 6A → 6B → 6D
		→ 6E with 6C orthogonal (it strictly extends `--encryption-enabled`

Conversation

bootjp commented May 16, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Stage 6 breakdown

Why doc-first

Five-lens self-review

Test plan

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented May 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

bootjp commented May 16, 2026

Uh oh!

claude Bot commented May 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Stage 6 planning doc review

Q1 — Sub-milestone scope coherence

Q2 — Dependency ordering

Q3 — Phase-1 dwell invariant

Status check on 5D/5E changes

Summary

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 16, 2026

Choose a reason for hiding this comment

Uh oh!

bootjp commented May 16, 2026

Round-1 findings addressed

Optional / preventive improvements also applied

Verification

Uh oh!

claude Bot commented May 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Round-2 review

Verdict: all five round-1 findings correctly addressed

One new issue introduced in round-2

Gemini-code-assist suggestion (6E row + zero-downtime)

Summary

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 16, 2026

Choose a reason for hiding this comment

Uh oh!

bootjp commented May 16, 2026

codex round-2 P1

Fix: revise 6A/6B split

claude[bot] round-2 nit

Caller audit

Verification

Uh oh!

claude Bot commented May 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Round-3 review

Verdict: all round-2 findings correctly addressed; one minor precision issue in the 6A rationale

One new precision issue: "reachable only via unit tests" is not quite right

One clarity note: "6C orthogonal... can ship at any point after 6B"

Summary

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 16, 2026

Choose a reason for hiding this comment

Uh oh!

bootjp commented May 16, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 16, 2026 •

edited

Loading

claude Bot commented May 16, 2026 •

edited

Loading

claude Bot commented May 16, 2026 •

edited

Loading

claude Bot commented May 16, 2026 •

edited

Loading

claude Bot commented May 16, 2026 •

edited

Loading

codex r4 P1 (user's commit `abd68cb1`)

claude[bot] r3 low-severity (my commit `4b9a186e`)

claude Bot commented May 16, 2026 •

edited

Loading

claude Bot commented May 16, 2026 •

edited

Loading