Skip to content

docs(encryption): Stage 6 plan — decompose into 6A-6F + record 5D shipped#762

Merged
bootjp merged 6 commits into
mainfrom
feat/encryption-stage6
May 16, 2026
Merged

docs(encryption): Stage 6 plan — decompose into 6A-6F + record 5D shipped#762
bootjp merged 6 commits into
mainfrom
feat/encryption-stage6

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented May 16, 2026

Summary

  • Record Stage 5D (PR feat(encryption): Stage 5 PR-D — main.go gRPC wiring for EncryptionAdmin #760, merged 2026-05-16T16:35:03Z) as shipped in the milestone table.
  • Mark Stage 5E (§5.6 step 1a capability fan-out helper) as deferred — its scope is narrowly the auto-batch CLI convenience and does not gate any Stage 6 work.
  • Decompose the original Stage 6 row (3-phase rollout flags + applier wiring) into six sub-milestones 6A–6F, each independently reviewable, leaving the cluster in a production-safe state, unblocking the next.
  • Add a Stage 6 sub-milestone rationale section explaining why the decomposition follows this shape — why 6A is the load-bearing safety boundary (it lands the §6.3 applier whose absence was PR feat(encryption): Stage 5 PR-D — main.go gRPC wiring for EncryptionAdmin #760 round-3 codex P1), why 6D/6E must be separate cutovers (intermediate Phase-1 storage-only state is a valid operator dwell), and why 6F is the last ergonomics PR.

Stage 6 breakdown

Sub-milestone Scope
6A §6.3 EncryptionApplier concrete implementation (writer-registry insert + sidecar mutate + keystore update) + kv.NewKvFSMWithHLC(... WithEncryption(applier)) wiring + re-enable mutating-RPC wiring in registerEncryptionAdminServer (gated on --encryption-enabled)
6B §6.5 KEK plumbing — --kekFile / --kekUri flags + internal/encryption/kek selection at startup + applier KEK-unwrap path
6C §6.5 --encryption-enabled + §9.1 startup refusal guards (sidecar present without flag, KEK mismatch, sidecar/raft index gap, fsync support, local_epoch rollback / exhaustion, node_id collision)
6D §6.6 enable-storage-envelope admin RPC + §7.1 Phase-1 storage cutover (§6.2 toggle ON) + Voters ∪ Learners capability gate
6E §6.6 enable-raft-envelope admin RPC + §7.1 Phase-2 raft cutover + raft_envelope_cutover_index sidecar record + internal/raftengine/etcd/engine.go applyNormalEntry unwrap hook + ErrRaftUnwrapFailed HaltApply path
6F §6.5 --encryption-rotate-on-startup request flag + leader-elected rotation proposal

Why doc-first

CLAUDE.md design-doc-first workflow: "For any change that goes beyond a single-file edit ... write a *_proposed_*.md design doc first and land it before the implementation." The existing partial doc already covers Stage 6 at the high level — this PR refines the milestone table so each of 6A–6F has a clear, independently-reviewable scope before implementation begins. 6A will land as the next PR.

Five-lens self-review

  1. Data loss — n/a (doc-only).
  2. Concurrency / distributed failures — n/a (doc-only). The breakdown itself is concurrency-safe: 6A leaves the cluster in the same production-inert state PR feat(encryption): Stage 5 PR-D — main.go gRPC wiring for EncryptionAdmin #760 left it in (mutators OFF by default via the new --encryption-enabled gate).
  3. Performance — n/a (doc-only).
  4. Data consistency — n/a (doc-only). The narrative explicitly notes that the Phase-1-only intermediate state (storage encrypted, raft cleartext) is a valid operator dwell, which is the consistency invariant operators rely on for rollback.
  5. Test coverage — n/a (doc-only). Each of 6A–6F will land its own unit + property + Jepsen coverage as appropriate.

Test plan

  • No code touched — go build ./... not required.
  • No code touched — go test ./... not required.
  • No code touched — make lint not required (the local pre-commit hook ran and reported 0 issues).
  • Manually re-read the updated milestone table for internal consistency.

Summary by CodeRabbit

Release Notes

  • Documentation
    • Updated milestone progress tracking for data-at-rest encryption feature development
    • Extended stage breakdown with implementation details for stages 5D through 6F
    • Clarified stage 6 sub-milestone planning and dependencies

Review Change Stack

…-6F)

Stage 5D (PR #760) merged at 2026-05-16T16:35:03Z. Stage 5E
(§5.6 step 1a capability fan-out helper) is marked deferred —
its scope is narrowly the auto-batch CLI convenience and does
not gate any Stage 6 work. The existing CLI accepts the writer
batch explicitly today (PR #759), which is sufficient for
hand-authored bootstraps.

## Stage 6 decomposition

Stage 6 was originally one row in the milestone table ("3-phase
rollout flags + applier wiring", §6.5 + §7.1). The chain of
dependencies makes it impractical to land in a single PR.
This commit decomposes it into 6A through 6F following the
same shipping-order principle as Stage 5 (5A → 5D shipped, 5E
deferred): each milestone is independently reviewable, leaves
the cluster in a production-safe state, and unblocks the next.

- 6A — concrete EncryptionApplier (writer-registry insert +
       sidecar mutate + keystore update) + FSM WithEncryption
       wiring + re-enable Proposer/LeaderView in
       registerEncryptionAdminServer (gated on
       --encryption-enabled). The load-bearing safety
       boundary: PR #760 r3 codex P1 established that the
       mutator wiring is unsafe to land before the applier
       exists.
- 6B — KEK plumbing: --kekFile / --kekUri flags +
       internal/encryption/kek selection at startup + applier
       KEK-unwrap path.
- 6C — --encryption-enabled + §9.1 startup refusal guards
       (sidecar present without flag, KEK mismatch,
       sidecar/raft index gap, fsync support, local_epoch
       rollback / exhaustion, node_id collision).
- 6D — Phase-1 cutover: enable-storage-envelope admin RPC +
       §7.1 storage cutover + Voters ∪ Learners capability
       gate.
- 6E — Phase-2 cutover: enable-raft-envelope admin RPC +
       §7.1 raft cutover + raft_envelope_cutover_index
       sidecar record + raftengine/etcd applyNormalEntry
       unwrap hook + ErrRaftUnwrapFailed HaltApply path.
- 6F — optional --encryption-rotate-on-startup ergonomics
       flag.

The narrative section after the milestone table explains the
rationale for each sub-milestone — why 6A is load-bearing,
why 6D and 6E must be separate cutovers (intermediate
Phase-1-only state is a valid operator dwell), and why 6F is
deferred to last.

## Doc-only PR

No code changes. Reviewable on its own. Each of 6A–6F will
follow as its own PR with its own design constraints +
implementation + tests + Jepsen-relevant suites.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 16, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 31 minutes and 23 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3e51b71e-4249-4315-8e6e-038ce89c9526

📥 Commits

Reviewing files that changed from the base of the PR and between cbaa364 and 9c3dcea.

📒 Files selected for processing (1)
  • docs/design/2026_04_29_partial_data_at_rest_encryption.md
📝 Walkthrough

Walkthrough

The design document for "Data-at-rest encryption for elastickv" was updated to reflect current progress: stages 0–5 are now shipped with stage 5E deferred, and the milestone table was extended with detailed entries for stage 5D, 5E, and stage 6 sub-milestones 6A–6F. The "Stage 6 sub-milestone rationale" section was expanded to explain the dependency-ordered plan and per-sub-milestone scope.

Changes

Data-at-rest encryption design progress

Layer / File(s) Summary
Milestone progress and stage 6 breakdown
docs/design/2026_04_29_partial_data_at_rest_encryption.md
Top-level status updated to reflect stages 0–5 shipped and stage 5E deferred. Milestone table extended with stage 5D (gRPC wiring, shipped), stage 5E (deferred), and stages 6A–6F (open) with detailed implementation scope. Expanded rationale section provides dependency ordering and explains what each sub-milestone enables.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Possibly related PRs

  • bootjp/elastickv#707: Introduces the foundational encryption-at-rest design specification that this PR updates with current milestone progress.

Poem

🐰 Milestones march along the path,
Five stages shipped with steady math,
Stage six now breaks to A through F,
Each one planned with careful heft!
Progress tracked, the way ahead so clear.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically summarizes the main changes: recording Stage 5D as shipped and decomposing Stage 6 into six sub-milestones (6A-6F).
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/encryption-stage6

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 16, 2026

@claude review

Stage 6 planning PR. Doc-only — decomposes the original Stage 6 row into 6A–6F sub-milestones and records 5D shipped. Each of 6A–6F is independently reviewable; 6A is the load-bearing safety boundary (lands the §6.3 applier whose absence was PR #760 r3 codex P1).

Looking for review on:

  1. Sub-milestone scope coherence — does each row in the new table describe an independently-mergeable PR?
  2. Dependency ordering — is the 6A → 6B → 6C → 6D → 6E → 6F sequence correct, or are there cross-cuts I missed?
  3. Phase-1 dwell invariant — the rationale section claims storage-encrypted/raft-cleartext is a valid intermediate state operators can rollback from. Is that consistent with the rest of the design doc (§7.1, §9.1)?

No code paths touched.

@claude
Copy link
Copy Markdown

claude Bot commented May 16, 2026

Claude finished @bootjp's task in 3m 52s —— View job


Stage 6 planning doc review

  • Read PR context and extract review questions
  • Read the full design doc (§6.2, §6.3, §6.5, §6.6, §7.1, §7.2, §9.1)
  • Review the diff
  • Assess sub-milestone scope coherence (Q1)
  • Assess dependency ordering (Q2)
  • Assess Phase-1 dwell invariant consistency (Q3)

Q1 — Sub-milestone scope coherence

The decomposition is solid overall, but three rows have gaps worth tightening before 6A lands as the next PR.

6A: applier + flag, but no KEK unwrapper yet

6A wires WithEncryption(applier) into the FSM and re-enables mutating RPCs behind --encryption-enabled. But the applier's ApplyBootstrap and ApplyRotation paths will need to KEK-unwrap the incoming wrapped DEKs, and that unwrapper doesn't exist until 6B. The rationale correctly notes that the flag defaults off, so this can't be triggered accidentally in production — but the 6A applier's KEK path will be called with a nil unwrapper if anyone sets the flag on a 6A-only binary.

The row should specify whether the 6A applier returns a typed error (e.g., ErrKEKNotConfigured) or panics on a nil KEK. A typed error is the right answer (consistent with the rest of the design's fail-closed stance), and stating it in the row sets the API contract for 6B to implement against. Without this, 6B's "thread a KEKUnwrapper into the applier" could add it as a post-hoc patch rather than a clean interface fill.

6C: flag description implies the flag is new

The 6C row reads §6.5 --encryption-enabled + §9.1 startup refusal guards. A reader scanning the table sees --encryption-enabled first appearing in 6C, but it was already introduced in 6A as a wire-mutators gate. This is a real source of confusion when filing the 6A and 6C PRs: the 6C PR reviewer won't know whether the flag is being introduced or extended, and the 6A PR reviewer may not know that the flag's full semantics are deferred to 6C.

Suggested rewording: extend --encryption-enabled with §9.1 startup refusal guards (KEK mismatch, sidecar/raft index gap, fsync support, local_epoch rollback/exhaustion, node_id collision). The --encryption-enabled already exists in 6A; 6C adds the §9.1 guards around it.

6E: two load-bearing pieces from §6.3 and §7.1 are not listed

The 6E row covers the enable-raft-envelope RPC, the cutover-index sidecar record, the engine hook activation, and ErrRaftUnwrapFailed. But it omits two other pieces that Phase 2 requires:

  1. Coordinator-side wrap-on-propose (kv/coordinator.go / kv/sharded_coordinator.go) — §6.3 explicitly calls this out as a Phase-2 change: "wrap the proposal payload with the raft DEK before handing it to raftengine's submit path." Without the coordinator flipping its wrap-on-propose switch (step 5 of the Phase-2 quiescence sequence in §7.1), the engine-side unwrap hook fires on cleartext entries and breaks every apply after cutover. This is not in 6A (6A wires the FSM apply side, not the propose side) and is not in any other row. It belongs in 6E.

  2. Proposal quiescence barrier (§7.1 Phase 2, steps 1–6 of leader_enable_raft_envelope) — the "block new USER proposal intake, drain in-flight queue, source-tag exemption" sequence is the safety mechanism that prevents cleartext entries from being appended at indexes > raft_envelope_cutover_index while the leader is still cleartext-proposing. The §7.1 text is unambiguous: "Without the barrier ... followers would ... attempt raftDEK.Unwrap on cleartext, and fail every apply with a GCM verification error." This is load-bearing correctness, not ergonomics, and it lives in the 6E PR. The row should name it.

Fix 6C/6E row wording →


Q2 — Dependency ordering

The 6A → 6B → 6C → 6D → 6E → 6F chain is correct for the primary dependencies. One cross-cut worth documenting explicitly in the rationale:

6B → 6D is load-bearing, not just transitive

The current rationale explains 6D as "the actual cutover" and notes it follows 6C, but doesn't call out why 6D must come after 6B specifically. The reason is that §7.1 Phase 1 step 4 (enable-storage-envelope) implicitly triggers the DEK bootstrap if active.storage == 0 — and that bootstrap (ApplyBootstrap) requires the KEK-unwrap path from 6B to create and wrap the initial DEK pair. If 6D were somehow attempted after 6C but without 6B, enable-storage-envelope would panic or return a KEK-not-configured error during the bootstrap sub-step. Stating this in the rationale closes the "why not 6C, 6D, 6B?" question before it gets asked in PR review.

One potential out-of-order observation: could 6F be merged after 6D?

--encryption-rotate-on-startup only requires a working rotation proposal path, which exists after 6A (the applier and mutating RPCs are wired). The flag doesn't require Phase 1 or Phase 2 to be active — it's just a convenience trigger for rotate-dek, which could be called manually from 6A onward. Placing 6F after 6E is the safest choice (no open questions about interaction with the Phase-2 quiescence barrier), and it's the right call for a doc-first plan, but the rationale could acknowledge this flexibility if operators ask to land the ergonomics earlier.


Q3 — Phase-1 dwell invariant

This is the sharpest issue in the PR.

The claim as written is not fully consistent with §7.2

The rationale section says:

"a valid intermediate state that operators can dwell in, observe, and rollback from if needed before committing to Phase 2"

Reading §7.2 ("Why we will not support encrypted → cleartext"):

"Disabling §4.2 would require rewriting WAL entries, which etcd raft does not support. The flag is therefore one-way; the only way back is dump-and-reload."

§7.2 specifically addresses §4.2 (raft envelope). But Phase 1 has a parallel one-way characteristic that §7.2 does not call out explicitly: once enable-storage-envelope has been Raft-committed on any node, the §9.1 guard activates — every node's data dir now contains a sidecar, and §9.1 refuses to start without --encryption-enabled. The elastickv-admin encryption disable command "refuses; documents the dump-and-reload path." So Phase 1 is also one-way under normal operation.

The dwell is valid; the "rollback" framing is not

The Phase-1-only state is genuinely safe as a pause point:

  • All new writes are storage-encrypted, all Raft proposals are still cleartext.
  • A node joining in Phase 1 sees a snapshot with encryption_state = 0b01 values and cleartext Raft entries; both are handled correctly by any encryption-capable binary.
  • Phase 2 has not started, so there is no raft_envelope_cutover_index and no WAL interleaving to worry about.
  • The operator can observe encryption status, run rewrite, and verify zero cleartext versions before deciding whether to proceed to Phase 2.

What operators cannot do is revert Phase 1: the enable-storage-envelope cluster flag is a committed Raft entry, sidecar presence triggers §9.1 startup guards on every node, and encryption disable refuses. The only path back to cleartext is dump-and-reload, same as Phase 2.

Suggested fix

Change the rationale text from "operators can dwell in, observe, and rollback from" to "operators can dwell in and observe indefinitely without committing to Phase 2" — and add a sentence noting that Phase 1 itself is also one-way, consistent with §7.2. This prevents an operator reading the design doc from expecting that "staying in Phase 1" gives them a true rollback option and planning their runbook around an escape hatch that doesn't exist.

Fix Phase-1 dwell wording →


Status check on 5D/5E changes

The 5D status update (shipped, PR #760) and the status header change are accurate. The 5E deferral rationale ("does not gate any Stage 6 work") is correct — the existing explicit-batch CLI from PR #759 is sufficient for hand-authored bootstraps, and the auto-batch convenience can land after 6F without any on-disk or wire-format dependency.


Summary

Area Finding
5D/5E changes Correct, no issues
6A scope Clear, but should specify nil-KEK behavior (typed error, not panic)
6C row Misleading — flag is introduced in 6A, not 6C; reword to "extend"
6E row Missing coordinator wrap-on-propose + proposal quiescence barrier
6A→6F ordering Correct; document 6B→6D as explicit load-bearing link
Phase-1 dwell claim Valid as a pause point; "rollback from" is incorrect — Phase 1 is one-way per §7.2/§9.1

The 6A row and dependency chain are strong enough to start the 6A implementation. The two doc fixes (6C/6E row wording and Phase-1 dwell language) are low-risk edits that can land in this PR or as the first commit of 6A.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the data-at-rest encryption design document, marking Stage 5 as partially shipped and decomposing Stage 6 into six granular sub-milestones (6A-6F) to ensure incremental, safe delivery. A detailed rationale section was added to explain the dependencies and safety boundaries for each sub-stage. Feedback was provided to include specific section references in the milestone table and to expand the document with zero-downtime cutover strategies and operational risk mitigations for the proposed migrations.

| 6B | §6.5 KEK plumbing — `--kekFile` / `--kekUri` flags + `internal/encryption/kek` selection at startup + applier KEK-unwrap path | open | — |
| 6C | §6.5 `--encryption-enabled` + §9.1 startup refusal guards (sidecar present without flag, KEK mismatch, sidecar/raft index gap, filesystem fsync support, `local_epoch` rollback / exhaustion, `node_id` collision) | open | — |
| 6D | §6.6 `enable-storage-envelope` admin RPC + §7.1 Phase-1 storage cutover (§6.2 toggle ON) + Voters ∪ Learners capability gate | open | — |
| 6E | §6.6 `enable-raft-envelope` admin RPC + §7.1 Phase-2 raft cutover + `raft_envelope_cutover_index` sidecar record + `internal/raftengine/etcd/engine.go` `applyNormalEntry` unwrap hook activation + `ErrRaftUnwrapFailed` HaltApply path | open | — |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

For consistency and clarity, consider adding a reference to §6.3 in the scope of this sub-milestone. Furthermore, as this design involves a 'Phase-2 raft cutover' (a production-grade system migration), please ensure the document details strategies for zero-downtime cutover and mitigations for operational risks like rolling upgrade compatibility, as required by repository guidelines.

Suggested change
| 6E | §6.6 `enable-raft-envelope` admin RPC + §7.1 Phase-2 raft cutover + `raft_envelope_cutover_index` sidecar record + `internal/raftengine/etcd/engine.go` `applyNormalEntry` unwrap hook activation + `ErrRaftUnwrapFailed` HaltApply path | open ||
| 6E | §6.6 enable-raft-envelope admin RPC + §7.1 Phase-2 raft cutover + raft_envelope_cutover_index sidecar record + internal/raftengine/etcd/engine.go applyNormalEntry unwrap hook activation (§6.3) + ErrRaftUnwrapFailed HaltApply path | open ||
References
  1. When a design document identifies a significant operational risk, such as the inability to perform rolling upgrades, it must also detail potential mitigation strategies, like implementing a temporary 'bridge' or 'proxy' mode.
  2. When designing a production-grade system migration, the plan must consider and detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, logical replication) to avoid service interruption.

…hase-1 dwell wording

claude[bot] round-1 review on cbaa364 surfaced three
actionable doc improvements (no blocking issues, plan
acceptable to start 6A implementation):

1. **6C row was misleading.** `--encryption-enabled` is
   introduced in 6A as the wire-mutators gate, not in 6C.
   The 6C row read as if the flag were new in 6C, which
   would confuse the 6A vs 6C reviewer split. Rewrote 6C
   as "Extend --encryption-enabled (introduced in 6A) with
   §9.1 startup refusal guards" so the boundary is explicit.

2. **6E row omitted two §7.1 Phase-2 pieces.** The original
   6E row covered the admin RPC, sidecar record, engine
   unwrap hook, and ErrRaftUnwrapFailed — but not the
   coordinator-side wrap-on-propose switch nor the §7.1
   steps 1–6 proposal quiescence barrier. Both are
   load-bearing safety mechanisms for Phase 2: without the
   barrier, cleartext entries appended at indexes > cutover
   would fail every apply with a GCM verification error
   on followers. Added both pieces to the 6E row so the
   6E implementer cannot accidentally ship Phase 2 without
   them.

3. **Phase-1 dwell rationale claimed a non-existent rollback
   option.** The original text said operators could "dwell
   in, observe, and rollback from" Phase 1 before committing
   to Phase 2. §7.2 is unambiguous: encrypted → cleartext is
   not supported under any phase; the only path back is
   dump-and-reload. §9.1's sidecar-present guard activates
   cluster-wide on the first enable-storage-envelope commit,
   so Phase 1 is also one-way. Rewrote to "dwell in and
   observe indefinitely before committing to Phase 2" and
   added an explicit "Phase 1 is not a rollback point"
   note so operators planning their runbook cannot mistake
   the dwell for an escape hatch.

## Additional improvements (claude[bot] optional / preventive)

- **6A nil-KEK contract**: added explicit "returns typed
  ErrKEKNotConfigured through the ErrEncryptionApply
  HaltApply pipeline" wording to the 6A row + rationale
  bullet, so 6B has a clean API contract to implement
  against (drop a KEKUnwrapper in, the existing path
  fills). Documents that ApplyRegistration is fully
  functional in 6A (no KEK needed); only ApplyBootstrap
  and ApplyRotation hit the stub.

- **6D → 6B load-bearing dependency**: added a separate
  rationale bullet calling out that §7.1 Phase 1 step 4
  implicitly triggers the §5.6 DEK bootstrap when
  active.storage == 0, which routes through
  ApplyBootstrap's KEK-unwrap. This is a non-obvious
  cross-cut: the nominal 6C → 6D ordering is insufficient;
  6B must precede 6D regardless of 6C's status. Stating
  this in the rationale closes the "why not 6C, 6D, 6B?"
  question before it gets asked in PR review.

## Caller audit

Documentation-only PR — no code paths touched, no callers
to audit.

## Verification

- git diff: only the design doc changed, 44 insertions /
  8 deletions.
- The local pre-commit hook (make lint) reported 0 issues
  (no code paths to lint).

Plan is now ready to start 6A implementation as the next PR.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 16, 2026

@claude review

Round-2 (commit 1b3cabf7): addressing your round-1 verdict on cbaa364d. All three actionable findings + both optional improvements applied.

Round-1 findings addressed

# Finding Resolution
1 6C row misleading--encryption-enabled introduced in 6A, not 6C 6C row reworded: "Extend --encryption-enabled (introduced in 6A) with §9.1 startup refusal guards"
2 6E row omits coordinator wrap-on-propose + §7.1 quiescence barrier 6E row extended with both pieces — kv/coordinator.go / kv/sharded_coordinator.go wrap-on-propose switch + §7.1 steps 1–6 quiescence barrier (block new intake / drain / source-tag exemption)
3 Phase-1 dwell "rollback from" wrong — Phase 1 is also one-way per §7.2/§9.1 Reworded to "dwell in and observe indefinitely" + added explicit "Phase 1 is not a rollback point" note citing §9.1 sidecar-present guard and §7.2

Optional / preventive improvements also applied

  • 6A nil-KEK API contract — explicitly states ApplyBootstrap / ApplyRotation return typed ErrKEKNotConfigured through the existing ErrEncryptionApply HaltApply pipeline; ApplyRegistration is fully functional in 6A (no KEK needed). 6B's contract is now "drop in a KEKUnwrapper, the path fills."

  • 6D → 6B load-bearing dependency — added rationale bullet explaining §7.1 Phase 1 step 4 implicitly triggers the §5.6 DEK bootstrap which routes through ApplyBootstrap's KEK-unwrap. The nominal 6C → 6D ordering is insufficient; 6B must precede 6D.

Verification

Doc-only — git diff shows 44 insertions / 8 deletions in the design doc, no code paths touched. make lint 0 issues (no code to lint).

@claude
Copy link
Copy Markdown

claude Bot commented May 16, 2026

Claude finished @bootjp's task in 1m 43s —— View job


Round-2 review

  • Read trigger comment and understand round-2 scope
  • Read current state of design doc (commit 1b3cabf7)
  • Verify finding Configure Renovate #1: 6C row reworded to "extend"
  • Verify finding Update dependency go to v1.21.4 #2: 6E row extended with coordinator wrap-on-propose + §7.1 quiescence barrier
  • Verify finding Update docker/setup-qemu-action action to v3 #3: Phase-1 dwell language corrected
  • Verify optional: 6A nil-KEK API contract documented
  • Verify optional: 6D → 6B load-bearing dependency documented
  • Check for any new issues introduced in round-2 edits

Verdict: all five round-1 findings correctly addressed

# Round-1 finding Round-2 resolution Status
1 6C row misleading — --encryption-enabled introduced in 6A "Extend --encryption-enabled (introduced in 6A) with §9.1 startup refusal guards"
2 6E row missing coordinator wrap-on-propose + §7.1 quiescence barrier Both added: kv/coordinator.go / kv/sharded_coordinator.go wrap-on-propose switch + §7.1 steps 1–6 barrier
3 Phase-1 dwell "rollback from" inconsistent with §7.2/§9.1 "dwell in and observe indefinitely" + explicit "Phase 1 is not a rollback point" paragraph citing §9.1 and §7.2
4 (opt) 6A nil-KEK API contract unspecified ErrKEKNotConfigured through ErrEncryptionApply HaltApply pipeline stated in both table row and rationale bullet; ApplyRegistration carve-out noted
5 (opt) 6D → 6B load-bearing link not documented 6D table row + dedicated rationale bullet: §7.1 Phase 1 step 4 → ApplyBootstrap KEK-unwrap → ErrKEKNotConfigured halt without 6B

One new issue introduced in round-2

Counting error in the 6F rationale bullet (docs/design/2026_04_29_partial_data_at_rest_encryption.md, line ~133 in the current file):

"6F lands the convenience flag after the four load-bearing PRs (6A–6E) are in."

6A, 6B, 6C, 6D, 6E is five PRs, not four. The range and the count contradict each other. Suggested fix:

6F lands the convenience flag after the five load-bearing PRs (6A–6E) are in.

This is the only defect in the round-2 edits. Everything else — the milestone table, the rationale section, the Phase-1 dwell language — is internally consistent and accurate.

Fix counting error →


Gemini-code-assist suggestion (6E row + zero-downtime)

The automated review asked for a §6.3 reference in the 6E scope — that reference is now present inline as "(Phase-2 leader-side §6.3 proposal-payload wrap)". The zero-downtime concern is addressed by the quiescence barrier (§7.1 steps 1–6, now in the 6E row) and rolling-upgrade compatibility is gated by the Voters ∪ Learners capability check in 6D — both mechanisms are now named in the table. No further additions needed there.


Summary

The doc is ready to merge after the one-word fix (fourfive). All load-bearing scope descriptions are accurate, the dependency chain is correctly documented, and the Phase-1 dwell invariant is now precisely stated. 6A can begin implementation against this milestone table.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1b3cabf772

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +75 to +78
fills the path; until that lands, any operator who
hand-triggers `BootstrapEncryption` or `RotateDEK` on a
6A-only binary with `--encryption-enabled` set hits a clean,
named error in the gRPC error rather than a process crash.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid treating 6A KEK-missing applies as harmless errors

This section says a 6A-only cluster will return a clean gRPC error if an operator triggers BootstrapEncryption/RotateDEK before 6B, but the current FSM contract is fail-closed: kv/fsm_encryption.go wraps ApplyBootstrap/ApplyRotation errors with ErrEncryptionApply, and internal/raftengine/etcd/engine.go halts apply without advancing setApplied when HaltApply() is non-nil. If such an entry is Raft-committed, replicas can repeatedly halt on that index after restart, so this is an availability-bricking path rather than a recoverable request error.

Useful? React with 👍 / 👎.

Two issues on round-2 (commit 1b3cabf):

## codex P1: 6A re-wiring mutators creates the same cluster-halt class as PR #760

Round-2 wrote: "any operator who hand-triggers BootstrapEncryption
or RotateDEK on a 6A-only binary with --encryption-enabled set
hits a clean, named error in the gRPC error rather than a process
crash." Codex correctly flagged this as wrong:

- The FSM contract is fail-closed: kv/fsm_encryption.go wraps
  ApplyBootstrap/ApplyRotation errors with ErrEncryptionApply.
- raftengine/etcd/engine.go halts apply without advancing
  setApplied when HaltApply() is non-nil.
- So a 6A-only mutator RPC: handler proposes, entry commits,
  every replica's FSM apply returns ErrKEKNotConfigured,
  HaltApply fires, apply loop halts. Restart replays the same
  entry → halts again. Cluster bricks on that index.

This is the same class of failure as PR #760 round-3 codex P1.
The fix for #760 was "don't wire mutators until applier exists";
the principled extension is "don't wire mutators until applier
AND KEK are both present."

## Fix: revise 6A/6B split

- **6A revised**: lands the §6.3 EncryptionApplier as INTERNAL
  scaffolding only. WithEncryption(applier) is wired at FSM
  construction so the dispatch path exists, but
  registerEncryptionAdminServer carries forward Stage 5D's
  posture — Proposer + LeaderView stay unwired, mutating RPCs
  continue to refuse with FailedPrecondition. No flag
  introduced. The applier is reachable only through direct
  FSM.Apply calls in unit tests. 6A's value is unblocking
  6B/6D by providing a real applier to bind KEK and cutover
  work against.

- **6B revised**: lands KEK plumbing AND introduces the
  --encryption-enabled flag AND re-enables the mutator wiring
  gated on (--encryption-enabled AND KEKConfigured()). Both
  conditions must hold, so a binary with the flag set but no
  KEK source loaded continues to refuse mutators at the RPC
  boundary — preventing the cluster-halt path codex identified.

- **6C revised**: now extends the flag "introduced in 6B"
  (not 6A) since the introduction moved to 6B.

- **6D rationale rewritten**: the previously-special "6D
  depends on 6B because §7.1 Phase 1 step 4 implicitly triggers
  ApplyBootstrap" dependency is now a special case of the
  general "6B gates all mutator wiring" rule. Both 6D and 6E
  depend on 6B unconditionally — 6B is what makes any
  EncryptionAdmin mutator reachable. The 6D row drops the
  long inline dependency note since the rationale section
  covers it.

The defense-in-depth ErrKEKNotConfigured marker in the applier
is retained — a future refactor that bypasses the 6B RPC gate
should produce a named grep-able error in logs rather than a
nil pointer dereference. The marker is no longer the production
safety boundary (the RPC gate is); it is a belt-and-suspenders
backstop.

## claude[bot] round-2 nit: "four" → "five"

The 6F rationale bullet said "after the four load-bearing PRs
(6A–6E)". The range 6A–6E is five PRs. Fixed.

## Caller audit

Doc-only — no code paths touched, no callers.

## Verification

- git diff scope: design doc only, 81 insertions / 53 deletions.
- make lint: 0 issues (no code to lint).
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 16, 2026

@claude review

Round-3 (commit a5754f60): addressing codex round-2 P1 (cluster-halt class regression in 6A's plan) + claude[bot] round-2 four→five count.

codex round-2 P1

Codex flagged round-2 wording "any operator who hand-triggers BootstrapEncryption or RotateDEK on a 6A-only binary with --encryption-enabled set hits a clean, named error in the gRPC error rather than a process crash" as wrong:

the current FSM contract is fail-closed: kv/fsm_encryption.go wraps ApplyBootstrap/ApplyRotation errors with ErrEncryptionApply, and internal/raftengine/etcd/engine.go halts apply without advancing setApplied when HaltApply() is non-nil. If such an entry is Raft-committed, replicas can repeatedly halt on that index after restart, so this is an availability-bricking path rather than a recoverable request error.

This is the same cluster-halt class as PR #760 r3 codex P1.

Fix: revise 6A/6B split

  • 6A lands the §6.3 EncryptionApplier as internal scaffolding only. WithEncryption(applier) is wired at FSM construction; registerEncryptionAdminServer carries forward Stage 5D's posture — Proposer + LeaderView stay unwired, mutators refuse with FailedPrecondition. No flag introduced. Reachable only via direct FSM.Apply unit tests.

  • 6B lands KEK plumbing AND introduces --encryption-enabled AND re-enables mutator wiring gated on (--encryption-enabled AND KEKConfigured()). Both conditions must hold, so a binary with the flag set but no KEK source loaded continues to refuse mutators at the RPC boundary.

  • 6C now extends the flag "introduced in 6B" (not 6A).

  • 6D rationale rewritten — the special-case 6D→6B dependency is now a case of the general "6B gates all mutator wiring" rule. Both 6D and 6E depend on 6B unconditionally.

The defense-in-depth ErrKEKNotConfigured marker in the applier is retained as a belt-and-suspenders backstop in case a future refactor bypasses the 6B RPC gate.

claude[bot] round-2 nit

four → five (6A–6E is five PRs). Fixed.

Caller audit

Doc-only — no code paths touched.

Verification

  • git diff: 81 ins / 53 del, design doc only.
  • make lint: 0 issues.

@claude
Copy link
Copy Markdown

claude Bot commented May 16, 2026

Claude finished @bootjp's task in 3m 1s —— View job


Round-3 review

  • Read trigger comment and understand round-3 scope
  • Read current state of design doc (commit a5754f60)
  • Verify codex P1 fix: 6A/6B split revision (cluster-halt prevention)
  • Verify round-2 nit: four→five count
  • Verify 6C now references "introduced in 6B" not "6A"
  • Verify 6D rationale rewrite
  • Check for any new issues introduced in round-3 edits

Verdict: all round-2 findings correctly addressed; one minor precision issue in the 6A rationale

# Round finding Round-3 resolution Status
Codex P1 cluster-halt class if KEK absent when mutators fire 6A: scaffolding only, mutators OFF; 6B: flag + KEK + mutator gate, both conditions required
Claude r2 nit "four load-bearing PRs (6A–6E)" "five load-bearing PRs (6A–6E)" (line 161)
6C flag attribution "introduced in 6A" "(introduced in 6B)"
6D → 6B dependency special-cased to the ApplyBootstrap KEK path generalized: 6B gates all mutator wiring; 6D/6E both depend on 6B unconditionally

One new precision issue: "reachable only via unit tests" is not quite right

docs/design/2026_04_29_partial_data_at_rest_encryption.md, line 75–76 — the 6A rationale bullet reads:

"The applier is reachable only through direct FSM.Apply calls in unit tests; no operator-facing surface changes in 6A."

This is accurate for a same-version 6A cluster, but not in a rolling 6A→6B upgrade. WithEncryption(applier) is wired at FSM construction — the dispatch path is live. A 6A node that receives a committed 0x03/0x04 Raft entry (originating from a 6B+ sibling with --encryption-enabled set and mutators re-enabled) will call ApplyBootstrap/ApplyRotation, return ErrKEKNotConfigured, and halt at that index.

This is the correct behavior — a 6A node cannot safely apply an entry it cannot KEK-unwrap, and the fail-closed halt is what you want — but describing the applier as "only reachable via unit tests" elides the mixed-version case.

Suggested wording:

"The applier is operator-inert in a same-version 6A cluster: no 6A binary can propose 0x03/0x04 entries (mutators are gated OFF), so the FSM dispatch path is exercised only in direct FSM.Apply unit tests. During a rolling 6A→6B upgrade, operators must not set --encryption-enabled until all nodes are on 6B+; a 0x03/0x04 entry committed by a 6B node while a 6A replica is still in the ring will correctly halt the 6A replica with ErrKEKNotConfigured."

Fix wording →

Severity: low — the cluster's behavior is correct (fail-closed halt is right), and a 6A-only cluster has no exposure. This only matters for the rolling upgrade runbook. The 6A PR itself can carry the precise language.


One clarity note: "6C orthogonal... can ship at any point after 6B"

Line 150–151: "the actual shipping dependency is 6A → 6B → 6D → 6E with 6C orthogonal (it strictly extends --encryption-enabled guards and can ship at any point after 6B)"

This is structurally correct (6C has no code dependency on 6D/6E), but the phrase "can ship at any point after 6B" implies 6C could trail 6D. Two of 6C's guards — KEK mismatch and sidecar/raft index gap — become operationally load-bearing precisely during and after the 6D Phase-1 cutover. Running enable-storage-envelope without the KEK-mismatch guard in place means a KEK rotation between Phase 1 and Phase 2 could leave nodes that refuse to start on the next restart with no diagnosable guard. The nominal table order (6C before 6D) is correct; the "orthogonal" note in the rationale could be tightened to avoid implying the trailing-order is safe in practice.

Suggested addendum: "In practice, 6C should ship before 6D — the KEK-mismatch and sidecar/raft-index-gap guards are most load-bearing during the Phase-1 cutover window."

Severity: low — the table order already puts 6C before 6D, and this is a doc-clarity issue rather than a correctness issue.


Summary

The codex P1 fix is correct and complete. The 6A/6B split cleanly separates the FSM-apply scaffolding from the RPC-gate boolean, the double gate (--encryption-enabled AND KEKConfigured()) correctly prevents the cluster-halt class, and the ErrKEKNotConfigured defense-in-depth framing is accurate. The five-count fix and the 6C flag attribution are in. The two items above are low-severity documentation precision issues; neither blocks 6A implementation. The doc is ready to merge as-is, with the precision fixes optionally folded into the 6A implementation PR.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a5754f6027

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +149 to +150
them), but the actual shipping dependency is 6A → 6B → 6D
→ 6E with 6C orthogonal (it strictly extends `--encryption-enabled`
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Make 6C a prerequisite for encryption cutovers

Treating 6C as optional ordering (6A → 6B → 6D → 6E with 6C “orthogonal”) is unsafe because §9.1’s startup refusal checks are what prevent post-cutover downgrade/misconfiguration states (for example starting with a sidecar but --encryption-enabled off, or restarting from stale local_epoch state that can reissue nonces). If 6D/6E ship before those guards, operators can enable envelopes and then restart into states the design elsewhere assumes are blocked, violating the security and correctness invariants this rollout is meant to enforce.

Useful? React with 👍 / 👎.

Codex r4 P1 (line 150): the r3 rationale described 6C (§9.1
startup refusal guards) as orthogonal/optional with the shipping
order 6A → 6B → 6D → 6E, but the §9.1 invariants are what
prevent post-cutover restart misconfigurations. If 6D/6E ship
before 6C is in the binary:

  - An operator who restarts a node with sidecar present but
    --encryption-enabled off silently reads cleartext past an
    envelope cutover (§9.1 sidecar-present-without-flag guard
    not yet wired).
  - A binary without local_epoch rollback / exhaustion guards
    can reissue a stale local_epoch and recycle a previously-
    used GCM nonce (§4.1 invariant violation).
  - KEK mismatch / sidecar-raft index gap / fsync-unsupported /
    node_id collision states all bypass the documented refusal.

All four are security-class regressions the §9.1 guards exist to
prevent, so 6C is a hard predecessor for the cutover milestones,
not a separable cleanup.

Update the milestone-rationale paragraph to "6A → 6B → 6C → 6D
→ 6E with 6C as a hard prerequisite". Add the dependency text
to the 6D and 6E rows of the milestone table. The
6D-implicit-bootstrap rationale (§7.1 Phase 1 step 4 routing
through ApplyBootstrap's KEK-unwrap) stays correct, and 6C is
added as a second hard predecessor alongside 6B rather than
replacing it.

Proposal-only — no code in this PR.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 16, 2026

@claude review

Round 4 — Codex P1 (line 150): r3 described 6C (§9.1 startup refusal guards) as orthogonal/optional with order 6A → 6B → 6D → 6E, but §9.1 invariants prevent post-cutover restart misconfigurations. Without 6C in the binary:

  • sidecar-present-without-flag → silently read cleartext past envelope cutover
  • stale local_epoch → recycle GCM nonce (§4.1 invariant violation)
  • KEK mismatch / index gap / fsync-unsupported / node_id collision bypass refusal

All four are security-class regressions. 6C is now a hard prerequisite for 6D/6E, not orthogonal.

Updated rationale paragraph: "6A → 6B → 6C → 6D → 6E with 6C as a hard prerequisite". Milestone table rows for 6D / 6E both call out the 6C dependency explicitly.

Proposal-only.

@claude
Copy link
Copy Markdown

claude Bot commented May 16, 2026

Claude finished @bootjp's task in 1m 21s —— View job


Round-4 review

  • Read trigger comment and understand round-4 scope
  • Read current state of design doc (commit abd68cb1)
  • Verify Codex P1 fix: 6C elevated to hard prerequisite for 6D/6E
  • Verify 6D/6E rows explicitly call out 6C dependency
  • Verify updated rationale paragraph ordering
  • Check for any new issues introduced in round-4 edits

Verdict: Codex P1 correctly addressed; no new issues introduced

# Finding Round-4 resolution Status
Codex r4 P1 6C treated as "orthogonal" — §9.1 invariants violated if 6D/6E ship without it Ordering rewritten: "6A → 6B → 6C → 6D → 6E with 6C as a hard prerequisite, NOT optional/orthogonal"
6D row Only listed 6B dependency "depends on 6B for mutator wiring AND 6C for §9.1 startup-refusal guards"
6E row Only listed quiescence barrier; no 6C mention "Depends on 6B for mutator wiring AND 6C for §9.1 startup-refusal guards" appended
Rationale bullet "6C orthogonal, can ship at any point after 6B" Removed; replaced with hard-prerequisite explanation including the two specific security regressions
Old "special case" framing 6D-implicit-bootstrap rationale treated as the dependency reason Correctly preserved as a second hard predecessor ("6C-gate adds a second hard predecessor rather than replacing the 6B one")

The Round 4 diff is clean: the "6D and 6E both depend on 6B unconditionally" paragraph is replaced with a full security-class explanation of why 6C is a hard prerequisite — sidecar-present-without-flag silently reads cleartext past an envelope cutover, and a stale local_epoch recycles a GCM nonce (§4.1 invariant violation). Both regressions are exactly the violations Codex r4 P1 named.


One carry-over note from R3 (pre-existing, low severity)

The 6A table row and rationale still read "reachable only through direct FSM.Apply calls in unit tests; no operator-facing surface changes in 6A." As R3 flagged, this is accurate for a same-version 6A cluster but elides the mixed-version case: a 6A replica that receives a Raft-committed 0x03/0x04 entry from a 6B+ sibling will correctly halt with ErrKEKNotConfigured — fail-closed behavior is correct, but the description understates the surface. This was already noted in R3 as "the 6A PR itself can carry the precise language." Still the right call; no change needed in this doc PR.


Summary

The Codex P1 fix is correct and complete. The dependency chain 6A → 6B → 6C → 6D → 6E is now accurately documented in both the milestone table (6D and 6E rows) and the rationale section, with the security rationale for the 6C constraint spelled out. No new defects introduced in Round 4. The doc is ready to merge.

…n-entry caveat

claude[bot] round-3 low-severity (line 75–76): the r3 6A bullet
described the applier as 'reachable only through direct FSM.Apply
calls in unit tests', which is accurate for a same-version 6A
cluster but elides the rolling 6A→6B upgrade case.

Because WithEncryption(applier) is wired in 6A, a 6A node
co-resident with a 6B+ node will dispatch any 0x03/0x04 entry
the 6B leader commits to its (KEK-less) applier and halt with
ErrKEKNotConfigured. The halt itself is fail-closed by design —
a 6A replica cannot safely apply an entry it cannot KEK-unwrap —
but operators must not set --encryption-enabled on a 6B+ node
until every Raft member (voters AND learners, same Voters ∪
Learners gate as §7.1) is on 6B+.

This commit:
- Reworded the 6A row to acknowledge the caveat and point at
  the rationale section for the full story.
- Reworked the 6A rationale bullet to:
  - Replace 'reachable only through... unit tests' with
    'operator-inert in a same-version 6A cluster (exercised
    only via direct FSM.Apply unit tests)' — preserves the
    same-version meaning without claiming the cross-version
    case is unreachable.
  - Add a new 'Rolling 6A→6B upgrade caveat' paragraph naming
    the Voters ∪ Learners gate and pointing forward at 6B's
    runbook.

Doc-only — no code paths touched, no callers.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 16, 2026

@claude review

Round-4: 2 commits — user's parallel abd68cb1 addressing codex r4 P1 (6C as hard prereq) + my 4b9a186e addressing your r3 low-severity rolling-upgrade note.

codex r4 P1 (user's commit abd68cb1)

User addressed the same finding I had locally — "6C orthogonal" was unsafe per §9.1 invariants. Their commit:

  • Updated 6D + 6E rows to add "AND 6C for §9.1 startup-refusal guards"
  • Rewrote the 6D/6E rationale bullet to make 6C unconditional with two concrete examples (sidecar-present-without-flag silent downgrade + local_epoch rollback nonce-reuse).

I reset my local divergent commit and accepted theirs as canonical, since their wording is more direct.

claude[bot] r3 low-severity (my commit 4b9a186e)

Your r3 verdict flagged that the 6A bullet's "reachable only through direct FSM.Apply calls in unit tests" wording elides the rolling 6A→6B upgrade case — a 6A node will dispatch a 6B-committed 0x03/0x04 entry to its KEK-less applier and halt with ErrKEKNotConfigured.

Applied your suggested rewording verbatim:

  • 6A row: now reads "operator-inert in a same-version 6A cluster (exercised only via direct FSM.Apply unit tests). Rolling 6A→6B upgrade has a halt-on-foreign-entry caveat — see rationale"
  • 6A rationale: added a new "Rolling 6A→6B upgrade caveat" paragraph naming the Voters ∪ Learners gate and pointing forward at 6B's runbook for the operator constraint.

Caller audit

Doc-only — no code paths touched.

Verification

  • git diff (vs prior round-3 HEAD a5754f6): 76 ins / 25 del across both commits, design doc only.
  • make lint: 0 issues (no code to lint).

@claude
Copy link
Copy Markdown

claude Bot commented May 16, 2026

Claude finished @bootjp's task in 2m 36s —— View job


Round-5 review

  • Read trigger comment and understand round-5 scope
  • Read current state of design doc (commits abd68cb1 + 4b9a186e)
  • Verify codex r4 P1 fix: 6C as hard prerequisite, 6D/6E rows updated
  • Verify r3 low-severity fix: 6A rolling upgrade caveat added
  • Check for any new issues introduced in round-5 edits

Verdict: all round-4/r4 findings correctly addressed; no new defects

# Finding Round-5 resolution Status
Codex r4 P1 "6C orthogonal" — §9.1 startup guards bypass after 6D/6E is a security-class regression 6D row: "depends on 6B... AND 6C for §9.1 startup-refusal guards"; 6E row: same; rationale rewritten as "6C as a hard prerequisite, NOT optional/orthogonal" with two concrete examples
Codex r4 P1 examples Two specific failure modes should be named sidecar-present-without-flag → silent cleartext read past envelope cutover; stale local_epoch → GCM nonce recycle
Claude r3 low-severity 6A "only reachable via unit tests" elides rolling upgrade case 6A row: "Rolling 6A→6B upgrade has a halt-on-foreign-entry caveat — see rationale"; 6A rationale: new paragraph with Voters ∪ Learners gate, operator constraint, 6B runbook forward-pointer
Operator constraint precision Must hold: every node in every Raft group on 6B+ before enabling flag "must not set --encryption-enabled on a 6B+ node until every member of every Raft group (voters AND learners)" stated

One pre-existing nuance in the 6C rationale bullet (not introduced in round-5)

The 6C rationale bullet reads: "They are independent of the FSM apply path and the cutover RPCs, so they land in their own PR."

This sentence correctly explains why 6C is a separate PR (code-level independence). But read alongside the subsequent bullet ("6C as a hard prerequisite, NOT optional/orthogonal"), it could leave a reader uncertain whether "independent of the cutover RPCs" means "no ordering constraint relative to 6D/6E." The code is independent; the safety ordering is not.

This pre-dates round-5 and was the root cause of the "orthogonal" reading that codex r4 P1 corrected. The subsequent rationale bullet already contains the authoritative statement. Whether to patch the 6C bullet itself or leave the adjacent bullet as the authoritative fix is judgment call — I'd lean toward a one-sentence addendum ("independent in implementation; ordered before 6D/6E for safety — see the next bullet"), but it is not blocking.

Severity: very low / cosmetic — the hard-prerequisite language in the next bullet is unambiguous, and the confusion it risks would have to survive reading two adjacent paragraphs.


No new defects in the round-5 commits

Both commits are precise and narrow:

  • abd68cb1: 6D/6E table rows extended with the 6C dependency clause; rationale bullet rewrites the "6C orthogonal" paragraph as a named prerequisite with security-class failure examples. The two examples (sidecar-present-without-flag, stale local_epoch) are correctly drawn from §9.1 and §4.1 respectively.
  • 4b9a186e: 6A table row shortened to a caveat pointer; 6A rationale acquires a new "Rolling 6A→6B upgrade caveat" paragraph. The constraint is stated with the correct scope (voters AND learners, per §7.1 principle), the fail-closed rationale is stated, and the 6B runbook forward-pointer closes the loop.

The five-count (6A–6E = five PRs) remains correct. The 6C flag attribution ("introduced in 6B") remains correct. All round-1 through round-4 findings remain resolved.

The doc is ready to merge. 6A implementation can begin against this milestone table.

… of cutovers'

claude[bot] r5 cosmetic note (judgment call, not blocking): the
6C rationale bullet read 'They are independent of the FSM apply
path and the cutover RPCs, so they land in their own PR.' That
'independent of the cutover RPCs' phrase, read on its own,
could imply 'no ordering constraint relative to 6D/6E' — which
is exactly the wrong reading codex r4 P1 corrected. The
adjacent bullet contains the authoritative hard-prerequisite
statement, but a one-sentence addendum here removes the
possible confusion before the reader gets to it.

Fix: clarify that the independence is at the implementation
level only — the safety ordering is 6C before 6D/6E — and
point forward at the next bullet for the security rationale.

Doc-only — no code paths touched, no callers.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 16, 2026

@claude review

Round-5 (commit 9c3dcea8): addressing the one cosmetic nuance from your r5 verdict.

r5 cosmetic note

Your r5 verdict said the doc was ready to merge but flagged this 6C rationale sentence as a judgment-call clarification:

"They are independent of the FSM apply path and the cutover RPCs, so they land in their own PR."

The phrase "independent of the cutover RPCs" — read on its own — could imply "no ordering constraint relative to 6D/6E", which is precisely the wrong reading codex r4 P1 corrected.

Fix

Reworded to clarify the independence is implementation-level only, and explicitly point forward at the next bullet for the safety ordering:

"The guard code is independent of the FSM apply path and the
cutover RPCs (it runs at startup, before either has a chance
to fire), so 6C lands in its own PR. The independence is at
the implementation level only — the safety ordering is 6C
before 6D/6E
, see the next bullet for why."

Removes the possible misreading before it reaches the reader.

Caller audit

Doc-only — no code paths touched.

Verification

  • git diff: 6 insertions / 3 deletions, design doc only.
  • make lint: 0 issues.

@claude
Copy link
Copy Markdown

claude Bot commented May 16, 2026

Claude finished @bootjp's task in 51s —— View job


Round-6 review

  • Read trigger comment and understand round-6 scope
  • Read current state of design doc (commit 9c3dcea8)
  • Verify r5 cosmetic fix: 6C rationale sentence reworded
  • Check for any new issues introduced in round-6 edits

Verdict: r5 cosmetic fix correctly applied; no new issues

# Finding Round-6 resolution Status
Claude r5 cosmetic "independent of the cutover RPCs" implied no ordering constraint vs 6D/6E Reworded with implementation-vs-safety-ordering distinction + forward pointer to the next bullet

The diff is exactly 6 insertions / 3 deletions in the 6C rationale bullet, as stated. The new text is strictly better than my r5 suggestion — the parenthetical "(it runs at startup, before either has a chance to fire)" I did not specify but is a concrete mechanistic explanation for why 6C's code is decoupled from the cutover RPCs. A reader who wonders "what does independent mean here?" gets an immediate answer without having to infer it from the FSM/RPC structure.

The two-sentence structure now reads cleanly:

  1. Why 6C is its own PR — implementation independence (startup-path code, not FSM-apply or cutover-RPC code).
  2. Why that independence does not relax the ordering — bolded constraint pointing at the next bullet for the security rationale.

The adjacent bullet ("6C as a hard prerequisite, NOT optional/orthogonal") remains the authoritative statement of the safety ordering. The r5 fix now makes the 6C bullet self-consistent rather than relying on the reader to synthesize across two paragraphs.

No carry-over issues remain. The cumulative state of all six rounds:

Round Key finding Status
r1 6C row misleading; 6E missing coordinator wrap-on-propose + quiescence barrier; Phase-1 "rollback" wrong All resolved in r2
r2 Counting error: four→five (6A–6E) Resolved in r3
r3 6A "only via unit tests" elides rolling upgrade case Resolved in r4
r4 6C orthogonal/optional is a security-class safety violation; 6C must be a hard prereq for 6D/6E Resolved in r4 (codex P1) + r4 (claude caveat)
r5 6C independence sentence could be misread as "no ordering constraint" Resolved in r6 (this commit)

The doc is ready to merge. 6A implementation can begin against this milestone table.

@bootjp bootjp merged commit 6748aca into main May 16, 2026
9 of 10 checks passed
@bootjp bootjp deleted the feat/encryption-stage6 branch May 16, 2026 17:32
bootjp added a commit that referenced this pull request May 16, 2026
… + FSM WithEncryption wiring

Second slice of Stage 6A per the PR #762 plan. Wires the
§6.3 Applier (landed in part 1) into the FSM construction
path with a Pebble-backed WriterRegistryStore adapter so the
0x03 RegisterEncryptionWriter dispatch path is fully
functional end-to-end. Mutator-RPC wiring in
registerEncryptionAdminServer remains OFF per Stage 5D's
posture — re-enabled in Stage 6B alongside KEK plumbing
gated on (--encryption-enabled AND KEKConfigured()).

## What this commit ships

### store/encryption_glue.go (additions)

- pebbleWriterRegistry — internal adapter satisfying
  encryption.WriterRegistryStore. Both Get and Set hold
  dbMu for read only (the apply path's applyMu in
  kv/fsm.go is the authoritative serialiser for the
  §4.1 case-dispatch read-modify-write). Set uses
  pebble.Sync so the row is durable before FSM apply
  returns — a crash between FSM dispatch returning
  success and Pebble flushing the WAL would otherwise
  leave the Raft log claiming the entry applied but
  the registry row missing on restart, violating §4.1
  case 2's monotonic-epoch invariant.

- WriterRegistryFor(MVCCStore) factory — type-asserts
  against *pebbleStore and returns the adapter, or
  ErrUnsupportedStoreForWriterRegistry if the store is
  not the Pebble-backed implementation. Construction-time
  refusal so the failure surface is "binary refuses to
  start" rather than "FSM apply panics inside Raft loop".

- ErrUnsupportedStoreForWriterRegistry — new sentinel.

### store/writer_registry_test.go (new)

Three table-driven unit tests:

- TestWriterRegistryFor_RoundTrip — fresh Pebble store,
  missing row reads as (nil, false, nil), Set writes,
  Get returns byte-identical value.
- TestWriterRegistryFor_Overwrite — Set on existing key
  replaces the value; no MVCC versioning, no append.
- TestWriterRegistryFor_RejectsNonPebble — non-Pebble
  MVCCStore stub returns ErrUnsupportedStoreForWriterRegistry
  at construction.

### main.go (FSM construction site, line 694)

Replaces `sm := kv.NewKvFSMWithHLC(st, clock)` with the
applier-wired form:

  reg, err := store.WriterRegistryFor(st)
  if err != nil {
      _ = st.Close()
      return nil, nil, errors.Wrapf(err, ...)
  }
  applier, err := encryption.NewApplier(reg)
  if err != nil {
      _ = st.Close()
      return nil, nil, errors.Wrapf(err, ...)
  }
  sm := kv.NewKvFSMWithHLC(st, clock, kv.WithEncryption(applier))

Both error paths Close the store before returning so a
construction-time failure doesn't leak the Pebble handle.
Adds `internal/encryption` to the import list (the only
new import in main.go).

## What this commit does NOT do

- registerEncryptionAdminServer is unchanged. Mutating
  EncryptionAdmin RPCs continue to refuse with
  FailedPrecondition exactly as Stage 5D left them.
  The existing TestEncryptionAdmin_MutatingRPCRefusedUntilStage6
  test (both `sidecar_empty` and `sidecar_set` subcases)
  passes unchanged.

- The applier's ApplyBootstrap / ApplyRotation paths still
  return ErrKEKNotConfigured. Stage 6B re-enables mutator
  wiring AND threads in the KEK unwrapper at the same
  time so a re-enabled mutator never hits the
  ErrKEKNotConfigured halt path.

## Stage 5D safety boundary

The two PR760-merged regression tests pass on this branch:

- TestEncryptionAdminFullNodeID_DistinctPerRaftId — PASS
- TestEncryptionAdmin_MutatingRPCRefusedUntilStage6/sidecar_empty — PASS
- TestEncryptionAdmin_MutatingRPCRefusedUntilStage6/sidecar_set — PASS
- TestRegisterEncryptionAdminServer_Registers — PASS

The new FSM wiring path does not weaken any of these. The
applier exists internally; the gRPC layer continues to
refuse the operator-facing mutators that would route into
it.

## Five-lens self-review

1. Data loss — pebbleWriterRegistry uses pebble.Sync; FSM
   apply only returns success after the row is durable.
   Construction-time error paths Close the Pebble store
   on failure (no handle leak).
2. Concurrency / distributed failures — dbMu.RLock is
   sufficient because the apply path's applyMu in
   kv/fsm.go serialises all 0x03 dispatches; the read-
   modify-write pattern in Applier.ApplyRegistration is
   safe under that lock.
3. Performance — one Get + at most one Set per
   ApplyRegistration. pebble.Sync adds wal-flush cost,
   bounded.
4. Data consistency — writer-registry rows live outside
   the MVCC namespace (no commit timestamps, no MVCC
   visibility). Pebble Sync semantics match the §4.1
   contract that a registered writer's first nonce-issuing
   write under that DEK is durably preceded by its
   registry row.
5. Test coverage — 3 new store-package unit tests for the
   adapter; the part-1 applier tests already cover the
   applier-side dispatch; Stage 5D regression tests
   confirm no operator-surface change.

## Verification

- go test -race -timeout=60s -run 'TestWriterRegistryFor' ./store/... — PASS
- go test -race -timeout=60s -run 'TestApply|TestNewApplier' ./internal/encryption/... — PASS
- go test -race -timeout=60s -run 'TestEncryptionAdmin|TestRegisterEncryptionAdminServer' . — PASS
- go test -race -timeout=180s -run 'TestEncrypt|TestApplier|TestFSM|TestKvFSM|TestWriterRegistry' ./... — PASS
- go build ./... — PASS
- golangci-lint run ./... — 0 issues on touched files

The full adapter test suite has a pre-existing 120s-timeout
flake on race-mode (SQS reaper / HLC lease renewal long-
running goroutines visible in the dump). The encryption-
specific adapter tests pass cleanly when run in isolation;
the flake is unrelated to this PR's changes (no adapter
code is touched). Will run the suite again before marking
the PR ready for review.
bootjp added a commit that referenced this pull request May 16, 2026
…ption wiring (#765)

## Summary

First slice of **Stage 6A** per the [PR #762
plan](#762): the load-bearing
safety boundary that lands the §6.3 `EncryptionApplier` so future PRs
can re-enable mutating-RPC wiring without bricking the cluster.

This PR is doc-driven by the merged Stage 6 plan; specifically the 6A
row:

> §6.3 `EncryptionApplier` concrete implementation (writer-registry
insert + sidecar mutate + keystore update; nil-KEK code paths in
`ApplyBootstrap` / `ApplyRotation` return a typed `ErrKEKNotConfigured`
defense-in-depth marker until 6B fills it) + `kv.NewKvFSMWithHLC(...
WithEncryption(applier))` wiring. **Mutator wiring in
`registerEncryptionAdminServer` stays OFF, identical to Stage 5D**.

## What this PR ships

### Part 1 (this commit, `e2475b84`) — applier package

- **`internal/encryption/applier.go`** (new) — concrete `Applier`
satisfying `kv.EncryptionApplier`:
- `WriterRegistryStore` interface (abstract Get/Set over the
`!encryption|writers|<dek_id>|<uint16(node_id)>` Pebble namespace; kept
separate from `store.MVCCStore` because writer-registry rows live
outside MVCC).
- `ApplyRegistration` implements the §4.1 four-case dispatch (insert /
monotonic advance / rollback halt / uint16-collision halt).
- `ApplyBootstrap` and `ApplyRotation` return the typed
`ErrKEKNotConfigured` defense-in-depth marker — Stage 6B will replace
them with the real KEK-unwrap + sidecar + keystore path.

- **`internal/encryption/errors.go`** — add `ErrKEKNotConfigured`
sentinel with godoc referencing the PR #762 safety-boundary
decomposition.

- **`internal/encryption/applier_test.go`** (new) — 7 unit tests
covering all four §4.1 cases, construction-time nil-registry refusal,
store-error propagation, and the 6B-deferred stub contracts. Uses a
map-backed `WriterRegistryStore` fake.

### Part 2 (follow-up commit in this PR) — FSM wiring

- Concrete `WriterRegistryStore` implementation backed by the FSM's
Pebble store (`store/encryption_glue.go` or new file).
- `kv.NewKvFSMWithHLC(... kv.WithEncryption(applier))` wiring at
`main.go:694`.

The Part 1 commit compiles and tests pass on its own; Part 2 plumbs it
into production.

## What this PR does NOT do

`registerEncryptionAdminServer` is **untouched**. Mutating
EncryptionAdmin RPCs continue to refuse with `FailedPrecondition`
exactly as Stage 5D left them. The Stage 6 plan's safety invariant — do
not re-enable mutator wiring until both the applier AND a KEK unwrapper
are present — holds: this PR lands the applier (and the FSM dispatch is
no longer a halt path for `ApplyRegistration`), but the KEK plumbing
(6B) is still missing. 6B will re-enable the mutator wiring gated on
(`--encryption-enabled` AND `KEKConfigured()`), both conditions
required.

## Caller audit

- `kv.EncryptionApplier` interface — implemented by
`*encryption.Applier`. No other callers; the FSM dispatch is the sole
consumer.
- `internal/encryption/errors.go` exported sentinels —
`ErrKEKNotConfigured` is new; no callers yet outside `applier.go` (the
kv/fsm dispatch will wrap any non-nil return with `ErrEncryptionApply`
regardless of the inner marker).

## Five-lens self-review

1. **Data loss** — applier's writes go through
`WriterRegistryStore.SetRegistryRow`, which the production wiring (Part
2) routes through `pebble.Sync`. No silent skip path: every error return
is propagated; the `kv/fsm` dispatch marks them with
`ErrEncryptionApply` for HaltApply.
2. **Concurrency / distributed failures** — Applier carries no in-memory
state between calls; the `WriterRegistryStore` is the single source of
truth. FSM apply already serialises `0x03` entries via `applyMu`, so the
§4.1 case-dispatch read-modify-write is safe under that lock.
3. **Performance** — `ApplyRegistration` does one Get + at most one Set
per call. Production Pebble will add the sync cost; bounded.
4. **Data consistency** — §4.1 case 1/2/3/4 dispatch matches the design
doc. Case 3 (rollback) halts apply; the 6C startup guard will prevent it
from being reachable in production. Case 4 (uint16 collision) similarly
halts and is the per-node backstop for §6.1's cluster-wide uniqueness
check.
5. **Test coverage** — 7 unit tests covering all four §4.1 cases,
construction refusal, store-error propagation, and the 6B-deferred stub
contracts. Map-backed fake mirrors the production semantics.

## Test plan

- [x] `go test -race -timeout=60s ./internal/encryption/...` — PASS
- [x] `go build ./...` — PASS
- [x] `golangci-lint run ./internal/encryption/...` — 0 issues
- [ ] After Part 2 commit: `go test -race ./...` (full suite) — Will run
before requesting review
- [ ] After Part 2 commit: Re-run jepsen-relevant suite (none of 6A's
behaviour is operator-visible, so no envelope-cutover Jepsen workload
applies; documented in PR description)

## Plan: 6A → 6B → 6C → 6D → 6E → 6F

This is the first of six PRs per [PR
#762](#762). 6B will follow with
the KEK plumbing + mutator gate.


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **New Features**
* Integrated encryption into core system operations with enhanced writer
registration and persistence.
* Added encryption configuration validation and detailed error reporting
for setup issues.

* **Tests**
* Comprehensive test coverage for encryption functionality, data
storage, and integration scenarios.
* Verification tests for encryption configuration and startup
validation.

* **Chores**
* Enhanced application startup sequence with encryption support
integration.

<!-- review_stack_entry_start -->

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/bootjp/elastickv/pull/765?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack)

<!-- review_stack_entry_end -->

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
bootjp added a commit that referenced this pull request May 16, 2026
…on with KEK plumbing (#768)

## Summary

Stage 6B part 1 per the [PR #762
plan](#762). Lands the real
`ApplyBootstrap` and `ApplyRotation` paths in
`internal/encryption.Applier` — KEK-unwrap + Keystore install +
crash-durable sidecar mutate + writer-registry batch insert. Replaces
the Stage 6A `ErrKEKNotConfigured` stubs.

Mutator-RPC wiring in `registerEncryptionAdminServer` stays OFF (Stage
5D posture) so the operator surface is unchanged. The follow-up **6B-2**
PR threads `--kekFile` / `--encryption-enabled` flags + startup KEK
loader + keystore re-populate + re-enables mutator wiring gated on
`(--encryption-enabled AND KEKConfigured())`.

## API additions

- **`KEKUnwrapper` interface** — local declaration so the encryption
package does not pick up a transitive dependency on the kek package.
`kek.Wrapper` satisfies it structurally.
- **`ApplierOption` functional-options**: `WithKEK`, `WithKeystore`,
`WithSidecarPath`, `WithNowFunc`. The Stage 6A caller `NewApplier(reg)`
continues to work byte-for-byte.

## Real ApplyBootstrap (§5.6 step 1a)

1. KEK-unwrap storage + raft wrapped DEKs.
2. `Keystore.Set` both (idempotent on matching bytes; `ErrKeyConflict`
on mismatch → halts).
3. `WriteSidecar` with `Active.{Storage,Raft}` set + `keys[]` map
updated; crash-durable via §5.1 atomic write.
4. Batch-insert every `BatchRegistry` row via `ApplyRegistration` (§4.1
case-2-idempotent makes replays no-op).

Returns `ErrKEKNotConfigured` if any of `WithKEK` / `WithKeystore` /
`WithSidecarPath` is missing — the three are an indivisible trio.

**Crash-recovery ordering**: `Keystore.Set` (cheap, in-memory) →
`WriteSidecar` (durable IO) → batch insert. Recovery is Raft replaying
the same entry; every step is idempotent for matching inputs.

## Real ApplyRotation (§5.2 RotateDEK)

Handles `RotateSubRotateDEK` only; other sub-tags (rewrap-deks,
retire-dek, enable-storage-envelope, enable-raft-envelope) halt with
`ErrEncryptionApply` so 6B-1 cannot silently mis-apply later sub-tags.

For `RotateSubRotateDEK`:
1. KEK-unwrap the proposed wrapped DEK.
2. `Keystore.Set` at the new DEK ID.
3. `WriteSidecar`: Active slot updated for `Purpose` (Storage or Raft
only — unknown Purpose halts); keys[] map updated; old DEKs preserved
(retire is a separate Stage 6E sub-tag).
4. `ApplyRegistration` on the proposer's row.

## Tests

8 new applier tests covering happy paths + fail-closed boundaries:

| Test | What it pins |
|---|---|
| `TestApplyBootstrap_ReturnsKEKNotConfigured` | No-options default
returns `ErrKEKNotConfigured` |
| `TestApplyBootstrap_FullyWired` | Happy path; factor'd per-aspect
assertions |
| `TestApplyBootstrap_PartialWiring` (4 sub-cases) | All-or-nothing trio
invariant |
| `TestApplyBootstrap_KEKUnwrapFailure` | Synthetic Unwrap error
propagated; no partial install |
| `TestApplyBootstrap_Replay_Idempotent` | Same payload twice succeeds |
| `TestApplyRotation_ReturnsKEKNotConfigured` | No-options default |
| `TestApplyRotation_FullyWired` | Happy path with pre-bootstrap |
| `TestApplyRotation_UnknownSubTag` | Halts on `0xFE` sub-tag |
| `TestApplyRotation_UnknownPurpose` | Halts on unrecognised Purpose
enum |

Tests use a `fakeKEK` that pads/truncates wrapped input to
`encryption.KeySize` so `Keystore.Set`'s size check passes; same wrapped
input always produces the same DEK bytes (deterministic), exercising
idempotency.

## Caller audit (semantic change — `NewApplier` signature)

`NewApplier(registry)` → `NewApplier(registry, opts...)`. Variadic; the
no-options call (Stage 6A test caller) still compiles and behaves
identically. Audited:

```
grep -rn 'NewApplier(' --include='*.go'
```
- `main.go:711` — production caller; unchanged in this PR (still passes
no opts). 6B-2 will add `WithKEK + WithKeystore + WithSidecarPath`.
- `internal/encryption/applier_test.go` — test callers; existing
`NewApplier(newMapRegistryStore())` calls still work; new tests pass
opts.

No other callers.

## Five-lens self-review

1. **Data loss** — `WriteSidecar` uses the existing §5.1 crash-durable
write protocol. `Keystore.Set` is in-memory only but the FSM apply path
is replayed from Raft on restart. No silent skip: every error propagates
through the HaltApply seam.
2. **Concurrency / distributed failures** — Applier carries no per-call
mutable state. FSM apply already serialises `0x03/0x04/0x05` dispatches
under `applyMu`. Keystore is internally locked. Sidecar IO is
single-writer per §5.1.
3. **Performance** — `ApplyBootstrap`: 2 KEK unwraps + 2 `Keystore.Set`
+ 1 `WriteSidecar` + N writer-registry inserts (N = bootstrap-time
cluster size). `ApplyRotation`: 1+1+1+1. Bounded.
4. **Data consistency** — sidecar mutations crash-durable; install order
deterministic on replay; case-2-idempotent makes registry replay safe.
5. **Test coverage** — 8 unit tests; both happy paths and fail-closed
boundaries exercised. The fakeKEK is deterministic so replay tests are
reproducible.

## Test plan

- [x] `go test -race -timeout=60s ./internal/encryption/...` — PASS
- [x] `go build ./...` — PASS
- [x] `golangci-lint run ./internal/encryption/...` — 0 issues
- [x] Existing kv + store + main encryption tests — PASS
- [ ] Re-run full suite (kv + store + adapter + main + Jepsen) once 6B-2
wires it in production — Will follow with 6B-2

## Plan

This is **Stage 6B part 1**. **Stage 6B-2** will:
- Add `--kekFile` / `--kekUri` flags
- Add `--encryption-enabled` flag (default off)
- Load KEK at startup, populate Keystore from sidecar
- Re-enable Proposer + LeaderView in `registerEncryptionAdminServer`
gated on (`--encryption-enabled` AND `KEKConfigured()`)


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **New Features**
* Encryption key management bootstrap and rotation operations are now
fully operational, including proper key unwrapping, secure keystore
installation, and configurable options for encryption setup.

* **Tests**
* Comprehensive test coverage added for bootstrap and rotation
scenarios, including idempotency validation, failure handling, and
configuration verification.

<!-- review_stack_entry_start -->

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/bootjp/elastickv/pull/768?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack)

<!-- review_stack_entry_end -->

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
bootjp added a commit that referenced this pull request May 16, 2026
…e scope limitation

codex r2 P1 on round-2 (commit 0f946c1): the triple-gate
encryptionMutatorsEnabled() is purely node-local; during a
staged rolling upgrade where some peers still run the pre-6B-2
binary or have the flags unset, a mutator proposal committed
by a fully-configured leader would reach those replicas'
appliers and HaltApply on the under-configured side.

## Resolution

This is a real concern but the cluster-wide readiness check
belongs at the §7.1 Phase-1 cutover RPC layer
(enable-storage-envelope) in Stage 6D, not in the node-local
flag gate. PR #762's milestone breakdown explicitly assigns
the Voters ∪ Learners capability fan-out gate to Stage 6D.

Adding a partial cluster-wide check in Stage 6B-2 would
duplicate the §7.1 capability gate and ship before the
operator-facing GetCapability fan-out is wired into the
admin CLI. The right resolution is to document the scope
limitation explicitly so:

  1. Operators reading the code understand the gate's scope
     and the rolling-upgrade discipline expected at this
     layer (same constraint as PR #765's 6A→6B caveat).
  2. The Stage 6D PR has a clear handoff point — the
     cluster-wide gate is its scope, and this docstring
     points forward to that.

## Doc-only

No code paths changed; the gate's three-flag readback is
unchanged. Single 30-line docstring expansion explaining:

  - the gate is THIS NODE local
  - what cluster-wide readiness means at §7.1
  - why Stage 6B-2 deliberately defers it to Stage 6D
  - the operator-discipline constraint matching PR #765

## Caller audit

Pure doc edit. No callers to audit.

## Verification

- go build ./... — PASS
- go test -race -timeout=60s -run 'TestEncryptionAdmin|TestRegisterEncryptionAdminServer' . — PASS
- golangci-lint run ./... — 0 issues on touched files
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant