feat(encryption): Stage 6B part 1 — real ApplyBootstrap + ApplyRotation with KEK plumbing#768
Conversation
…on with KEK / Keystore / Sidecar plumbing First slice of Stage 6B per the PR #762 plan. Threads KEKUnwrapper + Keystore + sidecar path into the Applier via functional options and replaces the Stage 6A ErrKEKNotConfigured stubs with real §5.6 step 1a + §5.2 RotateDEK apply paths. The main.go wiring (flag introduction + RPC mutator gate + startup KEK loader + startup keystore re-populate) lands in the follow-up PR (6B-2); this commit ships only the applier internals + a fakeKEK-based test surface so the apply path is fully exercised end-to-end without touching main.go. ## API additions (internal/encryption/applier.go) - KEKUnwrapper interface — local declaration so encryption does not pick up a transitive dependency on the kek package. kek.Wrapper satisfies it structurally. - ApplierOption functional-option type plus four options: WithKEK(KEKUnwrapper) WithKeystore(*Keystore) WithSidecarPath(string) WithNowFunc(func() time.Time) WithNowFunc keeps tests deterministic; the production default is time.Now and feeds the sidecar's Created diagnostic field (not used for ordering — see §5.1). - NewApplier signature extended to variadic opts. The Stage 6A test caller NewApplier(reg) continues to work byte-for-byte; the production caller in main.go (6B-2) will pass WithKEK + WithKeystore + WithSidecarPath. ## ApplyBootstrap (real, §5.6 step 1a) Replaces the ErrKEKNotConfigured stub with: 1. KEK-unwrap storage + raft wrapped DEKs. 2. Keystore.Set both (idempotent on matching bytes; ErrKeyConflict on mismatch — caught by the dispatch). 3. WriteSidecar with Active.Storage / Active.Raft set + keys[] map updated for both DEK IDs. Crash-durable via the existing §5.1 atomic write protocol. 4. Batch-insert every BatchRegistry row via ApplyRegistration (the §4.1 case-2-idempotent path makes second-pass replays no-op). Returns ErrKEKNotConfigured if any of WithKEK / WithKeystore / WithSidecarPath are missing — the three are an indivisible trio (a partial wiring would either compute DEK bytes that get discarded, install bytes that vanish on restart, or record wrapped DEKs the local node cannot decrypt). Crash-recovery ordering: Keystore.Set fires first (in-memory, cheap), then WriteSidecar (durable IO), then batch insert. A crash anywhere is recovered by Raft replaying the same entry — Keystore.Set is idempotent for matching bytes, WriteSidecar is idempotent for the same payload, and ApplyRegistration's case-2-idempotent path makes the batch re-insert no-op. ## ApplyRotation (real, §5.2 RotateDEK) Handles RotateSubRotateDEK only; other sub-tags (rewrap-deks, retire-dek, enable-storage-envelope, enable-raft-envelope) halt with ErrEncryptionApply so Stage 6B-1 cannot silently mis-apply a later sub-tag. For RotateSubRotateDEK: 1. KEK-unwrap the proposed wrapped DEK. 2. Keystore.Set at the new DEK ID. 3. WriteSidecar: Active slot updated for Purpose (Storage or Raft only — unknown Purpose halts with ErrEncryptionApply); keys[] map updated for the new DEK ID; old DEKs preserved (retire is a separate sub-tag in Stage 6E). 4. ApplyRegistration on the proposer's row so §4.1 case-2 monotonicity covers their first encrypted write under the new DEK. Same fail-closed wiring-check as ApplyBootstrap. ## Tests (internal/encryption/applier_test.go) 8 new tests covering both happy paths and fail-closed boundaries: - TestApplyBootstrap_ReturnsKEKNotConfigured (existing, reframed) — no-options default still returns the defense-in-depth marker. - TestApplyBootstrap_FullyWired — happy path with factor'd per-aspect assertions (keystore / sidecar / registry rows). - TestApplyBootstrap_PartialWiring — 4 sub-cases pinning the all-or-nothing trio. - TestApplyBootstrap_KEKUnwrapFailure — synthetic Unwrap error propagated; no partial install. - TestApplyBootstrap_Replay_Idempotent — same payload twice succeeds. - TestApplyRotation_ReturnsKEKNotConfigured (existing, reframed) — no-options default. - TestApplyRotation_FullyWired — happy path with a pre-bootstrap; verifies Active slot for the rotated purpose advances while the other stays put + old DEK retained in keys[]. - TestApplyRotation_UnknownSubTag — halts with ErrEncryptionApply on 0xFE sub-tag. - TestApplyRotation_UnknownPurpose — same halt for unrecognised Purpose enum values. Tests use a fakeKEK implementation that pads/truncates wrapped input to encryption.KeySize so Keystore.Set's size check passes; same wrapped input always produces the same DEK bytes (deterministic), exercising the Keystore.Set idempotency path. ## Stage 5D safety boundary preserved registerEncryptionAdminServer is not touched in this PR. Mutating EncryptionAdmin RPCs continue to refuse with FailedPrecondition exactly as Stage 5D left them. ## Five-lens self-review 1. Data loss — WriteSidecar uses the existing §5.1 crash-durable write protocol. Keystore.Set is in-memory only but the FSM apply path is replayed from the Raft log on restart so the in-memory state is reconstructed. No silent skip: every error propagates through the HaltApply seam. 2. Concurrency / distributed failures — Applier carries no per-call mutable state; the WriterRegistryStore + Keystore + sidecar IO are the single sources of truth. FSM apply already serialises 0x03/0x04/0x05 dispatches under applyMu. Keystore is internally locked. Sidecar IO is single-writer per the §5.1 protocol. 3. Performance — ApplyBootstrap does 2 KEK unwraps + 2 Keystore.Set + 1 WriteSidecar + N writer-registry inserts where N is the cluster size at bootstrap. ApplyRotation does 1 KEK unwrap + 1 Set + 1 WriteSidecar + 1 row insert. Bounded. 4. Data consistency — sidecar mutations are crash-durable via WriteSidecar. Keystore install order (storage before raft) is deterministic on every replay. Registry rows replay through the §4.1 case-2-idempotent path so double-apply produces the same final state. 5. Test coverage — 8 new unit tests covering happy paths and fail-closed boundaries (partial wiring, KEK failure, unknown sub-tag, unknown purpose, replay idempotency). The fakeKEK is deterministic so replay tests are reproducible. ## Verification - go test -race -timeout=60s ./internal/encryption/... — PASS - go build ./... — PASS - golangci-lint run ./internal/encryption/... — 0 issues - Existing kv + store + main encryption tests — PASS ## Plan This is Stage 6B-1. The follow-up Stage 6B-2 PR will: - Add --kekFile / --kekUri flags - Add --encryption-enabled flag (default off) - Load KEK at startup, populate Keystore from sidecar - Re-enable Proposer + LeaderView in registerEncryptionAdminServer gated on (--encryption-enabled AND KEKConfigured())
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughThe PR implements Stage 6B encryption bootstrap and Stage 6B-1 rotation workflows. The applier gains KEK unwrapper, keystore, and sidecar path dependencies via functional options. ChangesStage 6B Encryption Bootstrap & Rotation Implementation
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@claude review Stage 6B-1 — applier internals only (real ApplyBootstrap + ApplyRotation). Mutator-RPC wiring stays OFF in this PR per the Stage 6 plan; 6B-2 will thread the flags + RPC gate. Looking for review on:
|
|
Claude finished @bootjp's task in 4m 30s —— View job Stage 6B-1 Applier ReviewTasks
Overall AssessmentThe implementation is solid. The crash-recovery ordering, interface design, and fail-closed behaviors are all sound. I found one documentation bug and one gap in test coverage worth addressing. Addressing Your 5 Questions1. Crash-recovery ordering — Sound. Keystore.Set (in-memory) → WriteSidecar (durable IO) → batch registry insert is correct. Walking each crash point:
One subtle thing worth noting in the code comments: the stated reason "an in-memory DEK only escapes to disk after the keystore install succeeded" is slightly misleading — the sidecar stores the wrapped DEK, not cleartext. The real invariant is that a Keystore.Set failure (e.g., 2. KEKUnwrapper interface placement — Correct. Local declaration at the consumer is idiomatic Go and the right call for exactly the reasons stated. I confirmed 3. Replay idempotency — Sound with one note. All three claims verify:
One edge case that isn't tested: if 4. Fail-closed trio — Sound, but test coverage has two gaps. I cannot identify a meaningful partial-wiring use case — the three are genuinely indivisible for the reasons stated. However,
The implementation correctly handles all 6 via 5. Unknown sub-tag / Purpose halt — Right call. Halting with Issues FoundBug: Stale comment fragment on The doc comment has a leftover line from the Stage 6A stub: // ApplyBootstrap returns ErrKEKNotConfigured as the Stage 6A // ← stale fragment
// ApplyBootstrap implements §5.6 step 1a's initial bootstrap apply:The first line is the beginning of the old Stage 6A comment that was never removed when the real implementation replaced it. This reads as two conflicting doc sentences: one claiming it "returns ErrKEKNotConfigured" and the next claiming it "implements §5.6 step 1a". Fix this → Test coverage gap: 2 missing partial-wiring sub-cases —
{name: "kek_and_sidecar_path", opts: []encryption.ApplierOption{
encryption.WithKEK(&fakeKEK{}),
encryption.WithSidecarPath("/tmp/never-written"),
}},
{name: "keystore_and_sidecar_path", opts: []encryption.ApplierOption{
encryption.WithKeystore(encryption.NewKeystore()),
encryption.WithSidecarPath("/tmp/never-written"),
}},Minor Observations (no action needed)
VerdictTwo concrete items worth landing before merge: the stale comment fragment (trivial fix) and the two missing partial-wiring test cases (completes coverage of the all-or-nothing invariant you argued). All five design questions are answered affirmatively — the implementation is correct. |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/encryption/applier.go`:
- Around line 309-337: In ApplyBootstrap, detect and reject bootstrap payloads
where p.StorageDEKID == p.RaftDEKID by returning a wrapped error (e.g.,
errors.Errorf or errors.Wrap) instead of proceeding; this prevents the second
key write from overwriting the first (check the equality early in
Applier.ApplyBootstrap before Unwrap/Set/writeBootstrapSidecar). Also add the
same guard in the code path that writes sidecar keys (where
writeBootstrapSidecar or the sidecar sc.Keys assignments occur) to ensure the
sidecar writer also refuses payloads with identical DEK IDs.
- Around line 130-132: Guard against nil injections by validating both the
provided clock and option functions: in WithNowFunc(now func() time.Time) add an
early check and panic (or return a clear error) immediately if now == nil (e.g.,
panic("WithNowFunc: nil now func")), and also harden the Applier options
application loop (where ApplierOption funcs are iterated, e.g., in NewApplier or
the method that applies options) to check for nil options and fail fast (panic
or return an error) if any opt == nil so a nil ApplierOption cannot later cause
a runtime panic when a.now() is invoked.
- Around line 403-424: ApplyRotation must verify that the proposer registration
targets the same DEK ID as the rotated DEK: extract the DEK ID from
p.ProposerRegistration and compare it to p.DEKID (after you've unwrapped the dek
and persisted it via a.keystore.Set and/or a.writeRotationSidecar), and if they
differ return a wrapped error (e.g. errors.Wrapf(ErrEncryptionApply, "applier:
proposer-registration DEKID %s does not match rotation DEKID %s", regDEKID,
p.DEKID)). Perform this check before calling
a.ApplyRegistration(p.ProposerRegistration) so mismatched registrations are
rejected and the “first writer row for new DEK” monotonicity is preserved.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b5371d0e-ac95-4130-9d8c-67ed5d34df7b
📒 Files selected for processing (2)
internal/encryption/applier.gointernal/encryption/applier_test.go
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 24a7e7607c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if err := a.ApplyRegistration(p.ProposerRegistration); err != nil { | ||
| return errors.Wrap(err, "applier: rotation proposer-registration insert") | ||
| } |
There was a problem hiding this comment.
Validate rotation registration matches rotated DEK
ApplyRotation installs and activates p.DEKID in keystore/sidecar, but then blindly applies p.ProposerRegistration without checking that its DEKID is the same key. A malformed or buggy rotation payload can therefore mutate writer-registry state for an unrelated DEK while leaving the new DEK unregistered in this atomic step, silently violating the intended rotate-dek invariant and potentially causing later false rollback/collision behavior on the wrong DEK.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Code Review
This pull request implements the core logic for bootstrapping and rotating encryption keys within the Applier. It introduces a functional options pattern for configuring the Applier with a KEK unwrapper, keystore, and sidecar path, enabling the transition from the previous stub implementation to a fully functional state. The changes include ApplyBootstrap for initial key setup and ApplyRotation for updating Data Encryption Keys (DEKs), both designed with crash-durability and Raft-replay idempotency in mind. Feedback was provided regarding the synchronous nature of batch registry inserts in ApplyBootstrap, suggesting that this blocking behavior be explicitly documented and potentially optimized into an atomic batch operation to minimize Raft loop latency.
| for i, reg := range p.BatchRegistry { | ||
| if err := a.ApplyRegistration(reg); err != nil { | ||
| return errors.Wrapf(err, "applier: bootstrap batch registry insert at index %d (dek_id=%d, full_node_id=%#x)", | ||
| i, reg.DEKID, reg.FullNodeID) | ||
| } | ||
| } |
There was a problem hiding this comment.
The ApplyBootstrap method performs synchronous writes for each entry in p.BatchRegistry. While synchronous I/O within the Raft loop is used in this repository to ensure consistency during critical operations (like configuration changes), this blocking startup behavior should be explicitly documented in the method description. Additionally, consider if these sequential calls can be optimized into a single atomic batch to minimize the duration the Raft loop is blocked for large clusters.
References
- For limited-scope implementations, explicitly document blocking startup behavior to prevent misuse.
- When a method is intended to block until a condition is met or an operation completes, explicitly state this blocking semantic in the documentation.
- Synchronous I/O in the raft loop is acceptable when necessary to ensure consistency during membership or configuration changes.
Four actionable findings on round-1 (commit 24a7e76), all addressed in this commit: ## coderabbit Major (line 132) — Guard nil clock injection WithNowFunc(nil) would silently overwrite the default time.Now and nil-deref at apply time when a.now() is invoked. NewApplier now rejects: - any nil ApplierOption (each opt iterated and checked) - a final a.now == nil (sentinel for WithNowFunc(nil) having overwritten the default) Both checks return a typed error at construction so the misconfiguration surfaces at startup rather than deep inside a Raft apply loop. ## coderabbit Major (line 337) — Reject identical bootstrap DEK IDs If StorageDEKID == RaftDEKID, the second sc.Keys[...] assignment in writeBootstrapSidecar overwrites the first, silently mis-labelling the surviving key's purpose. Halts apply with ErrEncryptionApply at the top of ApplyBootstrap (before the keystore install) so the sidecar is never corrupted. ## codex P2 + coderabbit Major (line 423/424) — Proposer DEK mismatch ApplyRotation installs p.DEKID into the keystore + sidecar but then blindly applied p.ProposerRegistration even when its DEKID targeted a different key. The whole point of bundling the proposer-registration atomically with rotate-dek (§5.2) is to cover the proposer's first encrypted write under the NEW DEK with a §4.1 case-1 first-seen row — a mismatch would mutate writer-registry state for an unrelated DEK and leave the new DEK unregistered. Halts apply with ErrEncryptionApply at the top of ApplyRotation (after the sub_tag check, before the keystore install) so an inconsistent rotation payload cannot atomically commit. ## Tests 4 new tests pin the guards: - TestNewApplier_RejectsNilOption - TestNewApplier_RejectsNilNowFunc - TestApplyBootstrap_RejectsIdenticalDEKIDs — additionally verifies the sidecar is NOT written after the rejection (no side effects). - TestApplyRotation_RejectsProposerDEKMismatch ## Caller audit All three apply-time guards are stricter input validation — strictly fail-CLOSED at the input layer. The single production caller (kv/fsm dispatch) propagates the returned error through the HaltApply seam, so the halt surface is identical to any other ErrEncryptionApply path. The 4 partial-wiring sub-cases in TestApplyBootstrap_PartialWiring still pass because they all use valid (non-nil) options. The NewApplier signature did not change; existing callers that pass valid (non-nil) options remain unaffected. ## Verification - go test -race -timeout=60s -run 'TestApply|TestNewApplier' ./internal/encryption/... — PASS (16 tests) - golangci-lint run ./internal/encryption/... — 0 issues
|
@claude review Round-2 (commit
All three apply-time guards are stricter input validation — fail-CLOSED at input layer. Single production caller ( 4 new tests:
Verification
|
|
Claude finished @bootjp's task in 1m 31s —— View job Stage 6B-1 Round-2 ReviewTasks
Round-2 Fix VerificationAll four round-2 fixes are correctly implemented. Fix 1 — The nil-option check fires per-index ( Fix 2 — Identical DEK IDs guard ( Guard fires before the first Fix 3 — Proposer-registration DEK mismatch guard ( Guard placement is correct: after the sub-tag check but before Remaining Items from Round-1 (Not in the Round-2 Fix Table)Stale comment fragment — // ApplyBootstrap returns ErrKEKNotConfigured as the Stage 6A ← stale fragment
// ApplyBootstrap implements §5.6 step 1a's initial bootstrap apply:The first line is the beginning of the Stage 6A stub comment and was never removed when the real implementation replaced it. The two sentences directly contradict each other. Fix this → Two missing partial-wiring test cases (
Suggested additions (no new test logic, same pattern): {name: "kek_and_sidecar_path", opts: []encryption.ApplierOption{
encryption.WithKEK(&fakeKEK{}),
encryption.WithSidecarPath("/tmp/never-written"),
}},
{name: "keystore_and_sidecar_path", opts: []encryption.ApplierOption{
encryption.WithKeystore(encryption.NewKeystore()),
encryption.WithSidecarPath("/tmp/never-written"),
}},No New Issues in Round-2 CodeThe three new guards are all positioned before any state mutation, the new tests are correct, and the fakeKEK's VerdictTwo carry-forward items remain: the one-line stale comment and the two missing partial-wiring cases. The round-2 fixes themselves are clean and all four were addressed correctly. The implementation is sound for merge modulo those two. |
claude[bot] r1 verdict (post-r2 push, ungated by timing) was 'implementation is solid' but flagged 2 carry-over items, and gemini-code-assist posted a separate medium on the same commit. All addressed here: ## claude[bot] r1 #1 — Stale comment fragment at ApplyBootstrap godoc Line 289 had a leftover line from the Stage 6A stub: // ApplyBootstrap returns ErrKEKNotConfigured as the Stage 6A // ApplyBootstrap implements §5.6 step 1a's initial bootstrap apply: The first line is the head of the old Stage 6A stub doc that was never removed when the real implementation replaced it. Reads as two conflicting sentences. Removed. ## claude[bot] r1 #1b — Crash-recovery wording precision The 'an in-memory DEK only escapes to disk after the keystore install succeeded' phrasing was slightly misleading: the sidecar stores the WRAPPED DEK (KEK-output bytes), not cleartext. The actual invariant the ordering achieves is: ErrKeyConflict from Keystore.Set aborts the apply before any disk mutation. Reworded to state that invariant directly. ## claude[bot] r1 #2 — Missing partial-wiring sub-cases (test gap) TestApplyBootstrap_PartialWiring covered 4 of 6 possible partial combinations. Added the missing two: - kek_and_sidecar_path (no keystore) - keystore_and_sidecar_path (no kek) Both correctly hit bootstrapAndRotationConfigured() == false and return ErrKEKNotConfigured. Test now exhaustively covers the 6-combination partial-wiring matrix. ## gemini-code-assist medium — Blocking behaviour documentation gemini flagged that ApplyBootstrap's synchronous per-row inserts (one pebble.Sync per BatchRegistry entry) block the Raft apply loop. The synchronous IO is part of the §6.3 contract — the engine's applyMu already serialises the dispatch, so blocking IS the expected shape — but the godoc did not call it out explicitly. Added a 'Blocking behaviour' section to the ApplyBootstrap godoc: - KEK.Unwrap may dial a KMS in production providers. - WriteSidecar fsyncs to disk. - Every BatchRegistry row is one pebble.Sync. - BootstrapBatchRowCap = 1<<14 bounds the worst case. - A future optimisation could replace per-row Set with pebble.Batch.Commit(pebble.Sync); current shape matches Stage 6A ApplyRegistration semantics row-for-row. This documents the contract; the optimisation is deferred — not a blocker for 6B-1. ## Caller audit Doc-only edits (1 stale-line removal + 1 wording precision fix + blocking-behaviour godoc addition) + 2 new test sub-cases. No code paths changed; no callers to audit. ## Verification - go test -race -timeout=60s -run 'TestApply|TestNewApplier' ./internal/encryption/... — PASS (18 tests, +2 sub-cases) - golangci-lint run ./internal/encryption/... — 0 issues
|
@claude review Round-3 (commit
Verification
|
|
Claude finished @bootjp's task in 2m 11s —— View job Stage 6B-1 Round-3 ReviewTasks
Round-3 Fix VerificationAll four round-3 items are correctly addressed. Fix 1 — Stale comment removed ( The godoc now starts cleanly with Fix 1b — Crash-recovery wording ( The reworded invariant is accurate: "Keystore.Set fires before WriteSidecar so an ErrKeyConflict from Set aborts the apply before any disk mutation." This correctly describes what the ordering achieves — the abort-before-disk-mutation guarantee — without the previous misleading reference to cleartext DEKs escaping to disk. ✅ Fix 2 — Two missing partial-wiring test cases ( Both Fix 3 — Blocking behaviour documented ( The "Blocking behaviour" section correctly characterises each I/O touch point (KMS dial, fsync, pebble.Sync per registry row), explains why synchronous IO is the right contract for the FSM apply path, names the One New Issue Introduced by the Round-2 Guard
The round-2 proposer-registration DEK mismatch guard ( Fix: set err := app.ApplyRotation(fsmwire.RotationPayload{
SubTag: fsmwire.RotateSubRotateDEK,
DEKID: 9,
Purpose: fsmwire.Purpose(0xFE), // not Storage / Raft
Wrapped: []byte("y"),
ProposerRegistration: fsmwire.RegistrationPayload{DEKID: 9, FullNodeID: 0x01, LocalEpoch: 0}, // must match DEKID
})VerdictThree carry-overs from rounds 1 and 2 are closed. One new issue: |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 99c8b76037
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| for i, reg := range p.BatchRegistry { | ||
| if err := a.ApplyRegistration(reg); err != nil { | ||
| return errors.Wrapf(err, "applier: bootstrap batch registry insert at index %d (dek_id=%d, full_node_id=%#x)", |
There was a problem hiding this comment.
Reject bootstrap rows that target non-bootstrap DEKs
ApplyBootstrap writes every BatchRegistry row without checking that each row’s DEKID matches either StorageDEKID or RaftDEKID. A malformed 0x04 payload can therefore persist writer-registry state for unrelated DEKs while the sidecar/keystore only install the two bootstrap keys, violating the bootstrap invariant and potentially triggering false rollback/collision behavior on those stray DEKs later. The applier should fail closed here (similar to the rotation proposer-DEK consistency check) before calling ApplyRegistration.
Useful? React with 👍 / 👎.
…K) + claude[bot] r3 test correctness Two findings on round-3 (commit 99c8b76), both addressed: ## codex P2 — Reject bootstrap rows that target non-bootstrap DEKs ApplyBootstrap was writing every BatchRegistry row without checking that each row's DEKID matches either StorageDEKID or RaftDEKID. A malformed 0x04 payload could persist writer-registry state for unrelated DEKs while the bootstrap installs only the two declared keys — silently breaking the §4.1 first-writer invariant on the next post-bootstrap write under that foreign DEK. This is the same class of input-validation gap as the round-2 proposer-DEK-mismatch guard on ApplyRotation. Fix: validate every BatchRegistry[i].DEKID matches one of the two bootstrap DEKs before any state mutation. Halt with ErrEncryptionApply on mismatch. The check is placed in a new validateBootstrap helper alongside the existing wiring-check + identical-DEK-ID guard so the dispatch hot path stays below the cyclop complexity budget. ## claude[bot] r3 — TestApplyRotation_UnknownPurpose tests wrong path The round-2 proposer-DEK-mismatch guard (added in 15681ca) fires before sidecarPurposeFor in ApplyRotation. The test payload used a zero-value ProposerRegistration (DEKID=0) with p.DEKID=9, so 0 != 9 triggered the mismatch guard and the function returned ErrEncryptionApply from THERE — sidecarPurposeFor was never reached. The test still passed (it checked ErrEncryptionApply) but it was testing the wrong code path. Fix: set ProposerRegistration.DEKID: 9 in the test payload so the mismatch guard passes and sidecarPurposeFor is reached. ## New test TestApplyBootstrap_RejectsForeignBatchRegistryDEK pins the new guard. The payload includes one valid row (DEKID matching StorageDEKID) and one foreign row (DEKID 99). The applier halts with ErrEncryptionApply BEFORE any state mutation, so the test also asserts: - keystore.Has(1)/Has(2) are both false (no keystore mutation) - sidecar file does not exist (no disk write) ## Caller audit Both fixes are stricter input validation — fail-CLOSED at the input layer. Single production caller (kv/fsm dispatch) is unaffected; the halt surface is identical to other ErrEncryptionApply paths. The 4 partial-wiring sub-cases in TestApplyBootstrap_PartialWiring still pass — they all use empty BatchRegistry slices so the new loop is trivially satisfied. ## Verification - go test -race -timeout=60s -run 'TestApply|TestNewApplier' ./internal/encryption/... — PASS (19 tests) - golangci-lint run ./internal/encryption/... — 0 issues (extracted validateBootstrap helper keeps ApplyBootstrap under the cyclop budget)
|
@claude review Round-4 (commit
New test Caller auditBoth fixes are stricter input validation, fail-CLOSED at input layer, single production caller (kv/fsm dispatch) unaffected. 4 partial-wiring sub-cases still pass (empty BatchRegistry trivially satisfies the new loop). Verification
|
|
Claude finished @bootjp's task in 2m 15s —— View job Stage 6B-1 Round-4 ReviewTasks
Round-4 Fix VerificationBoth round-4 fixes are correctly implemented. Fix 1 — Foreign BatchRegistry DEKID guard (
Critically, The Fix 2 — Setting One Design Observation (no action needed)
With the mismatch guard passing, the execution path is now: This is behaviorally correct: the FSM halts, the operator must fix the malformed entry before the cluster can resume, and on restart the keystore is rebuilt from the sidecar (6B-2 startup sequence), so the orphaned DEK 9 entry vanishes. The test doesn't verify this side effect exists but that's fine — the test is pinning the fail-closed behavior of Compare with No New Issues FoundThe |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 88da2b21f0
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| sc.Active.Storage = p.StorageDEKID | ||
| sc.Active.Raft = p.RaftDEKID |
There was a problem hiding this comment.
Reject repeated bootstrap once active DEKs exist
ApplyBootstrap currently reads any existing sidecar and then unconditionally overwrites Active.Storage/Active.Raft, so a second committed bootstrap entry with different DEK IDs will succeed and re-point the cluster’s active keys instead of failing as an already-bootstrapped state. This bypasses the intended one-time bootstrap invariant and allows key-state transitions outside the rotation path, which can leave writer-registry and key lifecycle semantics inconsistent with the bootstrap/rotation contract.
Useful? React with 👍 / 👎.
…h different DEKs codex P1 on round-4 (commit 88da2b2): ApplyBootstrap was reading the existing sidecar and unconditionally overwriting Active.{Storage,Raft}, so a second committed bootstrap entry with different DEK IDs would succeed and re-point the cluster's active keys instead of failing as already-bootstrapped. This bypasses the §5.6 one-time-bootstrap invariant and lets key-state transitions happen outside the rotation path, leaving writer-registry and key-lifecycle semantics inconsistent. ## Fix validateBootstrap now calls checkBootstrapIdempotency after the per-row BatchRegistry validation. The idempotency check: - sidecar absent → first bootstrap, fall through - sidecar present but Active=(0,0) → never bootstrapped (or a malformed sidecar), fall through. The §9.1 startup guard in Stage 6C will reject the malformed-sidecar shape; this apply-time check focuses on the bootstrap invariant. - Active matches the payload's (StorageDEKID, RaftDEKID) → idempotent replay of the SAME bootstrap (e.g., crash after apply but before snapshot). Allow. - Active differs from the payload → second bootstrap with different DEKs. Halt with ErrEncryptionApply. checkBootstrapIdempotency is split out of validateBootstrap to keep the dispatch path under the cyclop complexity budget (validateBootstrap was 13, max is 10). ## Test TestApplyBootstrap_RejectsSecondBootstrapWithDifferentDEKs: - First bootstrap with (Storage=1, Raft=2) succeeds. - Second bootstrap with (Storage=7, Raft=8) is rejected with ErrEncryptionApply. - Post-rejection sidecar Active is verified to STILL be (1, 2) — the rejected bootstrap must not mutate Active. The existing TestApplyBootstrap_Replay_Idempotent continues to pass: same payload twice → second apply hits the matching-DEK branch of checkBootstrapIdempotency → falls through → succeeds. ## Caller audit Stricter input validation, fail-CLOSED. Single production caller (kv/fsm dispatch) propagates via HaltApply. Idempotent replay (matching DEK IDs) still works, so no change to crash- recovery semantics. The new rejected case is only a second bootstrap with DIFFERENT DEK IDs — operationally only reachable from a malformed or malicious BootstrapEncryption RPC. ## Verification - go test -race -timeout=60s -run 'TestApply|TestNewApplier' ./internal/encryption/... — PASS (20 tests) - golangci-lint run ./internal/encryption/... — 0 issues
|
@claude review Round-5 (commit codex P1
FixNew
Test
Caller auditStricter fail-CLOSED at input layer. Single production caller ( Verification
|
|
Claude finished @bootjp's task in 1m 43s —— View job Stage 6B-1 Round-5 ReviewTasks
Round-5 Fix Verification
All four state transitions are correctly handled (
The function is correctly positioned at the tail of Guard fires before any state mutation — correct ordering confirmed.
On the second apply of the same payload: Note:
The test correctly:
One no-side-effects assertion is missing — if ks.Has(7) || ks.Has(8) {
t.Error("keystore mutated despite rejection of second bootstrap")
}Minor Observations (no action needed)Double sidecar read on idempotent replay ( The VerdictRound-5 is clean. The |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f27a8a0fc7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if sc.Active.Storage == p.StorageDEKID && sc.Active.Raft == p.RaftDEKID { | ||
| return nil |
There was a problem hiding this comment.
Reject bootstrap replays unless full payload matches
The idempotency check currently treats any bootstrap as valid replay when Active.Storage and Active.Raft match, then ApplyBootstrap still executes sidecar writes and the full BatchRegistry apply loop. That means a second committed bootstrap with the same DEK IDs but different BatchRegistry (or re-wrapped bytes that unwrap to the same DEK) can mutate writer-registry/sidecar state after the cluster is already bootstrapped, which violates the one-time bootstrap invariant. Fresh evidence in this revision is that checkBootstrapIdempotency returns early on ID match while ApplyBootstrap continues with for i, reg := range p.BatchRegistry.
Useful? React with 👍 / 👎.
…ay check) + claude r5 test gap Two findings on round-5 (commit f27a8a0), both addressed: ## codex P1 — Reject bootstrap replays unless full payload matches Round-5's checkBootstrapIdempotency matched on DEK IDs alone. That allowed a second bootstrap entry committed with the SAME DEK IDs but DIFFERENT wrapped DEK bytes (or BatchRegistry rows) to flow past the idempotency check, then re-run writeBootstrapSidecar (overwriting keys[].Wrapped with new bytes) and iterate the new BatchRegistry — potentially inserting writer-registry rows for previously-unregistered nodes through the §4.1 case-1 first-seen path. ## Fix checkBootstrapIdempotency now compares wrapped DEK bytes against the sidecar's existing keys[id].Wrapped entries after the DEK-ID match passes. Raft determinism couples wrapped bytes to the entry hash — so: - Legitimate replay: same Raft entry, same payload bytes, wrapped bytes match → fall through to idempotent apply. - Malformed/malicious second bootstrap with different wrapped material at the same DEK IDs: wrapped bytes differ → halt with ErrEncryptionApply. If the wrapped bytes match, BatchRegistry must also match (Raft entry hash couples ALL fields), so the registry batch insert is provably idempotent. The new logic is split into checkBootstrapWrappedMatches to keep the dispatch path under the cyclop complexity budget. ## Defense-in-depth for sidecar inconsistency If sc.Active matches the payload but sc.Keys[id] is missing for either DEK ID, the sidecar is internally inconsistent — halt with a descriptive error rather than nil-deref on the missing map entry. ## claude r5 — Missing no-side-effects assertion TestApplyBootstrap_RejectsSecondBootstrapWithDifferentDEKs verified the sidecar Active was not re-pointed but did not check that the rejected DEKs (7, 8) were absent from the keystore. Pattern set by TestApplyBootstrap_RejectsForeignBatchRegistryDEK requires both. Added the ks.Has(7) || ks.Has(8) assertion. ## New test TestApplyBootstrap_RejectsSecondBootstrapWithSameIDsDifferentWrapped: - First bootstrap with wrapped 's1-orig' / 'r2-orig' succeeds. - Second bootstrap with same DEK IDs but wrapped 's1-tampered' / 'r2-tampered' AND a previously- unregistered node 0x99 in BatchRegistry is rejected. - Post-rejection: sidecar keys[].Wrapped still has ORIGINAL bytes (not tampered); writer-registry row for node 0x99 was NOT inserted (the rejection fired before ApplyRegistration could case-1-insert it). ## Caller audit Stricter input validation, fail-CLOSED at the input layer. Same as previous rounds: single production caller (kv/fsm dispatch) propagates via HaltApply. Legitimate Raft replay (matching wrapped bytes) still succeeds, so no change to crash-recovery semantics. ## Verification - go test -race -timeout=60s -run 'TestApply|TestNewApplier' ./internal/encryption/... — PASS (21 tests) - golangci-lint run ./internal/encryption/... — 0 issues
|
@claude review Round-6 (commit codex P1Round-5's idempotency check matched on DEK IDs alone. A second bootstrap with SAME DEK IDs but DIFFERENT wrapped bytes (or BatchRegistry rows) would have flowed past — re-running writeBootstrapSidecar with new wrapped bytes + iterating new BatchRegistry rows (potentially case-1-inserting unregistered nodes). Fix
Added defense-in-depth: if Active matches the payload but keys[id] is missing for either DEK ID, the sidecar is internally inconsistent → halt with a descriptive error. claude r5 gapAdded New test
Caller auditStricter fail-CLOSED. Legitimate Raft replay (matching wrapped bytes) still succeeds. Verification
|
|
Claude finished @bootjp's task in 2m 29s —— View job Stage 6B-1 Round-6 ReviewTasks
Round-6 Fix VerificationFix 1 — Both the wrapped-bytes check and the defense-in-depth guard are correctly implemented. Call chain: Walking
The missing-keys guard is correctly positioned and catches the pathological case where Fix 2 —
Fix 3 — The test exercises the new code path correctly:
The Raft-determinism argument in the PR description is sound: if wrapped bytes match, the entire log entry is the same, so BatchRegistry must match too. The wrapped-bytes check is the correct proxy for "legitimate replay."
On the second apply: No New Issues FoundThe implementation is clean. All five state transitions in One observation for completeness (no action needed): VerdictAll six rounds of findings are closed. No new issues in round-6. The implementation is ready for merge. |
…ag + mutator-RPC gate (#776) ## Summary Stage 6B-2 per the [PR #762 plan](#762). Completes Stage 6B with the operator-facing surface: `--kekFile` / `--encryption-enabled` flags, KEK loader at startup, applier threading, and the re-enabled mutator-RPC wiring in `registerEncryptionAdminServer` gated on (`--encryption-enabled` AND `--kekFile` non-empty). Builds on Stage 6B-1 (PR #768) which landed the real `ApplyBootstrap` / `ApplyRotation` paths. This PR wires those paths into production by threading KEK + Keystore + sidecar path through the FSM construction and re-enabling the EncryptionAdmin mutator RPCs under a double gate. ## New flags | Flag | Default | Purpose | |---|---|---| | `--kekFile` | `""` | §5.1 KEK file path (32 raw bytes, owner-only mode). Constructs `kek.FileWrapper` at startup; threaded into the §6.3 EncryptionApplier. `--kekUri` deferred to Stage 9 (KMS providers). | | `--encryption-enabled` | `false` | §6.5 opt-in to the mutating EncryptionAdmin RPCs. Required (alongside `--kekFile`) for `BootstrapEncryption`/`RotateDEK`/`RegisterEncryptionWriter` to wire. | ## Double-gate Both flags are independently necessary for the mutator wiring: ```go func encryptionMutatorsEnabled() bool { return *encryptionEnabled && *kekFile != "" } ``` - `--encryption-enabled` alone → no KEK loaded → mutators wired but ApplyBootstrap would HaltApply with `ErrKEKNotConfigured` on every replica. The RPC-layer gate refuses BEFORE the proposal commits. - `--kekFile` alone → no operator opt-in → mutators stay refused at the RPC boundary. - BOTH set → `WithEncryptionAdminProposer` + `WithEncryptionAdminLeaderView` installed and mutators reach the §6.3 applier through the shard's engine. ## Caller audit (semantic changes — signatures) - **`buildShardGroups`** gains 3 new tail params `(kekWrapper *kek.FileWrapper, keystore *encryption.Keystore, sidecarPath string)`. 1 production caller (`main.go run()`), 3 test callers (`multiraft_runtime_test.go` ×2, `main_bootstrap_e2e_test.go` ×1) — all updated to pass `nil, nil, ""` for the no-encryption posture. - **`registerEncryptionAdminServer`** gains 2 new tail params `(enableMutators bool, engine encryptionAdminEngine)`. 1 production caller (`main.go startRaftServers`), 2 test callers (`main_encryption_admin_test.go`) — all updated. - The mutator gate is fail-CLOSED by default at every site that does not explicitly opt in. ## Test surface | Test | Pins | |---|---| | `TestEncryptionAdmin_MutatingRPCRefusedWhenGateOff` (3 sub-cases) | All 3 corners of (enableMutators, engine) with at least one false → `FailedPrecondition` | | `TestEncryptionAdmin_MutatingRPCEnabledWhenGateOn` | The (true, non-nil) corner — asserts the `FailedPrecondition` gate is NO LONGER firing | A `stubEncryptionAdminEngine` satisfies the `encryptionAdminEngine` interface for the gate-on test (no actual Raft proposal — the stub `Propose()` returns success). ## Stage 5D safety boundary preserved The Stage 5D regression — operators who set `--encryptionSidecarPath` alone must not see mutators wired — is preserved by the double gate. `sidecarPath` alone is strictly the capability surface; the mutator wiring is gated on the AND of two SEPARATE flags. ## Five-lens self-review 1. **Data loss** — no data path touched. The applier remains fail-closed under HaltApply for any `ErrEncryptionApply` return. Pebble Sync semantics unchanged. 2. **Concurrency / distributed failures** — shared Keystore internally locked; KEK wrapper documented safe for concurrent use; per-shard applier reads from shared instances. 3. **Performance** — no hot-path change. KEK loading is a single file read at startup. Per-shard applier construction adds one slice allocation per group (0–3 element slice). 4. **Data consistency** — mutator gate is the load-bearing safety boundary. With both gate inputs true, the applier still has Stage 6B-1's input-validation guards (bootstrap idempotency, foreign-DEK batch rejection, proposer-DEK mismatch, etc.). 5. **Test coverage** — 4 gate corners + the existing applier tests from 6B-1 + the `buildShardGroups` callers. ## Test plan - [x] `go test -race -timeout=120s -run 'TestEncryptionAdmin|TestRegisterEncryptionAdminServer|TestApply|TestNewApplier|TestWriterRegistry' ./internal/encryption/... ./store/... ./kv/... .` — PASS - [x] `go build ./...` — PASS - [x] `golangci-lint run ./...` — 0 issues on touched files (cyclop / mnd under budget via factored `loadKEKWrapperFromFlag` + `applierOptionsFor` + `encryptionMutatorsEnabled` helpers) - [ ] Full Jepsen suite — not run in this PR (no envelope cutover happens until Stage 6D; the mutator path is reachable but the operator must hand-craft a `BootstrapEncryption` RPC, which is not part of the Jepsen workload set) ## Plan Stage 6B is now complete. The next stages per [PR #762](#762): - **6C** — §9.1 startup refusal guards - **6D** — `enable-storage-envelope` admin RPC + §7.1 Phase-1 cutover - **6E** — `enable-raft-envelope` admin RPC + §7.1 Phase-2 cutover - **6F** — `--encryption-rotate-on-startup` ergonomics <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes * **New Features** * Added encryption opt-in configuration with new command-line flags for enabling encryption support. * Encryption admin mutations are now conditionally gated and fail safely when encryption is not properly configured. * **Tests** * Updated test coverage for encryption admin server gating behavior. <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/bootjp/elastickv/pull/776?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
Stage 6B part 1 per the PR #762 plan. Lands the real
ApplyBootstrapandApplyRotationpaths ininternal/encryption.Applier— KEK-unwrap + Keystore install + crash-durable sidecar mutate + writer-registry batch insert. Replaces the Stage 6AErrKEKNotConfiguredstubs.Mutator-RPC wiring in
registerEncryptionAdminServerstays OFF (Stage 5D posture) so the operator surface is unchanged. The follow-up 6B-2 PR threads--kekFile/--encryption-enabledflags + startup KEK loader + keystore re-populate + re-enables mutator wiring gated on(--encryption-enabled AND KEKConfigured()).API additions
KEKUnwrapperinterface — local declaration so the encryption package does not pick up a transitive dependency on the kek package.kek.Wrappersatisfies it structurally.ApplierOptionfunctional-options:WithKEK,WithKeystore,WithSidecarPath,WithNowFunc. The Stage 6A callerNewApplier(reg)continues to work byte-for-byte.Real ApplyBootstrap (§5.6 step 1a)
Keystore.Setboth (idempotent on matching bytes;ErrKeyConflicton mismatch → halts).WriteSidecarwithActive.{Storage,Raft}set +keys[]map updated; crash-durable via §5.1 atomic write.BatchRegistryrow viaApplyRegistration(§4.1 case-2-idempotent makes replays no-op).Returns
ErrKEKNotConfiguredif any ofWithKEK/WithKeystore/WithSidecarPathis missing — the three are an indivisible trio.Crash-recovery ordering:
Keystore.Set(cheap, in-memory) →WriteSidecar(durable IO) → batch insert. Recovery is Raft replaying the same entry; every step is idempotent for matching inputs.Real ApplyRotation (§5.2 RotateDEK)
Handles
RotateSubRotateDEKonly; other sub-tags (rewrap-deks, retire-dek, enable-storage-envelope, enable-raft-envelope) halt withErrEncryptionApplyso 6B-1 cannot silently mis-apply later sub-tags.For
RotateSubRotateDEK:Keystore.Setat the new DEK ID.WriteSidecar: Active slot updated forPurpose(Storage or Raft only — unknown Purpose halts); keys[] map updated; old DEKs preserved (retire is a separate Stage 6E sub-tag).ApplyRegistrationon the proposer's row.Tests
8 new applier tests covering happy paths + fail-closed boundaries:
TestApplyBootstrap_ReturnsKEKNotConfiguredErrKEKNotConfiguredTestApplyBootstrap_FullyWiredTestApplyBootstrap_PartialWiring(4 sub-cases)TestApplyBootstrap_KEKUnwrapFailureTestApplyBootstrap_Replay_IdempotentTestApplyRotation_ReturnsKEKNotConfiguredTestApplyRotation_FullyWiredTestApplyRotation_UnknownSubTag0xFEsub-tagTestApplyRotation_UnknownPurposeTests use a
fakeKEKthat pads/truncates wrapped input toencryption.KeySizesoKeystore.Set's size check passes; same wrapped input always produces the same DEK bytes (deterministic), exercising idempotency.Caller audit (semantic change —
NewAppliersignature)NewApplier(registry)→NewApplier(registry, opts...). Variadic; the no-options call (Stage 6A test caller) still compiles and behaves identically. Audited:main.go:711— production caller; unchanged in this PR (still passes no opts). 6B-2 will addWithKEK + WithKeystore + WithSidecarPath.internal/encryption/applier_test.go— test callers; existingNewApplier(newMapRegistryStore())calls still work; new tests pass opts.No other callers.
Five-lens self-review
WriteSidecaruses the existing §5.1 crash-durable write protocol.Keystore.Setis in-memory only but the FSM apply path is replayed from Raft on restart. No silent skip: every error propagates through the HaltApply seam.0x03/0x04/0x05dispatches underapplyMu. Keystore is internally locked. Sidecar IO is single-writer per §5.1.ApplyBootstrap: 2 KEK unwraps + 2Keystore.Set+ 1WriteSidecar+ N writer-registry inserts (N = bootstrap-time cluster size).ApplyRotation: 1+1+1+1. Bounded.Test plan
go test -race -timeout=60s ./internal/encryption/...— PASSgo build ./...— PASSgolangci-lint run ./internal/encryption/...— 0 issuesPlan
This is Stage 6B part 1. Stage 6B-2 will:
--kekFile/--kekUriflags--encryption-enabledflag (default off)registerEncryptionAdminServergated on (--encryption-enabledANDKEKConfigured())Summary by CodeRabbit
New Features
Tests