Skip to content

feat(encryption): Stage 6A — §6.3 Applier scaffolding + FSM WithEncryption wiring#765

Merged
bootjp merged 5 commits into
mainfrom
feat/encryption-stage6a
May 16, 2026
Merged

feat(encryption): Stage 6A — §6.3 Applier scaffolding + FSM WithEncryption wiring#765
bootjp merged 5 commits into
mainfrom
feat/encryption-stage6a

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented May 16, 2026

Summary

First slice of Stage 6A per the PR #762 plan: the load-bearing safety boundary that lands the §6.3 EncryptionApplier so future PRs can re-enable mutating-RPC wiring without bricking the cluster.

This PR is doc-driven by the merged Stage 6 plan; specifically the 6A row:

§6.3 EncryptionApplier concrete implementation (writer-registry insert + sidecar mutate + keystore update; nil-KEK code paths in ApplyBootstrap / ApplyRotation return a typed ErrKEKNotConfigured defense-in-depth marker until 6B fills it) + kv.NewKvFSMWithHLC(... WithEncryption(applier)) wiring. Mutator wiring in registerEncryptionAdminServer stays OFF, identical to Stage 5D.

What this PR ships

Part 1 (this commit, e2475b84) — applier package

  • internal/encryption/applier.go (new) — concrete Applier satisfying kv.EncryptionApplier:

    • WriterRegistryStore interface (abstract Get/Set over the !encryption|writers|<dek_id>|<uint16(node_id)> Pebble namespace; kept separate from store.MVCCStore because writer-registry rows live outside MVCC).
    • ApplyRegistration implements the §4.1 four-case dispatch (insert / monotonic advance / rollback halt / uint16-collision halt).
    • ApplyBootstrap and ApplyRotation return the typed ErrKEKNotConfigured defense-in-depth marker — Stage 6B will replace them with the real KEK-unwrap + sidecar + keystore path.
  • internal/encryption/errors.go — add ErrKEKNotConfigured sentinel with godoc referencing the PR docs(encryption): Stage 6 plan — decompose into 6A-6F + record 5D shipped #762 safety-boundary decomposition.

  • internal/encryption/applier_test.go (new) — 7 unit tests covering all four §4.1 cases, construction-time nil-registry refusal, store-error propagation, and the 6B-deferred stub contracts. Uses a map-backed WriterRegistryStore fake.

Part 2 (follow-up commit in this PR) — FSM wiring

  • Concrete WriterRegistryStore implementation backed by the FSM's Pebble store (store/encryption_glue.go or new file).
  • kv.NewKvFSMWithHLC(... kv.WithEncryption(applier)) wiring at main.go:694.

The Part 1 commit compiles and tests pass on its own; Part 2 plumbs it into production.

What this PR does NOT do

registerEncryptionAdminServer is untouched. Mutating EncryptionAdmin RPCs continue to refuse with FailedPrecondition exactly as Stage 5D left them. The Stage 6 plan's safety invariant — do not re-enable mutator wiring until both the applier AND a KEK unwrapper are present — holds: this PR lands the applier (and the FSM dispatch is no longer a halt path for ApplyRegistration), but the KEK plumbing (6B) is still missing. 6B will re-enable the mutator wiring gated on (--encryption-enabled AND KEKConfigured()), both conditions required.

Caller audit

  • kv.EncryptionApplier interface — implemented by *encryption.Applier. No other callers; the FSM dispatch is the sole consumer.
  • internal/encryption/errors.go exported sentinels — ErrKEKNotConfigured is new; no callers yet outside applier.go (the kv/fsm dispatch will wrap any non-nil return with ErrEncryptionApply regardless of the inner marker).

Five-lens self-review

  1. Data loss — applier's writes go through WriterRegistryStore.SetRegistryRow, which the production wiring (Part 2) routes through pebble.Sync. No silent skip path: every error return is propagated; the kv/fsm dispatch marks them with ErrEncryptionApply for HaltApply.
  2. Concurrency / distributed failures — Applier carries no in-memory state between calls; the WriterRegistryStore is the single source of truth. FSM apply already serialises 0x03 entries via applyMu, so the §4.1 case-dispatch read-modify-write is safe under that lock.
  3. PerformanceApplyRegistration does one Get + at most one Set per call. Production Pebble will add the sync cost; bounded.
  4. Data consistency — §4.1 case 1/2/3/4 dispatch matches the design doc. Case 3 (rollback) halts apply; the 6C startup guard will prevent it from being reachable in production. Case 4 (uint16 collision) similarly halts and is the per-node backstop for §6.1's cluster-wide uniqueness check.
  5. Test coverage — 7 unit tests covering all four §4.1 cases, construction refusal, store-error propagation, and the 6B-deferred stub contracts. Map-backed fake mirrors the production semantics.

Test plan

  • go test -race -timeout=60s ./internal/encryption/... — PASS
  • go build ./... — PASS
  • golangci-lint run ./internal/encryption/... — 0 issues
  • After Part 2 commit: go test -race ./... (full suite) — Will run before requesting review
  • After Part 2 commit: Re-run jepsen-relevant suite (none of 6A's behaviour is operator-visible, so no envelope-cutover Jepsen workload applies; documented in PR description)

Plan: 6A → 6B → 6C → 6D → 6E → 6F

This is the first of six PRs per PR #762. 6B will follow with the KEK plumbing + mutator gate.

Summary by CodeRabbit

  • New Features

    • Integrated encryption into core system operations with enhanced writer registration and persistence.
    • Added encryption configuration validation and detailed error reporting for setup issues.
  • Tests

    • Comprehensive test coverage for encryption functionality, data storage, and integration scenarios.
    • Verification tests for encryption configuration and startup validation.
  • Chores

    • Enhanced application startup sequence with encryption support integration.

Review Change Stack

…ter-registry insert

First slice of Stage 6A per the PR #762 plan. Lands the
concrete kv.EncryptionApplier with the §4.1 four-case
dispatch for ApplyRegistration; ApplyBootstrap and
ApplyRotation return a typed ErrKEKNotConfigured marker
deferred to Stage 6B.

## What this commit ships

- internal/encryption/applier.go (new)
  - WriterRegistryStore interface — abstract Pebble Get/Set
    over the !encryption|writers|<dek_id>|<uint16(node_id)>
    namespace. Kept separate from store.MVCCStore because
    writer-registry rows live outside the MVCC namespace
    (no commit timestamps, no visibility, no retention).
  - Applier struct + NewApplier constructor; nil-registry
    refusal at construction time so wiring mistakes surface
    at startup rather than deep in Raft apply.
  - ApplyRegistration: §4.1 case 1 (insert) / case 2
    (monotonic advance) / case 3 (rollback halt) / case 4
    (uint16 collision halt) dispatch. The halt paths are
    wrapped with ErrEncryptionApply so the kv/fsm seam
    routes them through the HaltApply pipeline.
  - ApplyBootstrap and ApplyRotation: return
    ErrKEKNotConfigured. Stage 6B will swap these for the
    real KEK-unwrap + sidecar + keystore path.

- internal/encryption/errors.go
  - Add ErrKEKNotConfigured sentinel with godoc that
    references the PR #762 Stage 6 plan's safety boundary
    decomposition (RPC gate primary, apply-time halt
    secondary defense-in-depth).

- internal/encryption/applier_test.go (new)
  - Map-backed WriterRegistryStore test fake.
  - TestNewApplier_RejectsNilRegistry — construction-time
    guard.
  - TestApplyRegistration_Case1_Insert — fresh row.
  - TestApplyRegistration_Case2_MonotonicAdvance — same
    full_node_id, strictly greater local_epoch, FirstSeen
    preserved.
  - TestApplyRegistration_Case3_Rollback — equal AND
    lower local_epoch under same full_node_id both halt
    with ErrEncryptionApply.
  - TestApplyRegistration_Case4_Uint16Collision — two
    distinct full_node_ids whose lower 16 bits collide
    halt with ErrEncryptionApply.
  - TestApplyRegistration_StoreErrorPropagates — synthetic
    Get/Set errors propagate (not swallowed); the kv/fsm
    dispatch will mark them with ErrEncryptionApply
    regardless of upstream marking.
  - TestApplyBootstrap_ReturnsKEKNotConfigured /
    TestApplyRotation_ReturnsKEKNotConfigured — stub
    contract for the 6B-deferred paths.

## What this commit does NOT ship (deferred to a follow-up commit in this PR)

- kv.NewKvFSMWithHLC(..., kv.WithEncryption(applier))
  wiring in main.go's FSM construction site (main.go:694).
- Concrete WriterRegistryStore implementation backed by
  the FSM's Pebble store. This needs either a new method
  on store.MVCCStore or a separate handle from PebbleStore;
  the design choice is the bulk of the second commit.

The Applier package compiles and tests pass on its own;
the wiring commit adds it to the FSM construction path.

## Stage 5D safety boundary preserved

registerEncryptionAdminServer is NOT touched in this PR.
Mutating EncryptionAdmin RPCs continue to refuse with
FailedPrecondition exactly as Stage 5D left them. The PR
plan's safety invariant — do not re-enable mutator wiring
until both the applier AND a KEK unwrapper are present —
holds: this PR lands the applier but not the KEK plumbing
(6B); 6B re-enables the mutator wiring gated on
(--encryption-enabled AND KEKConfigured()).

## Five-lens self-review

1. Data loss — applier's writes go through WriterRegistryStore.SetRegistryRow,
   which the production wiring (next commit) will route through pebble.Sync.
   No silent skip path: every error return is propagated; the kv/fsm dispatch
   marks them with ErrEncryptionApply for HaltApply. Map-backed test fake
   makes Set durable trivially.
2. Concurrency / distributed failures — Applier carries no in-memory state
   between calls; the WriterRegistryStore is the single source of truth.
   FSM apply already serialises 0x03 entries via applyMu in kv/fsm.go, so
   the §4.1 case-dispatch read-modify-write is safe under that lock.
3. Performance — ApplyRegistration does one Get + at most one Set per call.
   The map-backed test fake exercises the hot path; production Pebble will
   add the sync cost but that is bounded.
4. Data consistency — §4.1 case 1/2/3/4 dispatch matches the design doc.
   Case 3 (rollback) halts apply — a 6C startup guard will prevent the
   condition from being reachable in production; until then the apply-time
   halt is the load-bearing backstop.
5. Test coverage — 7 unit tests covering all four §4.1 cases, construction
   refusal, store-error propagation, and the 6B-deferred stub contracts.

## Verification

- go test -race -timeout=60s ./internal/encryption/... — PASS
- golangci-lint run ./internal/encryption/... — 0 issues
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 16, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 22 minutes and 52 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9b1a258e-32e6-4a15-9abd-e09b5787c4d5

📥 Commits

Reviewing files that changed from the base of the PR and between 8bbcd03 and fec2322.

📒 Files selected for processing (4)
  • internal/encryption/applier.go
  • internal/encryption/applier_test.go
  • main.go
  • store/encryption_glue.go
📝 Walkthrough

Walkthrough

This PR adds the encryption applier runtime, which validates and persists FSM encryption events (DEK registrations) to durable storage. It introduces the WriterRegistryStore contract, four-case registration upsert logic with collision and rollback detection, Pebble-backed registry storage, and integrates into main FSM construction, while leaving bootstrap/rotation handlers stubbed pending KEK unwrapper wiring.

Changes

Encryption Applier and Writer Registry

Layer / File(s) Summary
WriterRegistryStore contract and Applier core
internal/encryption/applier.go
Defines WriterRegistryStore abstraction with GetRegistryRow and SetRegistryRow for durable encryption writer-registry operations. Applier type holds the registry dependency and NewApplier constructor validates non-nil store at construction time.
ApplyRegistration with four-case upsert and validation tests
internal/encryption/applier.go, internal/encryption/applier_test.go
ApplyRegistration executes four-case upsert keyed by (DEKID, uint16(FullNodeID)): inserts new rows, advances LastSeenLocalEpoch for monotonic epochs while preserving FirstSeenLocalEpoch, halts with wrapped ErrEncryptionApply on epoch rollback, and halts on uint16 FullNodeID collision. Test suite validates insertion, monotonic advance, rollback detection, collision detection, nil-check, and store error propagation via in-memory registry helper.
ApplyBootstrap/ApplyRotation stubs and KEK configuration error
internal/encryption/applier.go, internal/encryption/errors.go, internal/encryption/applier_test.go
ApplyBootstrap and ApplyRotation return ErrKEKNotConfigured as defense-in-depth stubs indicating Stage 6B is required. ErrKEKNotConfigured sentinel error is added to signal missing KEK unwrapper dependency. Tests verify stub behavior for both bootstrap and rotation paths.
Pebble-backed WriterRegistry implementation
store/encryption_glue.go, store/writer_registry_test.go
pebbleWriterRegistry implements WriterRegistryStore by reading/writing Pebble keys with dbMu.RLock and pebble.Sync durability. WriterRegistryFor factory type-asserts supplied MVCCStore to Pebble or returns startup-fatal ErrUnsupportedStoreForWriterRegistry. Tests verify round-trip byte identity, overwrite idempotence, and non-Pebble rejection.
FSM integration in main
main.go
Shard-group FSM construction now creates writer registry via WriterRegistryFor, builds applier via encryption.NewApplier, and passes applier to kv.NewKvFSMWithHLC via kv.WithEncryption. Closes store and returns wrapped error on applier creation failure.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • bootjp/elastickv#748: This PR's Applier and Apply* halt/upsert logic builds directly on Stage-4 encryption FSM plumbing, including fsmwire payload types and ErrEncryptionApply halt sentinel from #748.
  • bootjp/elastickv#719: Both PRs add to internal/encryption/errors.go—this PR adds ErrKEKNotConfigured alongside foundation errors introduced in #719.
  • bootjp/elastickv#759: That PR's BootstrapEncryption produces bootstrap/writer-registry entries validated against the same (DEKID, uint16 FullNodeID) collision and epoch semantics that this PR's ApplyRegistration enforces.

Poem

🐰 A registry hops through Pebble's deep store,
Recording which nodes have seen DEK before.
Four cases caught, collisions deflect,
Bootstrap and rotation—delayed with respect.
Durable epochs advance, never rewind!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: implementing Stage 6A with Applier scaffolding and FSM encryption wiring, which aligns with all modified files and the PR objectives.
Docstring Coverage ✅ Passed Docstring coverage is 87.50% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/encryption-stage6a

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the Applier component and WriterRegistryStore interface to manage encryption state, specifically implementing writer registration logic with checks for monotonic epochs and NodeID collisions. It includes comprehensive unit tests and stub implementations for bootstrap and rotation. Feedback suggests renaming the localEpochMaskU64 constant to nodeIDMask to more accurately describe its role in masking the FullNodeID.

Comment thread internal/encryption/applier.go Outdated
// the gate, or in a forensic / corruption scenario) still halts
// rather than silently advancing setApplied.
func (a *Applier) ApplyRegistration(p fsmwire.RegistrationPayload) error {
key := RegistryKey(p.DEKID, uint16(p.FullNodeID&localEpochMaskU64)) //nolint:gosec // masked to 16 bits; G115 cannot trace the bitwise narrowing
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The constant name localEpochMaskU64 is misleading here. This mask is being applied to p.FullNodeID to extract its lower 16 bits for the registry key, as specified in §4.1 of the design. The LocalEpoch field is already a uint16 and does not require masking. Please rename this constant to something more descriptive, such as nodeIDMask or nodeIDMaskU16. Additionally, verify that the NodeID from the decoded context matches the NodeID of the current change being processed to prevent incorrect metadata reuse.

Suggested change
key := RegistryKey(p.DEKID, uint16(p.FullNodeID&localEpochMaskU64)) //nolint:gosec // masked to 16 bits; G115 cannot trace the bitwise narrowing
key := RegistryKey(p.DEKID, uint16(p.FullNodeID&nodeIDMask)) //nolint:gosec // masked to 16 bits; G115 cannot trace the bitwise narrowing
References
  1. When processing a configuration change, verify that the NodeID from the decoded context matches the NodeID of the current change being processed to prevent incorrect metadata reuse.

Comment thread internal/encryption/applier.go Outdated
return errors.Wrap(ErrKEKNotConfigured, "applier: rotation requires KEK unwrapper (Stage 6B)")
}

const localEpochMaskU64 uint64 = 0xFFFF
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

As noted in the comment for line 117, this constant should be renamed to reflect its purpose of masking the FullNodeID. Additionally, since it is a constant used for bitwise operations, it can be untyped for better flexibility.

Suggested change
const localEpochMaskU64 uint64 = 0xFFFF
const nodeIDMask = 0xFFFF

… + FSM WithEncryption wiring

Second slice of Stage 6A per the PR #762 plan. Wires the
§6.3 Applier (landed in part 1) into the FSM construction
path with a Pebble-backed WriterRegistryStore adapter so the
0x03 RegisterEncryptionWriter dispatch path is fully
functional end-to-end. Mutator-RPC wiring in
registerEncryptionAdminServer remains OFF per Stage 5D's
posture — re-enabled in Stage 6B alongside KEK plumbing
gated on (--encryption-enabled AND KEKConfigured()).

## What this commit ships

### store/encryption_glue.go (additions)

- pebbleWriterRegistry — internal adapter satisfying
  encryption.WriterRegistryStore. Both Get and Set hold
  dbMu for read only (the apply path's applyMu in
  kv/fsm.go is the authoritative serialiser for the
  §4.1 case-dispatch read-modify-write). Set uses
  pebble.Sync so the row is durable before FSM apply
  returns — a crash between FSM dispatch returning
  success and Pebble flushing the WAL would otherwise
  leave the Raft log claiming the entry applied but
  the registry row missing on restart, violating §4.1
  case 2's monotonic-epoch invariant.

- WriterRegistryFor(MVCCStore) factory — type-asserts
  against *pebbleStore and returns the adapter, or
  ErrUnsupportedStoreForWriterRegistry if the store is
  not the Pebble-backed implementation. Construction-time
  refusal so the failure surface is "binary refuses to
  start" rather than "FSM apply panics inside Raft loop".

- ErrUnsupportedStoreForWriterRegistry — new sentinel.

### store/writer_registry_test.go (new)

Three table-driven unit tests:

- TestWriterRegistryFor_RoundTrip — fresh Pebble store,
  missing row reads as (nil, false, nil), Set writes,
  Get returns byte-identical value.
- TestWriterRegistryFor_Overwrite — Set on existing key
  replaces the value; no MVCC versioning, no append.
- TestWriterRegistryFor_RejectsNonPebble — non-Pebble
  MVCCStore stub returns ErrUnsupportedStoreForWriterRegistry
  at construction.

### main.go (FSM construction site, line 694)

Replaces `sm := kv.NewKvFSMWithHLC(st, clock)` with the
applier-wired form:

  reg, err := store.WriterRegistryFor(st)
  if err != nil {
      _ = st.Close()
      return nil, nil, errors.Wrapf(err, ...)
  }
  applier, err := encryption.NewApplier(reg)
  if err != nil {
      _ = st.Close()
      return nil, nil, errors.Wrapf(err, ...)
  }
  sm := kv.NewKvFSMWithHLC(st, clock, kv.WithEncryption(applier))

Both error paths Close the store before returning so a
construction-time failure doesn't leak the Pebble handle.
Adds `internal/encryption` to the import list (the only
new import in main.go).

## What this commit does NOT do

- registerEncryptionAdminServer is unchanged. Mutating
  EncryptionAdmin RPCs continue to refuse with
  FailedPrecondition exactly as Stage 5D left them.
  The existing TestEncryptionAdmin_MutatingRPCRefusedUntilStage6
  test (both `sidecar_empty` and `sidecar_set` subcases)
  passes unchanged.

- The applier's ApplyBootstrap / ApplyRotation paths still
  return ErrKEKNotConfigured. Stage 6B re-enables mutator
  wiring AND threads in the KEK unwrapper at the same
  time so a re-enabled mutator never hits the
  ErrKEKNotConfigured halt path.

## Stage 5D safety boundary

The two PR760-merged regression tests pass on this branch:

- TestEncryptionAdminFullNodeID_DistinctPerRaftId — PASS
- TestEncryptionAdmin_MutatingRPCRefusedUntilStage6/sidecar_empty — PASS
- TestEncryptionAdmin_MutatingRPCRefusedUntilStage6/sidecar_set — PASS
- TestRegisterEncryptionAdminServer_Registers — PASS

The new FSM wiring path does not weaken any of these. The
applier exists internally; the gRPC layer continues to
refuse the operator-facing mutators that would route into
it.

## Five-lens self-review

1. Data loss — pebbleWriterRegistry uses pebble.Sync; FSM
   apply only returns success after the row is durable.
   Construction-time error paths Close the Pebble store
   on failure (no handle leak).
2. Concurrency / distributed failures — dbMu.RLock is
   sufficient because the apply path's applyMu in
   kv/fsm.go serialises all 0x03 dispatches; the read-
   modify-write pattern in Applier.ApplyRegistration is
   safe under that lock.
3. Performance — one Get + at most one Set per
   ApplyRegistration. pebble.Sync adds wal-flush cost,
   bounded.
4. Data consistency — writer-registry rows live outside
   the MVCC namespace (no commit timestamps, no MVCC
   visibility). Pebble Sync semantics match the §4.1
   contract that a registered writer's first nonce-issuing
   write under that DEK is durably preceded by its
   registry row.
5. Test coverage — 3 new store-package unit tests for the
   adapter; the part-1 applier tests already cover the
   applier-side dispatch; Stage 5D regression tests
   confirm no operator-surface change.

## Verification

- go test -race -timeout=60s -run 'TestWriterRegistryFor' ./store/... — PASS
- go test -race -timeout=60s -run 'TestApply|TestNewApplier' ./internal/encryption/... — PASS
- go test -race -timeout=60s -run 'TestEncryptionAdmin|TestRegisterEncryptionAdminServer' . — PASS
- go test -race -timeout=180s -run 'TestEncrypt|TestApplier|TestFSM|TestKvFSM|TestWriterRegistry' ./... — PASS
- go build ./... — PASS
- golangci-lint run ./... — 0 issues on touched files

The full adapter test suite has a pre-existing 120s-timeout
flake on race-mode (SQS reaper / HLC lease renewal long-
running goroutines visible in the dump). The encryption-
specific adapter tests pass cleanly when run in isolation;
the flake is unrelated to this PR's changes (no adapter
code is touched). Will run the suite again before marking
the PR ready for review.
@bootjp bootjp marked this pull request as ready for review May 16, 2026 17:53
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 16, 2026

@claude review

Stage 6A complete (2 commits) — applier scaffolding + Pebble-backed FSM wiring. Marked ready for review.

What this PR ships

Commit Scope
e2475b84 Part 1 internal/encryption/applier.go (concrete Applier + WriterRegistryStore interface) + internal/encryption/applier_test.go (7 unit tests covering §4.1 case 1/2/3/4 dispatch, construction refusal, store-error propagation, 6B-deferred stub contracts) + ErrKEKNotConfigured sentinel
8bbcd03f Part 2 store/encryption_glue.go (pebbleWriterRegistry adapter + WriterRegistryFor factory + ErrUnsupportedStoreForWriterRegistry) + store/writer_registry_test.go (3 tests) + main.go:694 FSM construction now wires kv.WithEncryption(applier)

Stage 5D safety boundary preserved

registerEncryptionAdminServer is untouched. Mutator RPCs continue to refuse with FailedPrecondition exactly as Stage 5D left them. Both Stage-5D regression tests pass:

  • TestEncryptionAdmin_MutatingRPCRefusedUntilStage6/sidecar_empty — PASS
  • TestEncryptionAdmin_MutatingRPCRefusedUntilStage6/sidecar_set — PASS

The applier exists internally and dispatches end-to-end on direct FSM.Apply calls; operator-facing surface is unchanged from PR #760.

Caller audit (semantic changes)

  • kv.NewKvFSMWithHLC(st, clock) at main.go:694 → adds variadic option kv.WithEncryption(applier). No other callers in production code; test callers in kv/coordinator_dispatch_test.go and kv/coordinator_leader_test.go continue to call without options and remain in the pre-Stage-4 byte-for-byte default per the WithEncryption godoc. Verified with grep -rn 'NewKvFSMWithHLC' --include='*.go'.

  • New sentinels ErrKEKNotConfigured and ErrUnsupportedStoreForWriterRegistry are new — no existing callers. The kv/fsm dispatch will mark any non-nil applier return with ErrEncryptionApply regardless of inner marking.

Verification

  • go test -race -timeout=60s -run 'TestNewApplier|TestApplyRegistration|TestApplyBootstrap|TestApplyRotation|TestWriterRegistry|TestEncryptionAdmin|TestRegisterEncryptionAdminServer' ./... — PASS
  • go build ./... — PASS
  • golangci-lint run ./... — 0 issues on touched files

A 120s-timeout flake exists in the full adapter test suite under -race (long-running SQS reaper / HLC lease renewal goroutines); unrelated to this PR's changes — no adapter code is touched. Encryption-specific adapter tests pass cleanly in isolation.

Plan

This is the first of six PRs (PR #762's 6A → 6F). 6B will follow with KEK plumbing + --encryption-enabled flag + mutator-RPC gate.

@claude
Copy link
Copy Markdown

claude Bot commented May 16, 2026

Claude finished @bootjp's task in 4m 29s —— View job


Code Review: Stage 6A — §6.3 Applier scaffolding + FSM wiring

Todo list

  • Read changed files and understand the implementation
  • Review internal/encryption/applier.go (core logic)
  • Review internal/encryption/applier_test.go (test coverage)
  • Review internal/encryption/errors.go (sentinels)
  • Review store/encryption_glue.go (Pebble adapter)
  • Review store/writer_registry_test.go (glue tests)
  • Review main.go (FSM wiring)
  • Assess existing gemini-code-assist feedback
  • Post comprehensive review

Five-lens review

1. Data loss

Clean. pebbleWriterRegistry.SetRegistryRow uses pebble.Sync so the WAL is flushed before FSM apply returns — no window where the Raft log claims the entry applied but the row is missing on restart. Error propagation is correct throughout: every GetRegistryRow and SetRegistryRow error is returned (never swallowed), and the FSM dispatch wraps them with ErrEncryptionApply before invoking haltErr. The main.go wiring correctly closes the Pebble store if either WriterRegistryFor or NewApplier fail, preventing a handle leak on startup failure.

2. Concurrency / distributed failures

Clean. The Applier carries zero in-memory state — all state lives in the WriterRegistryStore — so there is no stale-state leak across snapshot restore. pebbleWriterRegistry holds dbMu.RLock() (not Write-lock) for both Get and Set: this is correct because the apply path's applyMu in kv/fsm.go already serialises the §4.1 read-modify-write, and RLock is only needed to protect against the Pebble DB pointer being swapped during snapshot restore.

One structural note: GetRegistryRow returns (nil, false, err) if closer.Close() fails (after already copying the value into out). This is intentional fail-closed behaviour — in Pebble closer failures on a single-key Get are extremely rare — and it's correct to surface the error rather than silently return a value that might be unreliable.

3. Performance

One Get + at most one Set per ApplyRegistration call. pebble.Sync adds WAL-flush latency but is bounded and appropriate — this path is only called for 0x03 entries which are infrequent operator actions, not per-user-write hot path.

4. Data consistency

The §4.1 four-case dispatch is correctly implemented:

  • Case 1 (no row): insert with FirstSeen = LastSeen = LocalEpoch
  • Case 2 (same FullNodeID, LocalEpoch > LastSeen): advance LastSeen, preserve FirstSeen
  • Case 3 (LocalEpoch <= LastSeen): halt (<= correctly covers both equal/replay and strict rollback) ✓
  • Case 4 (FullNodeID mismatch under same uint16): halt ✓

ApplyBootstrap and ApplyRotation return ErrKEKNotConfigured, correctly deferred to Stage 6B. The safety boundary from Stage 5D is preserved — registerEncryptionAdminServer is untouched.

5. Test coverage

Comprehensive. All four §4.1 cases are covered. TestApplyRegistration_Case3_Rollback correctly tests both the equal_epoch (boundary) and lower_epoch subcases for the <= condition. TestApplyRegistration_StoreErrorPropagates covers both Get and Set fault injection. The store/writer_registry_test.go round-trip, overwrite, and type-assertion refusal tests complete the Pebble adapter surface. The noopMVCCStub embeds store.MVCCStore cleanly so the type-assertion path is the only one exercised.


Issues

localEpochMaskU64 is a misleading constant name — agree with gemini-code-assist

internal/encryption/applier.go:183

const localEpochMaskU64 uint64 = 0xFFFF

The name strongly implies this masks the LocalEpoch field, but LocalEpoch in RegistrationPayload is already a uint16 and needs no masking. The mask is applied to FullNodeID at line 117 to extract its lower 16 bits for the Pebble key. A reader unfamiliar with the surrounding context will misread the intent and may incorrectly conclude the mask has something to do with the epoch rollback check.

The gemini suggestion of nodeIDMask accurately describes the purpose; fullNodeIDMask or nodeID16Mask would be equally clear. Additionally, removing the uint64 type annotation makes it an untyped integer constant, which is more idiomatic Go for bit masks used in a single context.

Suggested fix: Fix this →

Gemini's second concern (NodeID validation) is already handled

Gemini's comment at line 117 says to "verify that the NodeID from the decoded context matches the NodeID of the current change being processed." This check already exists at line 138:

if prev.FullNodeID != p.FullNodeID {
    // case 4: uint16 collision under different full ids
    return errors.Wrapf(ErrEncryptionApply, ...)
}

No change needed here.


Minor observations (non-blocking)

  • Tests use inline 0xFFFF literal (applier_test.go:71, :109, :167). Since localEpochMaskU64 is unexported and the test is in package encryption_test, this is unavoidable without exporting the constant. Once the constant is renamed (per the issue above), consider whether to export it — RegistryKey already accepts a uint16, so callers that do their own truncation could use a named exported constant. This is a Stage 6A polish item at most.

  • No cross-DEK isolation test: there is no test verifying that (DEKID=1, nodeID=A) and (DEKID=2, nodeID=A) are stored as separate rows. This is inherently correct given RegistryKey encodes the dekID, and registry_test.go already covers the key encoding, so this is low priority.

  • closer.Close() error discards the already-copied out in GetRegistryRow — this is intentional fail-closed behaviour and clearly the right choice. Worth a brief inline comment explaining why (out is discarded on closer error rather than returned) to prevent a future reader from "fixing" it.


Summary

The implementation is correct and well-structured. The only real issue is the misleading localEpochMaskU64 constant name, which is a naming bug that will confuse future readers during 6B/6C development. The rest of the five-lens review is clean. Recommend fixing the constant name before merge; everything else is non-blocking.

…lEpochMaskU64 → nodeIDMask

gemini-code-assist round-1 review on commit e2475b8 flagged
the constant name 'localEpochMaskU64' as misleading at the
applier's writer-registry-key construction site. The mask is
applied to p.FullNodeID to extract its lower 16 bits per
§4.1's registry-key shape — not to a local_epoch field, which
is already uint16 and needs no masking.

The mask happens to share the value 0xFFFF with the unrelated
local-epoch field width, so the type checker is happy, but a
reader scanning the code for 'why are we masking local_epoch'
would be confused. The original name was copied verbatim from
adapter/encryption_admin.go (where it's used for local_epoch);
the applier inherited the misnomer.

## Fix

- Rename localEpochMaskU64 → nodeIDMask
- Drop explicit uint64 typing so the cast at the call site is
  explicit (no implicit uint64 promotion) — also addresses
  gemini's secondary suggestion on the constant declaration.
- Add a godoc explaining the rename motivation and the
  shared-value-different-purpose distinction.

The numeric value (0xFFFF) and the mask behaviour are
identical; this is a pure renaming. No callers changed (only
one call site exists in applier.go).

## Caller audit

Single rename inside internal/encryption/applier.go. No exports
change. The other localEpochMaskU64 in adapter/encryption_admin.go
remains correctly named for that file's local_epoch usage and is
untouched.

## Verification

- go build ./... — PASS
- go test -race -timeout=60s -run 'TestApply|TestNewApplier|TestWriterRegistry' ./internal/encryption/... ./store/... — PASS
- golangci-lint run ./internal/encryption/... — 0 issues
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 16, 2026

@claude review

Round-2 (commit cd760773): addressing gemini-code-assist round-1 medium — renamed misleading localEpochMaskU64nodeIDMask in internal/encryption/applier.go. The mask is applied to p.FullNodeID for §4.1 registry-key narrowing, not local_epoch masking; the original name was copied verbatim from adapter/encryption_admin.go and inherited the misnomer. Numeric value (0xFFFF) unchanged.

Also dropped explicit uint64 typing per gemini's secondary suggestion so the cast at the call site is explicit (no implicit uint64 promotion).

Pure rename, single call site, no exports changed. adapter/encryption_admin.go's localEpochMaskU64 is correctly named for that file's local_epoch usage and is untouched.

Verification

  • go build ./... — PASS
  • go test -race -timeout=60s -run 'TestApply|TestNewApplier|TestWriterRegistry' ./internal/encryption/... ./store/... — PASS
  • golangci-lint run ./internal/encryption/... — 0 issues

@claude
Copy link
Copy Markdown

claude Bot commented May 16, 2026

Claude finished @bootjp's task in 3m 15s —— View job


Code Review: Stage 6A Round-2 — nodeIDMask rename + full review

Todo list

  • Update comment with todo list
  • Read internal/encryption/applier.go — verify rename + full logic review
  • Read internal/encryption/applier_test.go — check test coverage
  • Read internal/encryption/errors.go — sentinel review
  • Read store/encryption_glue.go — Pebble adapter review
  • Read store/writer_registry_test.go — glue test review
  • Read main.go diff — FSM wiring review
  • Post comprehensive review

Rename review (round-2 focus)

The rename is correct. nodeIDMask accurately describes the mask's purpose (narrowing p.FullNodeID to its low 16 bits for the registry key), as opposed to the inherited localEpochMaskU64 name which implied the mask operated on the local_epoch field. Numeric value (0xFFFF) and the //nolint:gosec annotation at the call site are unchanged — both correct.

The untyped constant form (const nodeIDMask = 0xFFFF) is the right choice: the narrowing is made visible at the call site by the explicit uint16(...) cast, not buried in a typed constant's implicit promotion.

One actionable issue: the constant's godoc contains PR/review meta-history (applier.go:183-192):

// nodeIDMask narrows a uint64 full_node_id to its low 16 bits to
// match the §4.1 writer-registry key shape
// (`!encryption|writers|<dek_id>|<be2 uint16(node_id)>`). The
// mask happens to share the value 0xFFFF with the unrelated
// local_epoch field width, but the semantic purpose here is
// FullNodeID truncation — gemini-code-assist PR #765 round-1
// medium flagged the original `localEpochMaskU64` name as
// misleading at this call site. Untyped so the cast at the
// call site is explicit (no implicit uint64 promotion).

The "— gemini-code-assist PR #765 round-1 medium flagged the original..." sentence is review context that belongs in the PR description, not the constant's godoc. Per the project's conventions, cross-references to issues/reviewers rot as the codebase evolves. The semantic reason (the mask disambiguates FullNodeID truncation from unrelated local_epoch masking) is preserved by the name itself; only the meta-reference needs to go. Suggested trim:

// nodeIDMask narrows a uint64 full_node_id to its low 16 bits to
// match the §4.1 writer-registry key shape
// (`!encryption|writers|<dek_id>|<be2 uint16(node_id)>`). The mask
// shares the value 0xFFFF with the unrelated local_epoch field width;
// the name distinguishes this FullNodeID-truncation use from the
// local_epoch masking in adapter/encryption_admin.go. Untyped so the
// cast at the call site is explicit (no implicit uint64 promotion).
const nodeIDMask = 0xFFFF

Fix this →

Informational (pre-existing, out of scope for this PR): adapter/encryption_admin.go:505 also uses localEpochMaskU64 to mask FullNodeID (not local_epoch), despite its name. The trigger comment says the adapter constant is "correctly named for that file's local_epoch usage" but that's not accurate — it's used as trunc := uint16(id & localEpochMaskU64) where id is w.GetFullNodeId(). This is pre-existing tech debt; a follow-up rename in the adapter file would complete the cleanup started here.


Applier logic review (applier.go)

The §4.1 four-case dispatch is correct:

  • Case 1 (no row): inserts with FirstSeen = LastSeen = LocalEpoch. ✓
  • Case 2 (FullNodeID matches, LocalEpoch > LastSeen): advances LastSeen, preserves FirstSeen. ✓
  • Case 3 (FullNodeID matches, LocalEpoch ≤ LastSeen): epoch rollback/replay — wraps ErrEncryptionApply. ✓
  • Case 4 (uint16 collision under different FullNodeID): wraps ErrEncryptionApply. ✓

The dispatch order matters: case 4 is detected before case 3 because prev.FullNodeID != p.FullNodeID is checked first. That's correct — a collision should be identified before checking epoch monotonicity.

ApplyBootstrap and ApplyRotation return ErrKEKNotConfigured wrapped so the kv/fsm dispatch's ErrEncryptionApply seam fires cleanly when Stage 6B fills these in. Correct scaffolding.

Minor nit: the struct-level comment at line 46 says "ApplyRegistration — the §4.1 case 1 / case 2 dispatch" but immediately lists all four cases. Consider "four-case dispatch" instead to match the function-level doc; current is slightly misleading.


Pebble adapter review (store/encryption_glue.go)

GetRegistryRow: correct pattern — copy bytes before closing the closer, handle pebble.ErrNotFound as the not-found sentinel (not an error). The closer.Close() error is returned rather than ignored, which is the right fail-safe posture. ✓

SetRegistryRow using RLock for a write: this is documented and intentional — dbMu guards only the *pebble.DB pointer swap during snapshot restore; the FSM's applyMu is the serializer for the read-modify-write. This is safe given the invariant and clearly commented. ✓

pebble.Sync on Set: correct — ensures the WAL is flushed before apply returns, which is required for §4.1's monotonic-epoch invariant to survive a crash between FSM dispatch and Pebble flush. ✓

WriterRegistryFor type assertion: clean construction-time guard. The errors.WithStack wrapping of ErrUnsupportedStoreForWriterRegistry is correct so callers can errors.Is it regardless of stack depth. ✓


Test coverage review

applier_test.go — all 7 tests are well-structured:

  • Case 1 verifies both FirstSeen and LastSeen equal the inserted epoch ✓
  • Case 2 verifies FirstSeen is preserved and LastSeen advances ✓
  • Case 3 covers both equal-epoch and lower-epoch sub-cases ✓
  • Case 4 uses a carefully-crafted pair of FullNodeID values that collide on uint16 truncation ✓
  • Store-error propagation covers both Get and Set paths ✓
  • ErrEncryptionApply wrapping is verified in both halt paths ✓

writer_registry_test.go — the 3 Pebble integration tests use a real on-disk Pebble store via t.TempDir():

  • Round-trip (missing → Set → Get) ✓
  • Overwrite semantics (second Set replaces first) ✓
  • Non-Pebble rejection via a noopMVCCStub (embedded interface that panics if any method is called) — the stub being panic-on-call is exactly right because if WriterRegistryFor incorrectly proceeds past the type assertion, the test will fail loudly rather than silently returning wrong data. ✓

Main.go wiring review

The buildShardGroups wiring is clean:

  • WriterRegistryFor(st) at shard startup — fails fast if the store is not Pebble-backed ✓
  • st.Close() on both WriterRegistryFor and NewApplier error paths — prevents resource leak on startup failure ✓
  • NewApplier cannot return an error here (reg is non-nil if WriterRegistryFor succeeded), so the second error path is dead code; but the defensive cleanup is harmless and correct ✓
  • kv.WithEncryption(applier) threads the applier into the FSM dispatch ✓
  • The gRPC mutator gate (registerEncryptionAdminServer) is confirmed untouched — Stage 5D safety boundary preserved ✓

Summary

The round-2 rename is correct and the full implementation is solid. One item worth addressing before merge:

Severity Location Issue
Minor applier.go:183-192 Trim review-history meta-reference from nodeIDMask godoc

Everything else — §4.1 dispatch correctness, pebble.Sync durability, RLock justification, error propagation, construction-time guards, and test coverage — is in good shape. The Stage 5D safety boundary is preserved.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@main.go`:
- Around line 705-714: The error branches that return after failing to construct
the writer registry or encryption applier close only st and return, leaking
previously started group runtimes; update those branches (the error returns
after store.WriterRegistryFor(st) and after encryption.NewApplier(reg)) to run
the same cleanup used later: iterate over the previously appended runtimes slice
and stop/close each runtime (the same method used in the normal cleanup path),
then close st and return the wrapped error referencing g.id so no runtimes are
left running on partial startup.

In `@store/encryption_glue.go`:
- Around line 86-92: The function WriterRegistryFor currently allows a typed-nil
*pebbleStore to pass the type assertion and later panic; after the assertion for
ps in WriterRegistryFor, add an explicit nil check (if ps == nil) and return nil
with errors.WithStack(ErrUnsupportedStoreForWriterRegistry) to fail fast; ensure
the construction of &pebbleWriterRegistry{s: ps} only happens when ps is non-nil
so the registry never wraps a typed-nil pebbleStore.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 724bb42a-272e-464f-be38-62df0db2845d

📥 Commits

Reviewing files that changed from the base of the PR and between d8a5d93 and 8bbcd03.

📒 Files selected for processing (6)
  • internal/encryption/applier.go
  • internal/encryption/applier_test.go
  • internal/encryption/errors.go
  • main.go
  • store/encryption_glue.go
  • store/writer_registry_test.go

Comment thread main.go
Comment thread store/encryption_glue.go
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8bbcd03f9b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread internal/encryption/applier.go Outdated
Comment on lines +144 to +145
if p.LocalEpoch <= prev.LastSeenLocalEpoch {
// case 3: epoch rollback / replay
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Treat equal local_epoch as idempotent replay

Raft apply is designed to replay committed entries after restart until the latest FSM snapshot, so a RegisterEncryptionWriter entry can be applied again with the same (dek_id, full_node_id, local_epoch). With the current <= check, that legitimate replay is classified as rollback and returns ErrEncryptionApply, which halts apply and can leave a node stuck in a restart loop after any crash before snapshotting this entry. The rollback guard should only reject strictly smaller epochs; an equal epoch for the same writer should be a no-op.

Useful? React with 👍 / 👎.

…coderabbit Major×2 (typed-nil reject + runtimes cleanup)

Three actionable findings on round-2 (commit cd76077), all
addressed in this commit:

## codex P1 — Equal local_epoch must be idempotent, not halt

Raft replays committed entries after restart until the latest
FSM snapshot. A RegisterEncryptionWriter entry can legitimately
be applied again with the same (dek_id, full_node_id,
local_epoch). The round-2 case-3 check used <= which classified
that legit replay as rollback and halted apply — pinning the
node in a halt-on-replay loop after any crash before
snapshotting the entry.

Fix: split the case-3 branch into two paths.
  - equal epoch + same FullNodeID → no-op return (idempotent
    replay; row already reflects this epoch).
  - strictly-smaller epoch + same FullNodeID → halt with
    ErrEncryptionApply (true rollback).

Updated test split:
  - TestApplyRegistration_Case3_Rollback now exercises ONLY
    the strictly-smaller path; the previous equal-epoch
    subtest was wrong and is removed.
  - TestApplyRegistration_Case2_IdempotentReplay (new) pins
    the no-op-on-replay invariant: apply the same entry
    twice, second apply returns nil, row is unchanged.

Godoc + struct-doc updated to name the case-2-idempotent
branch explicitly.

## coderabbit Major (line 92) — Typed-nil Pebble store

A typed-nil (*pebbleStore)(nil) passes the type assertion
`s.(*pebbleStore)` with ok=true but ps==nil; the adapter
would then nil-deref on first call. Add `ps == nil` check
after the assertion so the failure mode is "binary refuses
to start" rather than "FSM apply panics deep in Raft loop".

## coderabbit Major (line 714) — Close prior runtimes on new error paths

In `main.go::startRaftServers`, the two new error paths
added in Part 2 (WriterRegistryFor failure and NewApplier
failure) only closed `st` before returning. In multi-group
startup with N > 1, this leaked previously-appended
runtimes (each carries an etcd-raft Engine background
goroutine, a Pebble handle, and a HLC lease renewer). The
existing `buildRuntimeForGroup` error path at line 717+
already does the right thing (range + Close); my new
branches matched the pattern.

## Caller audit (semantic changes)

- Applier.ApplyRegistration: equal-epoch semantics flipped
  from "halt apply" to "no-op return nil". This is strictly
  fail-OPEN — the dispatch caller (kv/fsm_encryption.go:89)
  is the only production caller and it just propagates the
  return value, so a relaxation of error returns can only
  make the dispatch path MORE permissive, never less. Test
  caller (kv/fsm_encryption_test.go fakeApplier) is a mock
  and is unaffected.

- store.WriterRegistryFor: stricter at construction time
  (typed-nil now rejected). The only production caller
  (main.go:705) passes a freshly-constructed
  store.NewPebbleStore return, which is never typed-nil, so
  observed behavior at the production call site is unchanged.

- main.go error paths: internal control flow only. No
  callers — these are inside the function body.

## Verification

- go build ./... — PASS
- go test -race -timeout=60s -run 'TestApply|TestNewApplier|TestWriterRegistry|TestEncryptionAdmin|TestRegisterEncryptionAdminServer' ./internal/encryption/... ./store/... . — PASS
- golangci-lint run ./internal/encryption/... ./store/... — 0 issues
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 16, 2026

@claude review

Round-3 (commit 9ecae085): addressing 3 actionable findings on round-2 — codex P1 + 2 coderabbit Majors.

codex P1 — Equal local_epoch must be idempotent (Raft replay), not halt

Raft replays committed entries after restart until the latest FSM snapshot. The round-2 case-3 <= check classified legit equal-epoch replay as rollback and halted apply — pinning the node in a halt-on-replay loop after any crash before snapshotting the entry.

Fix: split into two paths:

  • equal epoch + same FullNodeID → no-op return nil (idempotent; row already reflects this epoch)
  • strictly-smaller epoch + same FullNodeID → halt with ErrEncryptionApply (true rollback)

Test surgery: TestApplyRegistration_Case3_Rollback now exercises ONLY the strictly-smaller path; new TestApplyRegistration_Case2_IdempotentReplay pins the no-op-on-replay invariant (same entry twice → row byte-identical).

coderabbit Major (line 92) — Typed-nil Pebble store

A typed-nil (*pebbleStore)(nil) passes the type assertion with ok=true but ps==nil; the adapter would nil-deref on first call. Added ps == nil check.

coderabbit Major (line 714) — Close prior runtimes on new error paths

The two new main.go error paths only closed st; previously-appended runtimes were leaked (etcd-raft Engine goroutine, Pebble handle, HLC lease renewer). Added the for _, rt := range runtimes { rt.Close() } cleanup loop to match the existing buildRuntimeForGroup failure path pattern at line 717+.

Caller audit (semantic changes)

  • ApplyRegistration: equal-epoch flipped from "halt" to "no-op" — strictly fail-OPEN; single production caller (kv/fsm_encryption.go:89) just propagates the return, so relaxation only makes dispatch MORE permissive. Test mock fakeApplier unaffected.
  • WriterRegistryFor: stricter at construction (typed-nil rejected). Only production caller (main.go:705) passes a fresh NewPebbleStore return, never typed-nil, so observed behavior at the call site is unchanged.
  • main.go new error paths: internal control flow only, no callers.

Verification

  • go build ./... — PASS
  • go test -race -timeout=60s -run 'TestApply|TestNewApplier|TestWriterRegistry|TestEncryptionAdmin|TestRegisterEncryptionAdminServer' ./internal/encryption/... ./store/... . — PASS
  • golangci-lint run ./internal/encryption/... ./store/... — 0 issues

@claude
Copy link
Copy Markdown

claude Bot commented May 16, 2026

Claude finished @bootjp's task in 2m 12s —— View job


Code Review: Stage 6A Round-3 — Codex P1 + 2 CodeRabbit Majors

Todo list

  • Post initial comment
  • Read internal/encryption/applier.go — verify idempotent-replay fix
  • Read internal/encryption/applier_test.go — verify test changes
  • Read store/encryption_glue.go — verify typed-nil fix
  • Read main.go — verify runtime cleanup fix
  • Post comprehensive review

All three fixes are correct. Detailed verification below.


Fix 1: Equal local_epoch idempotency (Codex P1)

Verdict: Correct. The split is clean and the dispatch ordering is right:

case 4 (collision): full_node_id mismatch → halt  [line 152]
case 2-idempotent: epoch == LastSeen → no-op nil  [line 158]
case 3: epoch < LastSeen → halt                   [line 171]
case 2: epoch > LastSeen → advance                [line 179]

The equal-epoch path returns nil without calling SetRegistryRow — the row is already correct, so nothing touches it. That's precisely the right no-op behavior: the idempotent path does not re-insert or re-validate, it simply returns successfully.

The TestApplyRegistration_Case3_Rollback test now exercises only the strict-rollback case (epoch 5 → epoch 4), and TestApplyRegistration_Case2_IdempotentReplay pins the no-op invariant correctly: the byte-comparison string(before) != string(after) will catch any unintended SetRegistryRow call that mutates the row during replay. ✓

Minor observation on test comments: applier_test.go:155 and applier.go:107 (ApplyRegistration godoc) include (PR #765 round-2 codex P1) inline references. The same issue was flagged in round-2 for nodeIDMask's godoc — see the consolidated note below.


Fix 2: Typed-nil Pebble store (store/encryption_glue.go:94)

Verdict: Correct. The !ok || ps == nil guard at WriterRegistryFor:94 catches both shapes:

  • !ok: wrong dynamic type (non-Pebble store)
  • ps == nil: typed-nil (*pebbleStore)(nil) that would pass the type assertion but nil-deref on first dereference of ps.dbMu or ps.db

Construction-time rejection means the failure mode is "binary refuses to start" rather than a panic deep inside the Raft apply loop. ✓


Fix 3: Close prior runtimes on error paths (main.go:707-719)

Verdict: Correct. Both new error paths now run the cleanup loop before returning:

for _, rt := range runtimes {
    rt.Close()
}
_ = st.Close()
return nil, nil, errors.Wrapf(err, ...)

This matches the existing buildRuntimeForGroup failure pattern at line 724. The loop correctly closes all previously-appended runtimes (those from earlier iterations). The current group's st is closed separately. The ordering (runtimes first, then st) is fine because the current st was not yet added to runtimes. ✓


Outstanding: Review-meta references in comments

This was flagged in round-2 for nodeIDMask and was not addressed in round-3. Round-3 added three more instances of the same pattern. All four should be trimmed before merge per the project convention ("don't reference the current task, fix, or callers — those belong in the PR description and rot as the codebase evolves"):

File Lines Meta-reference
applier.go 107 "— codex PR #765 round-2 P1" in ApplyRegistration godoc
applier.go 165–166 "(PR #765 round-2 codex P1)" in case-2-idempotent inline comment
applier.go 217–220 "gemini-code-assist PR #765 round-1 medium flagged..." in nodeIDMask godoc (carry-over from round-2)
store/encryption_glue.go 92–93 "— coderabbit PR #765 round-2 Major" in WriterRegistryFor inline comment
applier_test.go 155 "(PR #765 round-2 codex P1)" in TestApplyRegistration_Case2_IdempotentReplay

In each case the semantic explanation is valuable (keep it); only the PR #N round-M reviewer-name attribution belongs in the PR description, not the source. Fix this →


Summary

Fix Verdict
Codex P1: equal-epoch → no-op idempotent ✓ Correct
CodeRabbit Major: typed-nil Pebble rejection ✓ Correct
CodeRabbit Major: prior-runtime cleanup on error paths ✓ Correct
Review-meta comment cleanup (round-2 carry-over + new) Non-blocking — trim before merge

The implementation is correct across all three fixes. The only remaining item before merge is the meta-reference comment cleanup — the semantic content is fine, just the PR/reviewer attribution attribution should be stripped.

…tion from inline comments

claude[bot] r3 verdict was 'all three fixes correct' but
flagged 5 inline-comment sites that referenced 'PR #765
round-N reviewer-name' meta. CLAUDE.md:

  Don't reference the current task, fix, or callers
  ("used by X", "added for the Y flow", "handles the case
  from issue #123"), since those belong in the PR
  description and rot as the codebase evolves.

This commit strips the attribution from all 5 sites while
preserving the semantic content (why the code does what
it does, separate from who flagged it).

## Sites cleaned up

- internal/encryption/applier.go:107 — ApplyRegistration
  godoc case-2-idempotent paragraph: removed
  '— codex PR #765 round-2 P1'.

- internal/encryption/applier.go:158-162 — case-2-idempotent
  inline comment: removed '(PR #765 round-2 codex P1)'.

- internal/encryption/applier.go:213-217 — nodeIDMask godoc
  (round-2 carry-over): rewrote
  'gemini-code-assist PR #765 round-1 medium flagged the
  original localEpochMaskU64 name as misleading at this
  call site' → 'the name distinguishes this
  FullNodeID-truncation use from the local_epoch masking
  in adapter/encryption_admin.go' (semantic content
  preserved, attribution dropped).

- store/encryption_glue.go:92 — WriterRegistryFor typed-nil
  inline comment: removed '— coderabbit PR #765 round-2
  Major'.

- internal/encryption/applier_test.go:151 —
  TestApplyRegistration_Case2_IdempotentReplay godoc:
  removed '(PR #765 round-2 codex P1)'.

## Caller audit

Pure comment edit — no callers to audit. Code unchanged.

## Verification

- go build ./... — PASS
- go test -race -timeout=60s ./internal/encryption/... ./store/... . — PASS
- golangci-lint run ./internal/encryption/... ./store/... — 0 issues
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 16, 2026

@claude review

Round-4 (commit fec23220): addressed your r3 carry-over note on PR/review-attribution comments per CLAUDE.md convention.

Changes

5 inline-comment sites stripped of PR #N round-M reviewer-name attribution; semantic content preserved (why-the-code-does-what-it-does, separate from who flagged it).

File Lines Before → After
internal/encryption/applier.go 107 ApplyRegistration godoc — removed — codex PR #765 round-2 P1
internal/encryption/applier.go 158-162 case-2-idempotent inline — removed (PR #765 round-2 codex P1)
internal/encryption/applier.go 213-217 nodeIDMask godoc (r2 carry-over) — rewrote 'gemini-code-assist PR #765 round-1 medium flagged...' as 'the name distinguishes this FullNodeID-truncation use from the local_epoch masking in adapter/encryption_admin.go'
store/encryption_glue.go 92 WriterRegistryFor typed-nil — removed — coderabbit PR #765 round-2 Major
internal/encryption/applier_test.go 151 TestApplyRegistration_Case2_IdempotentReplay godoc — removed (PR #765 round-2 codex P1)

Caller audit

Pure comment edit — no callers to audit. Code unchanged.

Verification

  • go build ./... — PASS
  • go test -race ./internal/encryption/... ./store/... . — PASS
  • golangci-lint run ./internal/encryption/... ./store/... — 0 issues

@claude
Copy link
Copy Markdown

claude Bot commented May 16, 2026

Claude finished @bootjp's task in 1m 15s —— View job


Code Review: Stage 6A Round-4 — Comment attribution cleanup

Todo list

  • Post initial comment
  • Read internal/encryption/applier.go — verify 3 comment sites
  • Read internal/encryption/applier_test.go — verify test comment
  • Read store/encryption_glue.go — verify WriterRegistryFor comment
  • Full logic re-check (no regressions from comment-only edit)
  • Post comprehensive review

Round-4 verdict: all 5 sites correctly cleaned

The commit is a pure comment edit (11 insertions, 13 deletions across 3 files); no code changed. Each attribution site was verified against the current branch:

File Lines Before → After
applier.go 107 "…snapshotting this entry — codex PR #765 round-2 P1.""…snapshotting this entry."
applier.go 162–165 "pin the node in a restart loop (PR #765 round-2 codex P1). The row…""pin the node in a restart loop. The row…"
applier.go 216–217 "gemini-code-assist PR #765 round-1 medium flagged the original localEpochMaskU64 name…""the name distinguishes this FullNodeID-truncation use from the local_epoch masking in adapter/encryption_admin.go"
encryption_glue.go 92 "…'FSM apply panics deep in Raft loop' — coderabbit PR #765 round-2 Major""…'FSM apply panics deep in Raft loop'."
applier_test.go 151–155 "…snapshotting the entry (PR #765 round-2 codex P1).""…snapshotting the entry."

In all cases the semantic content is preserved — the why of the code remains intact; only the reviewer-attribution phrases were dropped.

Scan for stragglers: grep -rn "PR #765\|round-[0-9]" in the five touched files returns zero hits. Pre-existing attribution references in sidecar.go, fsmwire/wire.go, and kek/file_test.go are from earlier PRs and out of scope for this PR — nothing new introduced.


Overall Stage 6A assessment

Across all four rounds, the PR is clean:

Round Changes Verdict
r1 Initial applier + Pebble adapter + FSM wiring Correct
r2 localEpochMaskU64nodeIDMask rename Correct
r3 Idempotent equal-epoch, typed-nil rejection, runtime cleanup Correct
r4 Strip PR/review-attribution from 5 comment sites Correct

No outstanding issues. Ready to merge.

@bootjp bootjp merged commit 4dc931d into main May 16, 2026
9 checks passed
@bootjp bootjp deleted the feat/encryption-stage6a branch May 16, 2026 18:48
bootjp added a commit that referenced this pull request May 16, 2026
…e scope limitation

codex r2 P1 on round-2 (commit 0f946c1): the triple-gate
encryptionMutatorsEnabled() is purely node-local; during a
staged rolling upgrade where some peers still run the pre-6B-2
binary or have the flags unset, a mutator proposal committed
by a fully-configured leader would reach those replicas'
appliers and HaltApply on the under-configured side.

## Resolution

This is a real concern but the cluster-wide readiness check
belongs at the §7.1 Phase-1 cutover RPC layer
(enable-storage-envelope) in Stage 6D, not in the node-local
flag gate. PR #762's milestone breakdown explicitly assigns
the Voters ∪ Learners capability fan-out gate to Stage 6D.

Adding a partial cluster-wide check in Stage 6B-2 would
duplicate the §7.1 capability gate and ship before the
operator-facing GetCapability fan-out is wired into the
admin CLI. The right resolution is to document the scope
limitation explicitly so:

  1. Operators reading the code understand the gate's scope
     and the rolling-upgrade discipline expected at this
     layer (same constraint as PR #765's 6A→6B caveat).
  2. The Stage 6D PR has a clear handoff point — the
     cluster-wide gate is its scope, and this docstring
     points forward to that.

## Doc-only

No code paths changed; the gate's three-flag readback is
unchanged. Single 30-line docstring expansion explaining:

  - the gate is THIS NODE local
  - what cluster-wide readiness means at §7.1
  - why Stage 6B-2 deliberately defers it to Stage 6D
  - the operator-discipline constraint matching PR #765

## Caller audit

Pure doc edit. No callers to audit.

## Verification

- go build ./... — PASS
- go test -race -timeout=60s -run 'TestEncryptionAdmin|TestRegisterEncryptionAdminServer' . — PASS
- golangci-lint run ./... — 0 issues on touched files
bootjp added a commit that referenced this pull request May 16, 2026
…eference

claude r3 verdict was 'Ready to merge' with one nice-to-have
note: the scope-boundary docstring cited PR #765 by number
but not the underlying design doc path. PR links can become
stale; a path to the docs/design/ file makes the constraint
findable without GitHub access.

Added a reference to docs/design/2026_04_29_partial_data_at_rest_encryption.md
(specifically the 6A rationale's 'Rolling 6A→6B upgrade
caveat' section). The PR #765 reference is preserved as the
historical trigger; the design doc path is the canonical
source of the rolling-upgrade discipline going forward.

Doc-only — no code paths touched.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant