Skip to content

feat(encryption): Stage 5 PR-C — BootstrapEncryption + bootstrap CLI + Validate()#759

Merged
bootjp merged 4 commits into
mainfrom
feat/encryption-admin-bootstrap
May 16, 2026
Merged

feat(encryption): Stage 5 PR-C — BootstrapEncryption + bootstrap CLI + Validate()#759
bootjp merged 4 commits into
mainfrom
feat/encryption-admin-bootstrap

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented May 16, 2026

Summary

Stage 5 PR-C of the data-at-rest encryption rollout (design doc:
docs/design/2026_04_29_partial_data_at_rest_encryption.md, §5.6
bootstrap, §6.1 admin.go, §6.6 admin commands). Closes the last
mutating-RPC gap in the proto.EncryptionAdmin surface and adds the
production-wiring guard the PR-B review carry-forward identified.

Production-inert: main.go does not yet register
EncryptionAdminServer; that wiring is Stage 5D along with the
§5.6 step 1a capability fan-out helper.

What lands

  • BootstrapEncryption server impl: proposes the §5.6 0x04
    OpBootstrap entry. Validates storage_dek_id != 0 != raft_dek_id,
    storage_dek_id != raft_dek_id, non-empty wrapped DEK pair,
    non-empty writer batch, each writer's full_node_id != 0 /
    local_epoch <= 0xFFFF, and 2 * len(writers) <= bootstrapBatchRowCap (mirrors fsmwire.maxBootstrapBatchCount).
    Expands each writer into a storage-DEK row + a raft-DEK row so
    the §4.1 nonce-uniqueness invariant covers both purposes from
    the first post-bootstrap entry.
  • elastickv-admin encryption bootstrap CLI: takes
    --storage-dek-id, --raft-dek-id, --wrapped-storage-dek
    (base64), --wrapped-raft-dek (base64), and repeated
    --writer=<full_node_id>:<local_epoch>. The capability
    fan-out is deferred; the operator ships a pre-built batch.
  • Validate() method: fails closed when a proposer is wired
    without a LeaderView so production code in main.go cannot
    silently let followers mutate state. Tests still get the
    no-leaderView affordance for proposer-side unit tests.
  • Design doc: Stage 5 row split — 5A shipped (feat(encryption): Stage 5 PR-A — EncryptionAdmin proto + read-only RPCs + status CLI #754), 5B
    shipped (feat(encryption): Stage 5 PR-B — RotateDEK + RegisterEncryptionWriter mutators #756), 5C in PR, 5D open.

Out of scope (Stage 5D / Stage 6)

  • §5.6 step 1a capability fan-out helper (auto-build the writer
    batch by dialing every cluster member's GetCapability).
  • main.go gRPC wiring (register EncryptionAdminServer on the
    shard gRPC listener, plumb proposer/leaderView from the shard's
    raftengine.Engine, call Validate() at startup).
  • Stage 6 --encryption-enabled cluster flag + §9.1 startup
    refusal guards.

Test plan

  • go test -race -timeout=60s ./adapter -run TestEncryptionAdmin
    — adds:
    • BootstrapEncryption_HappyPath (wire-layout round-trip via
      fsmwire.DecodeBootstrap, row-ordering pinned).
    • BootstrapEncryption_RejectsBadInputs (8-case table: zero
      ids, equal ids, empty wrapped DEKs, empty batch, zero
      full_node_id, epoch > 0xFFFF).
    • BootstrapEncryption_RejectsOversizeBatch (above
      bootstrapBatchRowCap).
    • Validate_* three-way: no proposer / both wired /
      proposer-without-leaderView.
    • Existing MutatingRPCs_RejectWithoutProposer extended to
      cover Bootstrap.
  • go test -race -timeout=60s ./cmd/elastickv-admin -run TestRunEncryptionBootstrap
    — CLI happy path against a stub gRPC server + 5-case
    parse-error table (--writer malformed).
  • golangci-lint run on touched paths — 0 issues.
  • CI verifies build / lint / Jepsen.

Self-review (CLAUDE.md 5 passes)

  1. Data loss — leader-gated mutator on top of Stage 4's
    HaltApply seam. Empty batch / oversize batch rejected at
    the gRPC boundary. Idempotency lives at the FSM apply layer
    (rejects if active.storage != 0).
  2. Concurrency — no new mutable state in the server. The
    §5.6 step 1a fan-out itself is intentionally deferred.
  3. Performance — hot path unchanged. Mutator is one-shot
    per cluster lifetime.
  4. Consistencyfsmwire encoder is the single source of
    truth for the on-wire byte layout. The round-trip test locks
    the server's view against fsmwire's. Row-ordering is
    pinned (storage row i precedes raft row i).
  5. Test coverage — boundary table, oversize batch,
    no-proposer rejection, leader-gating (inherited from
    requireLeader), Validate() three-way coverage, CLI happy
    path, CLI parse-time error table.

Notes

Summary by CodeRabbit

  • New Features

    • Added a leader-only "Bootstrap Encryption" admin RPC and an "encryption bootstrap" CLI to initialize encryption state in a single operation.
  • Improvements

    • Stricter server startup validation to catch invalid configurations early.
    • Tightened wrapped-DEK and writer-entry validation (size, bounds, uniqueness).
  • Bug Fixes / Tests

    • Extensive tests covering bootstrap, validation, and rotate-DEK rejection cases.
  • Documentation

    • Updated encryption-at-rest rollout milestones to reflect bootstrap completion.

Review Change Stack

…+ Validate()

Stage 5 PR-C of the data-at-rest encryption rollout (design doc
docs/design/2026_04_29_partial_data_at_rest_encryption.md, §5.6
bootstrap, §6.1 admin.go, §6.6 admin commands). Closes the last
mutating-RPC gap in the proto.EncryptionAdmin surface and adds
the production-wiring guard the PR-B review carry-forward
identified.

## adapter/encryption_admin.go

- BootstrapEncryption proposes the §5.6 0x04 OpBootstrap entry on
  top of Stage 3's raft envelope and Stage 4's fsmwire body
  encoder. Validates:
  - storage_dek_id != 0
  - raft_dek_id != 0
  - storage_dek_id != raft_dek_id (no slot collision)
  - non-empty wrapped DEK pair
  - writer_batch contains at least one member
  - each writer.full_node_id != 0 (§6.1 sentinel rule)
  - each writer.local_epoch <= 0xFFFF (§4.1 16-bit bound)
  - 2*len(writers) <= bootstrapBatchRowCap = 16384 (matches
    fsmwire.maxBootstrapBatchCount; rejected at the gRPC
    boundary so a too-large batch does not round-trip through
    Propose)
- buildBootstrapBatch / validateBootstrapWriter expand each
  writer into the §5.6 step 1a two-row layout (one
  storage-DEK row + one raft-DEK row) so the §4.1
  nonce-uniqueness invariant covers both envelope purposes from
  the first post-bootstrap entry. Row ordering is fixed: storage
  row i precedes raft row i, locked by the wire-layout
  round-trip test.
- New Validate() method enforces the option-pairing invariant:
  proposer wired without leaderView fails closed. Production
  wiring in main.go must call this; tests are explicitly allowed
  to skip it (the proposer-without-leaderView path is a useful
  test affordance for the propose-side wiring).

## cmd/elastickv-admin/encryption_mutators.go

- New `encryption bootstrap` subcommand:
  --storage-dek-id, --raft-dek-id (uint32, non-zero, distinct)
  --wrapped-storage-dek, --wrapped-raft-dek (base64)
  --writer=<full_node_id>:<local_epoch> (repeated; required)
- parseBootstrapArgs / validateBootstrapDEKIDs /
  decodeBootstrapWrappedDEKs split out for cyclop.
- parseWriterBatch enforces the same per-writer invariants the
  server enforces (zero node id, epoch overflow, malformed
  string) so the CLI fails fast before the Propose round-trip.

## cmd/elastickv-admin/encryption.go

- encryptionMain dispatches the new subcommand and updates the
  usage text + unknown-subcommand error.

## Tests

adapter/encryption_admin_test.go:
- BootstrapEncryption happy path round-trips through
  fsmwire.DecodeBootstrap. Verifies storage/raft row pairing
  per writer.
- BootstrapEncryption rejects every boundary input (8 cases in
  a table: zero ids, equal ids, empty wrapped DEKs, empty
  batch, zero full_node_id, epoch overflow).
- BootstrapEncryption rejects oversize batch (just above the
  bootstrapBatchRowCap boundary).
- BootstrapEncryption inherits the no-proposer FailedPrecondition
  contract via TestEncryptionAdmin_MutatingRPCs_RejectWithoutProposer.
- Validate() three-way coverage: proposer-without-leaderView →
  error; read-only server → nil; both wired → nil.

cmd/elastickv-admin/encryption_test.go:
- runEncryptionBootstrap happy path against a stub gRPC server.
- runEncryptionBootstrap rejects 5 malformed --writer cases at
  the CLI parse boundary.

## Out of scope (Stage 5D / Stage 6)

- §5.6 step 1a capability fan-out helper. PR-C lets the operator
  ship a pre-built writer batch on the CLI; a future iteration
  adds an in-CLI fan-out that dials every cluster member's
  GetCapability and assembles the batch automatically.
- main.go gRPC wiring (register EncryptionAdminServer on the
  shard gRPC listener, plumb proposer/leaderView from the
  shard's raftengine.Engine, call Validate at startup). Stage 6
  cluster-flag gate (`--encryption-enabled`) will turn this on.

## Self-review (CLAUDE.md 5 passes)

1. Data loss: leader-gated mutator on top of Stage 4 HaltApply
   seam. Empty batch rejected at boundary. Oversize batch
   rejected at boundary. Idempotency lives at the FSM apply
   layer (rejects if active.storage != 0).
2. Concurrency: no new mutable state. Proposer is set once at
   construction. The §5.6 step 1a fan-out itself is deferred.
3. Performance: hot path unchanged. Mutator is one-shot per
   cluster lifetime.
4. Consistency: fsmwire encoder is the single source of truth
   for the on-wire byte layout. Round-trip test
   (Encode→DecodeBootstrap) locks server's view against
   fsmwire's. Row-ordering pinned (storage_i, raft_i).
5. Test coverage: boundary table (8 cases), oversize batch,
   no-proposer rejection, leader-gating (inherited from
   requireLeader), Validate() three-way coverage, CLI happy
   path, CLI parse-time error table.

## Verification

- go test -race -timeout=60s ./adapter ./cmd/elastickv-admin: PASS
- golangci-lint run on touched paths: 0 issues
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 16, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 53 minutes and 35 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2ff51360-63e0-48bc-981d-3dc3e3195f96

📥 Commits

Reviewing files that changed from the base of the PR and between b80cfe8 and 421c7fc.

📒 Files selected for processing (1)
  • cmd/elastickv-admin/encryption_test.go
📝 Walkthrough

Walkthrough

Adds a leader-gated BootstrapEncryption RPC with validation and batch-building helpers, a CLI encryption bootstrap command and parsing/runtime, comprehensive server and CLI tests, and a docs milestone update.

Changes

Encryption Bootstrap Feature

Layer / File(s) Summary
Server validation contract and BootstrapEncryption implementation
adapter/encryption_admin.go
EncryptionAdminServer now requires Validate() to enforce proposer+leaderView pairing; BootstrapEncryption RPC validates DEK ids and wrapped DEKs, builds and bounds an expanded writer-registry payload, encodes an FSM bootstrap message, proposes it via raft, and returns the applied index.
BootstrapEncryption server tests
adapter/encryption_admin_test.go
Validation wiring tests cover proposer-without-leaderView rejection, read-only acceptance, and fully-wired acceptance; happy-path verifies proposal opcode, applied index, and fsmwire.DecodeBootstrap round-trip; table-driven and boundary tests cover invalid request fields and oversize batches.
CLI bootstrap command integration
cmd/elastickv-admin/encryption.go
Adds bootstrap subcommand case to the encryption dispatcher and updates help/usage text to advertise the new command.
CLI bootstrap argument parsing and execution
cmd/elastickv-admin/encryption_mutators.go
runEncryptionBootstrap orchestrates client dialing and RPC execution; parseBootstrapArgs defines and validates --storage-dek-id, --raft-dek-id, --wrapped-storage-dek, --wrapped-raft-dek, and repeated --writer=<node_id>:<epoch> flags; decoding and numeric bounds are enforced and converted into pb.BootstrapEncryptionRequest.
CLI bootstrap tests
cmd/elastickv-admin/encryption_test.go
Test stub captures BootstrapEncryption calls; happy-path test validates output and request field round-trip; negative test verifies writer-flag format and value rejection.
Design documentation updates
docs/design/2026_04_29_partial_data_at_rest_encryption.md
Stage 5C shipped scope reassigned to include BootstrapEncryption RPC, encryption bootstrap CLI, and Validate() enforcement; capability fan-out helper moved to Stage 5D open work.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • bootjp/elastickv#748: Introduced the Stage-4 fsmwire bootstrap wire format and DecodeBootstrap used for payload round-trip verification in these tests.
  • bootjp/elastickv#707: Related design work for encryption-at-rest bootstrapping and registry/rotation protocol that this PR implements parts of.
  • bootjp/elastickv#754: Prior PR that left mutating EncryptionAdmin RPCs unimplemented; this PR fills in BootstrapEncryption and expands validation.

Poem

🐰 I bounced in with keys to sow,
Proposer gates keep changes slow,
CLI flags parsed, writers spun,
FSM encodes the bootstrap run,
Raft applied — our cipher grows.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 43.90% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly identifies the primary change as implementing Stage 5 PR-C of the encryption rollout, specifically BootstrapEncryption RPC, bootstrap CLI, and Validate() method.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/encryption-admin-bootstrap

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
adapter/encryption_admin_test.go (1)

411-447: 💤 Low value

Consider adding a test case for nil writer in the batch.

The implementation at encryption_admin.go:425-428 validates against nil writers, but this table doesn't include a test case for writer_batch containing a nil entry. This would ensure the nil-check path is exercised.

♻️ Suggested test case addition
 		{"writer with local_epoch above 0xFFFF", func(r *pb.BootstrapEncryptionRequest) {
 			r.WriterBatch = []*pb.WriterRegistryEntry{{FullNodeId: 1, LocalEpoch: 0x10000}}
 		}},
+		{"nil writer in batch", func(r *pb.BootstrapEncryptionRequest) {
+			r.WriterBatch = []*pb.WriterRegistryEntry{nil}
+		}},
 	} {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@adapter/encryption_admin_test.go` around lines 411 - 447, Add a table-driven
test case to TestEncryptionAdmin_BootstrapEncryption_RejectsBadInputs that
inserts a nil entry into WriterBatch (e.g., append a nil
*pb.WriterRegistryEntry) to exercise the nil-writer validation in
BootstrapEncryption; ensure the test builds a valid request via
validBootstrapEncryptionRequest(), mutates WriterBatch to include a nil entry,
calls srv.BootstrapEncryption(...), and asserts the returned gRPC status code is
codes.InvalidArgument just like the other cases.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cmd/elastickv-admin/encryption_mutators.go`:
- Around line 305-313: The decodeBootstrapWrappedDEKs function currently calls
decodeWrappedDEKFlag which emits errors referencing --wrapped-new-dek; replace
that with a flag-specific helper (e.g., decodeWrappedDEKFlagWithName(flagName, s
string)) and call it twice from decodeBootstrapWrappedDEKs passing
"wrapped-storage-dek" for storageB64 and "wrapped-raft-dek" for raftB64 so
errors mention the exact bootstrap flag; ensure the helper validates empty
string, base64 decode errors, and zero-length decoded output and wraps/errors
use the provided flagName in their messages.

---

Nitpick comments:
In `@adapter/encryption_admin_test.go`:
- Around line 411-447: Add a table-driven test case to
TestEncryptionAdmin_BootstrapEncryption_RejectsBadInputs that inserts a nil
entry into WriterBatch (e.g., append a nil *pb.WriterRegistryEntry) to exercise
the nil-writer validation in BootstrapEncryption; ensure the test builds a valid
request via validBootstrapEncryptionRequest(), mutates WriterBatch to include a
nil entry, calls srv.BootstrapEncryption(...), and asserts the returned gRPC
status code is codes.InvalidArgument just like the other cases.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4d9325e0-b34d-4523-b4f9-f53648e5af31

📥 Commits

Reviewing files that changed from the base of the PR and between 7fdc8be and 7cab7f1.

📒 Files selected for processing (6)
  • adapter/encryption_admin.go
  • adapter/encryption_admin_test.go
  • cmd/elastickv-admin/encryption.go
  • cmd/elastickv-admin/encryption_mutators.go
  • cmd/elastickv-admin/encryption_test.go
  • docs/design/2026_04_29_partial_data_at_rest_encryption.md

Comment thread cmd/elastickv-admin/encryption_mutators.go
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements the BootstrapEncryption RPC and its corresponding CLI subcommand, completing a significant portion of the encryption roadmap. Key changes include the addition of the Validate method to EncryptionAdminServer to prevent misconfiguration, the implementation of the BootstrapEncryption logic with strict boundary checks, and the expansion of the administrative CLI to support cluster bootstrapping. Feedback focuses on strengthening the robustness of the bootstrap process by suggesting additional validation for node ID uniqueness within the writer batch to prevent fatal FSM errors and recommending size limits on wrapped DEK payloads to mitigate potential denial-of-service vectors.

Comment thread adapter/encryption_admin.go Outdated
"encryption: writer_batch produces %d registry rows, exceeding the §4.1 bound %d",
totalRows, bootstrapBatchRowCap)
}
out := make([]fsmwire.RegistrationPayload, 0, totalRows)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The writer_batch should be validated for uniqueness of full_node_id and its 16-bit truncation. If the batch contains duplicates or collisions on the 16-bit node_id, the FSM apply will fail with ErrLocalEpochRollback or ErrNodeIDCollision. Since FSM apply errors are fatal (per §6.3), a malformed bootstrap request could halt the entire cluster. Validating this at the gRPC boundary ensures the proposed Raft entry is safe to apply.

	seenFull := make(map[uint64]bool)
	seenTrunc := make(map[uint16]bool)
	for _, w := range writers {
		if w == nil { continue }
		id := w.GetFullNodeId()
		trunc := uint16(id)
		if seenFull[id] {
			return nil, grpcStatusErrorf(codes.InvalidArgument, "encryption: duplicate full_node_id %d in batch", id)
		}
		if seenTrunc[trunc] {
			return nil, grpcStatusErrorf(codes.InvalidArgument, "encryption: node_id collision on 0x%04x in batch", trunc)
		}
		seenFull[id], seenTrunc[trunc] = true, true
	}
	out := make([]fsmwire.RegistrationPayload, 0, totalRows)
References
  1. Apply validation to external requests to reject malformed data that could cause fatal errors or OOM issues.

Comment thread adapter/encryption_admin.go Outdated
Comment on lines +361 to +366
if len(req.GetWrappedStorageDek()) == 0 {
return nil, grpcStatusError(codes.InvalidArgument, "encryption: wrapped_storage_dek is required")
}
if len(req.GetWrappedRaftDek()) == 0 {
return nil, grpcStatusError(codes.InvalidArgument, "encryption: wrapped_raft_dek is required")
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

It is recommended to validate the size of the wrapped_storage_dek and wrapped_raft_dek payloads. The fsmwire.EncodeBootstrap helper uses safeU32 which panics if the length exceeds math.MaxUint32. While gRPC message limits usually prevent this, explicit validation here avoids a potential DoS vector where a crafted request could crash the leader process.

if len(req.GetWrappedStorageDek()) == 0 {
		return nil, grpcStatusError(codes.InvalidArgument, "encryption: wrapped_storage_dek is required")
	}
	if len(req.GetWrappedStorageDek()) > 4096 {
		return nil, grpcStatusError(codes.InvalidArgument, "encryption: wrapped_storage_dek exceeds maximum allowed size")
	}
	if len(req.GetWrappedRaftDek()) == 0 {
		return nil, grpcStatusError(codes.InvalidArgument, "encryption: wrapped_raft_dek is required")
	}
	if len(req.GetWrappedRaftDek()) > 4096 {
		return nil, grpcStatusError(codes.InvalidArgument, "encryption: wrapped_raft_dek exceeds maximum allowed size")
	}
References
  1. When handling requests, always limit the size of the request body or payloads to prevent Denial of Service attacks from oversized payloads.

…bbit minor

Addresses three findings on PR #759. The high-priority item is the
cluster-safety regression that would crash a cluster on a
malformed bootstrap; addressed first per the prioritization rule.

## gemini high — writer_batch uniqueness validation

A bootstrap request with duplicate `full_node_id` OR two distinct
`full_node_id`s colliding on the uint16 narrowing used as the
§4.1 registry-row key (`!encryption|writers|<dek_id>|<uint16(node_id)>`)
reaches the FSM with a payload that fails closed at apply time
(case 3: ErrLocalEpochRollback for the duplicate, case 4:
ErrNodeIDCollision for the collision). FSM apply errors halt
the apply loop via the HaltApply seam, so a single malformed
bootstrap would stop every replica in the cluster.

New `validateWriterBatchUniqueness` rejects both cases at the
gRPC boundary with codes.InvalidArgument before the Propose
round-trip. The check uses a uint64→uint16 narrowing identical
to the registry row key.

Regression tests:
- TestEncryptionAdmin_BootstrapEncryption_RejectsDuplicateFullNodeID
- TestEncryptionAdmin_BootstrapEncryption_RejectsNodeIDCollision

Caller audit: validateWriterBatchUniqueness is called only from
buildBootstrapBatch which is called only from BootstrapEncryption.
No existing happy-path test uses duplicates or 16-bit collisions.

## gemini security-medium — wrapped DEK size cap

fsmwire.EncodeBootstrap uses safeU32 on the length prefix; a
crafted oversize wrapped DEK could reach the panic guard or
inflate the proposed Raft entry beyond practical limits. Add a
4 KiB per-field cap at the gRPC boundary (real KEK-wrapped 32B
DEKs run hundreds of bytes; 4 KiB is ~10x headroom).

New `validateWrappedDEK` enforces the cap and the existing
non-empty check in one place. Both wrapped DEK validations
route through it.

Regression test:
- TestEncryptionAdmin_BootstrapEncryption_RejectsOversizeWrappedDEK
  (covers both storage and raft fields with maxWrappedDEKSize+1
  bytes)

## coderabbit minor — bootstrap wrapped-DEK error flag names

decodeWrappedDEKFlag emitted "--wrapped-new-dek" in its error
text (the rotate-dek flag name); bootstrap users got mixed flag
names in one message. New decodeWrappedDEKFlagWithName takes the
flag name and formats the error accordingly. decodeWrappedDEKFlag
keeps its signature by delegating to the helper with the
rotate-dek default. decodeBootstrapWrappedDEKs passes the correct
bootstrap flag names.

## Verification

- go test -race -timeout=60s ./adapter ./cmd/elastickv-admin: PASS
- golangci-lint run on touched paths: 0 issues
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 16, 2026

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented May 16, 2026

Claude finished @bootjp's task in 3m 7s —— View job


Code Review — Stage 5 PR-C (BootstrapEncryption + CLI + Validate)

  • Read changed files
  • Analyze BootstrapEncryption implementation
  • Analyze Validate() method
  • Analyze CLI (encryption_mutators.go)
  • Analyze tests
  • Review existing CodeRabbit/Gemini comments for validity

Prior Review Comments — Status Update

Both Gemini's concerns and CodeRabbit's actionable comment are already resolved in the submitted code:

  • Gemini: writer batch uniquenessvalidateWriterBatchUniqueness() (encryption_admin.go:482) catches duplicate full_node_id and uint16-narrowing collisions before a Propose round-trip. Tests RejectsDuplicateFullNodeID and RejectsNodeIDCollision pin both cases.
  • Gemini: wrapped DEK size capvalidateWrappedDEK() (encryption_admin.go:462) enforces maxWrappedDEKSize = 4096 for both storage and raft fields. RejectsOversizeWrappedDEK covers it.
  • CodeRabbit: --wrapped-new-dek in bootstrap error messagesdecodeBootstrapWrappedDEKs already calls decodeWrappedDEKFlagWithName("wrapped-storage-dek", ...) and decodeWrappedDEKFlagWithName("wrapped-raft-dek", ...) directly (encryption_mutators.go:314, 318). The decodeWrappedDEKFlag wrapper for rotate-dek now delegates to the name-aware helper. No action needed.

New Findings

1. Dangling comment in encryption_mutators.go (lines 209–213) — Bug

The comment block at lines 209–213 reads:

// narrowUint32 mirrors adapter.uint32ToLocalEpoch: callers MUST
// have validated `v` against math.MaxUint32 first (via
// requireUint32). The masked conversion is a defence-in-depth
// truncation that cannot drift even if a future caller skips the
// bound check.
// runEncryptionBootstrap invokes EncryptionAdmin.BootstrapEncryption.

This merges the doc comment for narrowUint32 with the intro of runEncryptionBootstrap in a single comment block, without a blank line separator. The narrowUint32 function itself is defined at line 371, where the preceding comment only covers uint32Mask — the body-level explanation ("callers MUST have validated...") never reaches its target. The merged block renders as the runEncryptionBootstrap godoc in tooling.

Fix this →

2. RotateDEK has no upper bound on wrapped_new_dek — Inconsistency

BootstrapEncryption now calls validateWrappedDEK() which enforces the 4096-byte cap. RotateDEK only checks len(req.GetWrappedNewDek()) == 0 (encryption_admin.go:541); there is no upper-bound check. A crafted wrapped_new_dek > math.MaxUint32 bytes could still trigger the safeU32 panic in fsmwire.EncodeRotation (gRPC's 4 MiB message limit is the only guard today). This is pre-existing relative to this PR, but since validateWrappedDEK now exists and the same logic applies, it is worth wiring it in here too for consistency.

Fix this →

3. Nil writer test case missing — Nit (CodeRabbit confirmed valid)

validateBootstrapWriter correctly rejects nil entries at encryption_admin.go:428. However TestEncryptionAdmin_BootstrapEncryption_RejectsBadInputs does not have a {"nil writer in batch", func(r) { r.WriterBatch = []*pb.WriterRegistryEntry{nil} }} case. The nil path is exercised transitively, but the absence of an explicit table entry leaves the nil-writer guard invisible to a reader scanning the test table.

Fix this →

4. Uniqueness validation runs after building output — Minor

In buildBootstrapBatch (encryption_admin.go:394), the loop builds the full out slice (up to bootstrapBatchRowCap rows) before validateWriterBatchUniqueness is called. For a near-cap batch with a collision at the end, the work of building ~16K entries is thrown away. The fix is to run validateWriterBatchUniqueness(writers) before allocating and filling out. Functionally correct as-is; purely an efficiency note.

5. Dead nil-guard in validateWriterBatchUniqueness — Minor

validateWriterBatchUniqueness has if w == nil { continue } at line 486. By the time this function is reached, validateBootstrapWriter has already returned an error for any nil entry, so the guard is unreachable in practice. It is harmless but could mislead readers into thinking nil writers might validly pass the earlier loop.


What's well done

  • Validate() design is clean: fails closed in production, permissive in unit tests (nil leaderView → always-leader), and the invariant is clearly documented in both the godoc and NewEncryptionAdminServer.
  • The two-rows-per-writer expansion and storage/raft row ordering are pinned by the assertBootstrapWriterRows helper — future refactors of fsmwire.EncodeBootstrap will break the test before they break production.
  • validateWriterBatchUniqueness correctly catches both the exact-duplicate and the uint16-narrowing-collision cases that would otherwise cause a fatal FSM apply halt. Having tests for both scenarios is the right call.
  • decodeWrappedDEKFlagWithName unifying the error-message logic for all three wrapped-DEK flags is clean.
  • proposeErrorToStatus / verifyLeaderErrorToStatus mapping is consistent and gives clients structured retry semantics.

Summary: Two items worth fixing before merge — the dangling comment (documentation correctness) and the RotateDEK size cap gap (consistency/defence-in-depth). The nil writer test case and the ordering nit are low priority. The prior CodeRabbit/Gemini findings are all resolved.

…r cleanups

claude[bot] round-1 confirmed all prior Gemini/CodeRabbit findings
resolved and surfaced 5 new items. This commit addresses every one.

## #1 bug — dangling narrowUint32 docstring (encryption_mutators.go)

A prior commit collapsed the narrowUint32 doc comment into the
runEncryptionBootstrap godoc block without a blank-line separator,
so the body-level explanation ("callers MUST have validated v
against math.MaxUint32 first") never reached its target function
and tooling rendered the narrowUint32 rationale as part of
runEncryptionBootstrap's godoc. Move the comment to immediately
precede the actual narrowUint32 declaration.

## #2 inconsistency — RotateDEK lacks wrapped_new_dek size cap

BootstrapEncryption gained a 4096-byte cap via validateWrappedDEK
in round-1; RotateDEK still only checked for empty. A crafted
oversize wrapped_new_dek could push fsmwire.EncodeRotation toward
its safeU32 length-prefix guard (gRPC's 4 MiB default was the only
upstream guard). Route RotateDEK through validateWrappedDEK("new",
...) so both mutators share the same fail-closed bound.

Caller audit (fail-closed semantic change): RotateDEK now rejects
wrapped_new_dek > maxWrappedDEKSize at the gRPC boundary.
- CLI runEncryptionRotateDEK: --wrapped-new-dek base64 input. A
  user passing > 4 KiB hits the new bound; realistic KEK-wrapped
  DEKs run hundreds of bytes, well under the cap.
- Other gRPC clients: future Stage 6+ wiring inherits the bound
  cluster-wide.
- Tests: validRotateDEKRequest uses 13-byte "wrapped-bytes"
  fixture; unchanged.

No silent semantic regression downstream.

Regression test:
- TestEncryptionAdmin_RotateDEK_RejectsOversizeWrappedDEK
  exercises the maxWrappedDEKSize+1 boundary.

## #3 nit — nil writer test case missing

Add {"nil writer in batch", ...} to
TestEncryptionAdmin_BootstrapEncryption_RejectsBadInputs so the
validateBootstrapWriter nil-guard has an explicit table entry
rather than being covered only transitively.

## #4 minor — uniqueness check runs after building output

buildBootstrapBatch was building the full registry-row slice
(up to bootstrapBatchRowCap = 16384 rows) before
validateWriterBatchUniqueness ran. For a near-cap batch with a
collision at the tail, the work was thrown away. Restructure:

  1. Run validateBootstrapWriter over every entry (per-writer
     shape).
  2. Run validateWriterBatchUniqueness (cross-entry).
  3. Allocate and fill `out` from the validated slice.

No behaviour change for valid batches; invalid batches now
short-circuit before the allocation. Tests cover both halves.

## #5 minor — dead nil-guard in validateWriterBatchUniqueness

With the reorder above, validateBootstrapWriter rejects any nil
entry before validateWriterBatchUniqueness ever sees it. The
`if w == nil { continue }` defensive check inside the latter is
now unreachable; remove it and document the contract in the
function's godoc.

## Verification

- go test -race -timeout=60s ./adapter ./cmd/elastickv-admin: PASS
- golangci-lint run on touched paths: 0 issues
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 16, 2026

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented May 16, 2026

Claude finished @bootjp's task in 2m 38s —— View job


Code Review — Stage 5 PR-C (BootstrapEncryption + CLI + Validate) — Round 2

  • Read changed files
  • Analyze BootstrapEncryption + Validate() implementation
  • Analyze CLI (encryption_mutators.go)
  • Analyze tests
  • Check fix commits resolve prior review findings
  • Post final review

Prior Review Findings — All Resolved ✓

Every issue raised in the round-1 reviews (CodeRabbit, Gemini, and prior Claude) is addressed in commits c7037ba and b80cfe8:

Finding Resolution
Dangling comment (narrowUint32 doc merged into runEncryptionBootstrap) Split — narrowUint32 doc now sits at line 366, runEncryptionBootstrap gets its own clean doc at line 209
RotateDEK no upper bound on wrapped_new_dek Now calls validateWrappedDEK("new", ...) at line 551; TestEncryptionAdmin_RotateDEK_RejectsOversizeWrappedDEK pins it
Nil writer test case missing Added as the last entry in the RejectsBadInputs table (line 437–439)
Uniqueness validation ran after building output Both validation passes now precede make([]fsmwire.RegistrationPayload, ...) at line 419; documented in comment
Dead nil-guard in validateWriterBatchUniqueness Removed; invariant documented with a "caller guarantees non-nil" comment
decodeBootstrapWrappedDEKs emitting --wrapped-new-dek in errors decodeWrappedDEKFlagWithName now called directly with the flag-specific name; decodeWrappedDEKFlag delegates to it

Fresh Findings (post-fix)

1. CLI test gap: empty --writer flag batch — Minor

parseBootstrapArgs (line 273) returns an error when len(writerFlags) == 0. This branch has no CLI-level test. The server-side path is covered (RejectsBadInputs → "empty writer_batch"), but the CLI parse layer (errors.New("at least one --writer=...")) is not exercised by TestRunEncryptionBootstrap_RejectsBadWriter. The 5-case table there covers malformed --writer values, not the missing-entirely case.

Fix this →

2. parseWriterBatch epoch parse uses bitSize=32 but validates for uint16 — Observation (not a bug)

strconv.ParseUint(epochStr, 10, 32) (line 348) parses up to 4 294 967 295; the epoch > math.MaxUint16 check on line 350 is then the real gate. This is correct behaviour (using bitSize=16 would surface a parse error for 65536 rather than the cleaner custom error), but worth noting that the bitSize=32 is load-bearing as a permissive parse specifically so the custom error message fires. No change needed — just ensuring the reasoning is visible to reviewers.


What's well done

The two fix commits addressed every actionable comment cleanly and without scope creep:

  • buildBootstrapBatch now validates before allocating: both the per-writer shape pass and the uniqueness pass complete before make([]fsmwire.RegistrationPayload, ...), so a near-cap batch with a collision at the end wastes no allocations.
  • The validateWriterBatchUniqueness nil-guard removal is paired with a clear invariant comment — future readers won't wonder why the nil case is skipped.
  • RotateDEK wrapped-DEK size cap is consistent with Bootstrap now, and the TestEncryptionAdmin_RotateDEK_RejectsOversizeWrappedDEK test is a proper twin of the Bootstrap test.
  • decodeWrappedDEKFlag correctly delegates to decodeWrappedDEKFlagWithName("wrapped-new-dek", ...) so rotate-dek error messages are unchanged while bootstrap gets flag-specific text.
  • The Validate() design is clean: fails closed in production (proposer-without-leaderView), permissive for unit tests (nil leaderView = always-leader), and the three-way test coverage (OKWithoutProposer, OKWithBothWired, RejectsProposerWithoutLeaderView) is solid.

Verdict: The two actionable items from round 1 are fully resolved. The single remaining gap (CLI test for missing --writer) is low priority. This is ready to merge once you decide whether to chase the missing CLI test case before merging or defer it.

claude[bot] round-2 confirmed all actionable items resolved; the
single remaining gap was a CLI-side test for the missing-entirely
case (no --writer flags at all). The server-side "empty
writer_batch" path is covered by adapter tests; this fills the
parse-layer test gap with TestRunEncryptionBootstrap_RejectsEmptyWriterBatch.

The parseWriterBatch bitSize=32 vs uint16 observation is
intentional design (permissive parse so the custom epoch-bound
error fires instead of strconv's parse error) and needs no
change.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 16, 2026

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented May 16, 2026

Claude finished @bootjp's task in 3m 9s —— View job


Code Review — Stage 5 PR-C (Round 3)

  • Read changed files
  • Verify Round 2 finding resolved (CLI empty --writer batch test)
  • Fresh review pass on all changed files
  • Post final verdict

Round 2 Finding — Resolved ✓

TestRunEncryptionBootstrap_RejectsEmptyWriterBatch (lines 257–278 of encryption_test.go) was added in commit 421c7fc. It calls runEncryptionBootstrap with valid DEK ids and wrapped-DEK flags but no --writer flags, then asserts a non-nil error containing "writer". This is the exact path in parseBootstrapArgs at line 273 that was previously uncovered. The test uses "127.0.0.1:1" as the endpoint, which is safe here because the parse error returns before any dial occurs.


Fresh Pass — No New Issues

All code paths reviewed:

Area Finding
buildBootstrapBatch validation ordering Both per-writer shape pass and uniqueness pass precede make(...). Correct.
validateWrappedDEK scope Called for both storage and raft fields in BootstrapEncryption and for wrapped_new_dek in RotateDEK. Consistent.
narrowUint32(uint(epoch)) in parseWriterBatch strconv.ParseUint(..., 10, 32) bounds epoch ≤ math.MaxUint32 before the uint(epoch) conversion. Safe on all platforms.
decodeWrappedDEKFlagdecodeWrappedDEKFlagWithName decodeWrappedDEKFlag delegates to decodeWrappedDEKFlagWithName("wrapped-new-dek", ...) so rotate-dek error messages are unchanged; bootstrap calls the name-aware variant directly with "wrapped-storage-dek" / "wrapped-raft-dek". Correct.
narrowUint32 comment placement Clean — doc comment now sits at line 366, directly above the function at 371. runEncryptionBootstrap has its own separate doc at line 209.
nil writer in batch test case Present in RejectsBadInputs at line 437–439.
RotateDEK size cap consistency Calls validateWrappedDEK("new", ...) at line 551; TestEncryptionAdmin_RotateDEK_RejectsOversizeWrappedDEK pins it.
Validate() three-way coverage OKWithoutProposer / OKWithBothWired / RejectsProposerWithoutLeaderView all present.

One theoretical observation (not actionable): bootstrapRowsPerWriter * len(writers) could overflow on a hypothetically enormous len(writers), but gRPC's 4 MiB default message limit makes this practically impossible — max realistic writer count is ~2 million, far below the math.MaxInt/2 threshold. Not a bug, not worth a guard.


What's Well Done

The three-commit fix sequence (c7037bab80cfe8421c7fc) addressed every actionable item from rounds 1 and 2 without introducing new issues or scope creep:

  • Validation precedes allocation in buildBootstrapBatch — a near-cap batch with a late collision wastes no rows.
  • validateWriterBatchUniqueness nil-guard removal is paired with a clear invariant comment.
  • RotateDEK wrapped-DEK size cap is consistent with BootstrapEncryption.
  • The Validate() fail-closed contract is clearly documented, tested three ways, and enforced by the godoc on NewEncryptionAdminServer.
  • assertBootstrapWriterRows pins the two-rows-per-writer layout at the test level, so any future fsmwire.EncodeBootstrap refactor that desynchronizes the layout breaks the test before it breaks production.

Verdict: All prior findings resolved. No new issues. Ready to merge.

@bootjp bootjp merged commit 491cc79 into main May 16, 2026
9 checks passed
@bootjp bootjp deleted the feat/encryption-admin-bootstrap branch May 16, 2026 14:29
bootjp added a commit that referenced this pull request May 16, 2026
Codex P1: startRaftServers wired rt.spec.id (the Raft GROUP id)
as full_node_id, but group ids are shared by every replica of a
group (every node uses group 1 in the default config). All nodes
therefore returned the same CapabilityReport.full_node_id, and
BootstrapEncryption's validateWriterBatchUniqueness check (added
in PR #759 for the gemini HIGH on writer batches) would reject
the bootstrap with "duplicate full_node_id" — the very gate the
batch validation was meant to enforce now turns into the failure
mode that prevents the cluster from bootstrapping at all.

Fix: derive a per-node-stable uint64 from --raftId via the
canonical etcd.DeriveNodeID(id string) uint64 helper that
raftengine itself uses for peer identification (FNV-1a hash, 0
sentinel collision mapped to 1). The call site now passes
etcdraftengine.DeriveNodeID(*raftId) and the helper's parameter
is renamed shardID → fullNodeID with a docstring spelling out
the must-be-per-node-stable contract so a future refactor does
not regress to the shard-id wiring.

Caller audit: registerEncryptionAdminServer has exactly one
production caller (main.go:1292), now passing the derived
per-node id. The single test caller passes a literal uint64 (1)
which is unaffected — its purpose is to verify the gRPC
registration path, not the id-derivation correctness.

Regression test TestEncryptionAdminFullNodeID_DistinctPerRaftId
pins the canonical 3-node config (n1/n2/n3) produces three
distinct non-zero ids so a future refactor breaking the
derivation surfaces as a failing unit test rather than a
cluster-wide bootstrap deadlock.

go build, go vet, go test -race, golangci-lint all clean.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant