Skip to content

feat(encryption): Stage 6C-1 — §9.1 startup-refusal guards (flag + sidecar state)#778

Open
bootjp wants to merge 3 commits into
mainfrom
feat/encryption-stage6c
Open

feat(encryption): Stage 6C-1 — §9.1 startup-refusal guards (flag + sidecar state)#778
bootjp wants to merge 3 commits into
mainfrom
feat/encryption-stage6c

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented May 17, 2026

Summary

Stage 6C-1 per the PR #762 plan. Decomposes the original monolithic Stage 6C into four sub-milestones (6C-1 → 6C-4) per dependency footprint, then ships 6C-1: the flag + sidecar-state §9.1 startup-refusal guards.

The Stage 6B-2 encryptionMutatorsEnabled triple gate is the RPC-boundary safety net; this PR adds the process-boundary safety net on top.

Stage 6C sub-decomposition (design doc update)

The original Stage 6C bundled guards with three different dependency footprints. Shipping the load-bearing downgrade-prevention guard (ErrSidecarPresentWithoutFlag) alone would have been blocked on Stage 7's writer-registry work (the local_epoch rollback peer). The sub-decomposition lets 6C-1 ship now and unblocks Stage 6D once 6C-1 + 6C-2 land:

Sub-milestone Scope Status
6C-1 (this PR) Flag + sidecar-state guards this PR
6C-2 Raftengine integration (sidecar/raft index gap, fsync probe, local_epoch exhaustion) open
6C-3 Membership-view guards (node_id collision, local_epoch rollback) — bundles with Stage 6D open
6C-4 Phase-2-specific guards — bundles with Stage 6E open

What this PR ships

Three new §9.1 startup-refusal sentinels

Sentinel Catches
ErrSidecarPresentWithoutFlag Sidecar on disk but --encryption-enabled off — would silently downgrade new writes to cleartext
ErrKEKRequiredWithFlag --encryption-enabled set but no --kekFile — flag-on / KEK-off is misconfigured at the operator-intent level
ErrKEKMismatch KEK loaded but at least one wrapped DEK fails to unwrap — classic "wrong --kekFile from a different cluster" error

ErrKEKMismatch annotates the offending key_id and purpose so runbooks can grep purpose="raft" / purpose="storage" to identify the failing DEK.

internal/encryption/startup.go (new)

type StartupConfig struct {
    EncryptionEnabled bool
    KEKConfigured     bool
    KEK               KEKUnwrapper
    SidecarPath       string
}

func CheckStartupGuards(cfg StartupConfig) error

Reads the sidecar at most once and is safe to call before any other encryption package state is constructed. Does NOT mutate on-disk state.

main.go

loadKEKAndRunStartupGuards() pairs the KEK load with CheckStartupGuards in a single helper (cyclop budget). Runs BEFORE buildShardGroups.

Posture matrix (before / after this PR)

Operator config pre-this-PR post-this-PR
flag off, no sidecar boot (cleartext) boot (cleartext)
flag off, sidecar present boot, silently downgrade new writes refuse: ErrSidecarPresentWithoutFlag
flag on, no KEK boot, mutators refused at RPC gate refuse: ErrKEKRequiredWithFlag
flag on, KEK loaded, no sidecar yet boot (waiting for bootstrap) boot (waiting for bootstrap)
flag on, KEK loaded, sidecar matches boot, encrypted boot, encrypted
flag on, KEK loaded, sidecar wrapped under different KEK boot, first write corrupts refuse: ErrKEKMismatch

Caller audit (semantic changes)

No public function signatures changed. CheckStartupGuards / StartupConfig are net-additive. loadKEKAndRunStartupGuards is new and pairs with loadKEKWrapperFromFlag (which stays callable on its own).

Five-lens self-review

  1. Data loss — net-positive: refuses to boot in postures that would silently downgrade or land a HaltApply. No write path touched.
  2. Concurrency / distributed failures — runs at process startup before any goroutine fan-out / Raft engine start. ReadSidecar is side-effect-free.
  3. Performance — single os.Stat + one ReadSidecar (only when both --encryption-enabled and a sidecar exist) per process start. Zero hot-path impact.
  4. Data consistencyErrKEKMismatch catches the most severe operator-error footgun in the §9.1 set: wrong KEK would route writes to a DEK the prior wrapped material cannot decrypt.
  5. Test coverage — 7 sub-cases in TestCheckStartupGuards_* cover all three sentinels in fire + pass states, I/O error propagation, and the per-key purpose=raft error annotation that runbooks grep for.

Test plan

  • go test -race -timeout=120s ./internal/encryption/... . — PASS
  • go build ./... — PASS
  • golangci-lint run . ./internal/encryption/... — 0 issues
  • Full Jepsen suite — not run; no envelope cutover happens until Stage 6D and this PR only adds startup refusal logic that runs before any encryption-relevant Raft entry

Plan

Next Stage 6C sub-milestones per the design-doc update in this PR:

  • 6C-2 — raftengine integration guards (ErrSidecarBehindRaftLog, ErrUnsupportedFilesystem startup probe, ErrLocalEpochExhausted)
  • 6C-3 — membership-view guards (ErrNodeIDCollision, ErrLocalEpochRollback) — bundles with Stage 6D
  • 6C-4 — Phase-2-specific guards — bundles with Stage 6E

Summary by CodeRabbit

  • New Features

    • Added encryption configuration validation during startup to prevent invalid states.
    • Added early detection of encryption misconfigurations (disabled encryption with existing sidecar, enabled encryption without key file, key/sidecar mismatches).
  • Documentation

    • Updated encryption implementation roadmap with refined Stage 6 milestone decomposition and dependency documentation.
  • Tests

    • Added comprehensive test coverage for encryption startup validation scenarios.

Review Change Stack

… + sidecar state)

Stage 6C-1 per the PR #762 plan, decomposed in this PR's design-doc
update from the original monolithic Stage 6C into four sub-milestones
(6C-1 through 6C-4) per dependency footprint:

- 6C-1 (this PR): flag + sidecar-state guards
- 6C-2: raftengine-integration guards (sidecar/raft index gap,
  fsync probe, local_epoch exhaustion)
- 6C-3: membership-view guards (node_id collision, local_epoch
  rollback) — bundles with Stage 6D capability fan-out
- 6C-4: Phase-2-specific guards (cutover divergence,
  enable-raft-envelope-before-bootstrap, RPC local_epoch range)
  — bundles with Stage 6E Phase-2 cutover

## What this PR ships

### internal/encryption/startup.go (new)

CheckStartupGuards(StartupConfig) reports the first §9.1 refusal
condition that fires:

- ErrSidecarPresentWithoutFlag — data dir contains a sidecar
  (keys.json) but --encryption-enabled is NOT set. Continuing
  would silently downgrade new writes to cleartext while leaving
  the prior wrapped DEKs untouched on disk, inverting operator
  intent. Refuse to start rather than half-honor a prior bootstrap.

- ErrKEKRequiredWithFlag — --encryption-enabled set but no
  --kekFile supplied. The Stage 6B-2 triple gate keeps mutator
  RPCs unreachable in this posture, but a flag-on / KEK-off node
  is misconfigured at the operator-intent level. Fail fast at
  startup rather than discover the mismatch via a halted apply
  loop later.

- ErrKEKMismatch — flag on, KEK loaded, sidecar present, at
  least one wrapped DEK fails to unwrap under the configured KEK.
  The classic operator error: --kekFile points at a KEK from a
  different cluster / environment. Annotates the offending key_id
  and purpose so the operator's log line points at the right DEK
  for root cause.

The helper reads the sidecar at most once and is safe to call
before any other encryption package state is constructed; it does
NOT mutate on-disk state.

### internal/encryption/errors.go

Three new sentinels:

- ErrSidecarPresentWithoutFlag
- ErrKEKRequiredWithFlag
- ErrKEKMismatch

All carry a self-contained error message that names the relevant
flag and the recovery path (consistent with §9.1's "Each refusal
logs a single, unambiguous error pointing at the relevant flag
and runbook section").

### main.go

loadKEKAndRunStartupGuards() pairs the KEK load with
CheckStartupGuards in a single helper so run() stays under the
cyclop budget. The guards run BEFORE buildShardGroups constructs
any Raft engine or storage state, so a misconfigured node refuses
to boot rather than halting mid-flight or — worse — silently
downgrading encrypted data to cleartext.

### Stage 6C sub-decomposition rationale (design doc)

The original monolithic Stage 6C bundled guards with three
different dependency footprints (flags+sidecar, raftengine
integration, cluster-wide membership view) plus Phase-2-specific
guards. Shipping the load-bearing downgrade-prevention guard
(6C-1) alone would have been blocked on Stage 7's writer-registry
work (the local_epoch rollback peer of ErrLocalEpochExhausted),
which is several PRs out. The sub-decomposition lets 6C-1 ship
now and unblocks Stage 6D once 6C-1 + 6C-2 have landed.

## Caller audit (semantic changes — signatures)

No public function signatures changed. The new
CheckStartupGuards / StartupConfig types are net-additive.

loadKEKAndRunStartupGuards is new; loadKEKWrapperFromFlag is
unchanged and still callable on its own (the helper calls it
internally).

## Stage 6B-2 RPC gate posture preserved

The Stage 6B-2 triple gate (encryptionMutatorsEnabled) remains
the RPC-boundary safety net. This PR adds a process-boundary
safety net on top of it. Both layers exist by design:

- The RPC gate keeps unreachable mutator paths unreachable when
  the operator has chosen NOT to participate in the rollout.
- The startup gate keeps misconfigured nodes from booting at all
  when the operator HAS opted in but the on-disk state or KEK
  config does not match.

A flag-on / KEK-off node:
- pre-this-PR: would boot, mutator RPCs refused at the RPC gate
- post-this-PR: refuses to boot (ErrKEKRequiredWithFlag)

A flag-off / sidecar-present node:
- pre-this-PR: would boot, silently downgrading new writes to cleartext
- post-this-PR: refuses to boot (ErrSidecarPresentWithoutFlag)

## Five-lens self-review

1. Data loss — the guards are net-positive for data safety: they
   refuse to boot in postures that would otherwise silently
   downgrade encrypted data to cleartext or land a HaltApply.
   No write path touched.

2. Concurrency / distributed failures — the guards run at
   process startup before any goroutine fan-out, before the
   Raft engine starts. No new shared state; ReadSidecar is
   side-effect-free.

3. Performance — single os.Stat + one ReadSidecar (only when
   --encryption-enabled is set AND a sidecar exists on disk)
   per process start. Zero hot-path impact.

4. Data consistency — ErrKEKMismatch catches the operator
   error where the wrong KEK would route writes to a DEK that
   the prior wrapped material cannot be decrypted from. This is
   the most severe data-consistency footgun in the §9.1 guard
   set; catching it at startup prevents the run-and-corrupt
   pattern.

5. Test coverage — 7 sub-cases in TestCheckStartupGuards_*
   cover all three sentinels in both fire and pass states,
   plus the I/O error propagation path and the per-key
   error annotation (purpose=raft) that runbooks grep for.

## Verification

- go test -race -timeout=120s ./internal/encryption/... . — PASS
- go build ./... — PASS
- golangci-lint run . ./internal/encryption/... — 0 issues
- Full Jepsen suite — not run (no envelope cutover happens until
  Stage 6D; this PR only adds startup refusal logic that runs
  before any encryption-relevant Raft entry)

## Plan

Next Stage 6C sub-milestones per the design-doc update in this
PR:

- 6C-2: raftengine integration guards (ErrSidecarBehindRaftLog,
  ErrUnsupportedFilesystem startup probe, ErrLocalEpochExhausted)
- 6C-3: membership-view guards (ErrNodeIDCollision,
  ErrLocalEpochRollback) — bundles with Stage 6D
- 6C-4: Phase-2-specific guards — bundles with Stage 6E
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 17, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 49 minutes and 10 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 69e8ef32-ae05-44b6-bc6c-b87e834025dc

📥 Commits

Reviewing files that changed from the base of the PR and between 5b55020 and a78b660.

📒 Files selected for processing (3)
  • docs/design/2026_04_29_partial_data_at_rest_encryption.md
  • internal/encryption/startup.go
  • internal/encryption/startup_test.go
📝 Walkthrough

Walkthrough

This PR implements startup validation guards for encryption configuration. It adds error contracts, guard orchestration logic with three sequential checks (sidecar-without-flag, KEK-required, KEK-sidecar-mismatch), comprehensive test coverage across allowed and disallowed startup modes, and integrates the guard into main.go process startup before shard group construction.

Changes

Encryption Startup Guard Policy

Layer / File(s) Summary
Startup Guard Error Contracts
internal/encryption/errors.go
Three exported sentinel errors define failure modes: ErrSidecarPresentWithoutFlag when on-disk sidecar exists but encryption disabled, ErrKEKRequiredWithFlag when encryption enabled but no KEK source provided, ErrKEKMismatch when configured KEK cannot unwrap one or more wrapped DEKs in sidecar.
Startup Guard Logic and Config
internal/encryption/startup.go
StartupConfig struct parameterizes guard checks with encryption flags, KEK presence, loaded wrapper, and sidecar path. CheckStartupGuards orchestration function probes sidecar once and sequentially runs three guards, returning first error. Nested guardSidecarWithoutFlag, guardKEKRequired, and guardKEKMatchesSidecar handle filesystem probing, flag validation, and DEK unwrap validation with context annotation.
Startup Guard Test Coverage
internal/encryption/startup_test.go
Test helpers writeTestSidecar and newTestKEK build deterministic sidecars and file-backed KEK wrappers. Test suite validates six allowed and disallowed startup modes: encryption disabled/no sidecar, enabled with KEK/no sidecar, enabled with KEK and matching wrapped DEKs, sidecar present without flag rejection, KEK missing rejection, KEK-sidecar mismatch detection for storage and raft purposes independently, and sidecar stat error propagation.
Process Startup Integration
main.go
run() calls new loadKEKAndRunStartupGuards() helper instead of directly loading KEK. Helper loads file-backed KEK, runs encryption.CheckStartupGuards with current --encryption-enabled, --kekFile, and --encryptionSidecarPath configuration, and returns loaded wrapper on success or wrapped error on refusal. Process startup fails before shard groups are constructed if guards reject configuration.
Design Doc: Stage 6C Startup Guard Decomposition
docs/design/2026_04_29_partial_data_at_rest_encryption.md
Stage 6C "startup refusal guards" milestone is decomposed into four sub-milestones (6C-1 through 6C-4) with explicit guard/error coverage and dependency notes. New rationale section explains guard dependencies (flags/sidecar state vs raft integration vs membership view), why bundling all prevents prerequisite work, shipping order (6C-1 → 6C-2 → 6D with 6C-3 bundled → 6E with 6C-4 bundled), and dependency graph.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • bootjp/elastickv#719: Related through foundational internal/encryption error definitions in the same file.
  • bootjp/elastickv#776: Related through KEK wiring in Stage 6B-2; main.go's startup guard integration reuses the existing KEK loader flow introduced in that PR.
  • bootjp/elastickv#760: Related through --encryptionSidecarPath gating encryption startup behavior; both PRs connect through sidecar/flag startup validation.

Poem

🐰 Guards at startup, standing tall,
Checking sidecars, KEKs through all,
No mismatches slip on by,
Before shard groups reach the sky! 🔐✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically identifies the main change: implementing Stage 6C-1 startup-refusal guards for encryption (flag + sidecar state).
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/encryption-stage6c

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 17, 2026

@claude review

Stage 6C-1 opening review. Scope: three §9.1 startup-refusal guards (ErrSidecarPresentWithoutFlag, ErrKEKRequiredWithFlag, ErrKEKMismatch) + sub-decomposition of the original monolithic Stage 6C into 6C-1 → 6C-4 per dependency footprint. Built on top of just-merged Stage 6B-2 (PR #776).

Key safety properties to validate:

  • Downgrade prevention: flag-off / sidecar-present must refuse to boot (not silently degrade to cleartext)
  • Fail-fast: flag-on / KEK-off must refuse to boot (not boot with mutators disabled and confuse on-callers)
  • Wrong-KEK detection: KEK loaded but unwrap-mismatch must refuse to boot with per-key annotation (root-cause)
  • The startup gate is layered ON TOP OF the Stage 6B-2 RPC-boundary gate, not replacing it
  • All three new sentinels carry self-contained error messages naming the relevant flag + recovery path

The sub-decomposition is also up for review (design doc Stage 6 rationale).

@claude
Copy link
Copy Markdown

claude Bot commented May 17, 2026

Claude finished @bootjp's task in 3m 20s —— View job


Stage 6C-1 Review

  • Gather context and read changed files
  • Review internal/encryption/errors.go — sentinel definitions
  • Review internal/encryption/startup.go — guard logic
  • Review internal/encryption/startup_test.go — test coverage
  • Review main.go integration
  • Review design doc sub-decomposition
  • Post findings

Summary

The three §9.1 guards are correctly implemented and the integration in main.go runs at the right point (before buildShardGroups). The core safety properties all hold. One doc bug (duplicate 6D row), one non-determinism concern for multi-DEK mismatch errors, and a few minor test-quality nits.


[BUG] Duplicate 6D row in the design doc

docs/design/2026_04_29_partial_data_at_rest_encryption.md lines 27–28 contain two rows for 6D. The old row ("depends on 6B for mutator wiring AND 6C for §9.1...") was not removed after the new one ("bundles 6C-3 for the membership-view-dependent collision check") was added:

| 6D | ... bundles 6C-3 for the membership-view-dependent collision check — see rationale) | open | — |
| 6D | ... depends on 6B for mutator wiring AND 6C for §9.1 startup-refusal guards — see rationale) | open | — |

The second row is a stale copy-paste artifact. Fix this →


[MEDIUM] Non-deterministic error annotation when multiple DEKs fail guardKEKMatchesSidecar

startup.go:175 iterates sc.Keys (a map[string]SidecarKey) with no ordering:

for idStr, k := range sc.Keys {

If two or more wrapped DEKs fail to unwrap, the reported key_id / purpose in the ErrKEKMismatch annotation is non-deterministic (whichever the map lands on first). This is a wrong-KEK scenario, so the guard fires correctly regardless — but a node that restarts twice can log different key_id values for the same misconfiguration, which makes runbook correlation harder.

The fix is to collect all failing IDs and either report all of them or iterate in sorted key_id order. For Stage 6C-1 the sidecar normally has exactly two entries (storage + raft), but it's worth locking this down since the error message is documented as a runbook grep target. Fix this →


Guard logic — all three properties verified ✓

Downgrade prevention (ErrSidecarPresentWithoutFlag):
startup.go:131if cfg.EncryptionEnabled || !sidecarPresent { return nil }. Fires exactly when flag is off AND sidecar is present. ✓

Fail-fast (ErrKEKRequiredWithFlag):
startup.go:144–150 — Short-circuits on !EncryptionEnabled; fires only when EncryptionEnabled && !KEKConfigured. The 6B-2 RPC gate and this startup gate are complementary: the RPC gate prevents a running node from accepting mutators, this gate prevents it from booting at all. ✓

Wrong-KEK detection (ErrKEKMismatch):
startup.go:168!cfg.EncryptionEnabled || !sidecarPresent || cfg.KEK == nil is the correct early-return trifecta: flag-off path is caught by guard 1, empty-sidecar path doesn't fire, nil-KEK path is caught by guard 2. The Unwrap error propagates the purpose annotation. ✓

Guard ordering: The sequence guardSidecarWithoutFlag → guardKEKRequired → guardKEKMatchesSidecar is correct. The downgrade case is the most critical (fires first). The KEK-required case can only reach guard 3 if KEKConfigured=true, which implies KEK != nil from loadKEKAndRunStartupGuards's construction invariant.


main.go integration ✓

loadKEKAndRunStartupGuards (line 824) correctly:

  • Calls loadKEKWrapperFromFlag first so load failures abort before any guard check
  • Constructs StartupConfig using actual flag values (*encryptionEnabled, *kekFile != "", kekWrapper, *encryptionSidecarPath)
  • Returns nil, err when CheckStartupGuards fires, which surfaces to run() and exits before buildShardGroups

The renaming of loadKEKWrapperFromFlag()loadKEKAndRunStartupGuards() in the run() call site is clean.


sidecarOnDisk I/O error handling ✓

startup.go:119errors.Is(err, os.ErrNotExist) distinguishes file-not-found from genuine I/O errors (permission denied, path traversal failure, NUL byte). The TestCheckStartupGuards_SidecarStatError test covers this path. ✓


Test coverage — minor gaps

TestCheckStartupGuards_KEKMismatch_RaftOnly is deterministic despite map iteration randomness: only the raft DEK is wrapped under wrongKEK, so regardless of which key is visited first, the raft DEK will fail and annotate purpose="raft". ✓

Missing: "sidecar present but all wrapped fields empty" — The comment in guardKEKMatchesSidecar mentions the "freshly created empty sidecar" case (empty Wrapped bytes are skipped via len(k.Wrapped) == 0 { continue }), but there's no test covering a sidecar file that exists with keys having zero-length Wrapped. Adding one would lock down the guard's "bootstrap not yet committed" early-pass behavior. Not blocking, but worth adding.

contains helper (startup_test.go:254) reimplements strings.Contains to "avoid cluttering the import diff." strings is stdlib; just use it. Fix this →

newTestKEK filename encoding (startup_test.go:55) uses string(rune(seed)) for potentially non-ASCII seeds (0xAA → 2-byte UTF-8 path component). Works on Linux, but fmt.Sprintf("kek-%02x.bin", seed) is clearer and avoids the implicit byte→rune→string narrowing.


Design doc sub-decomposition ✓

The 6C-1 → 6C-4 split along dependency footprint is well-justified. The blocking relationship (6C-1 + 6C-2 unblocks 6D, 6C-3 bundles with 6D's membership fan-out, 6C-4 bundles with 6E's Phase-2 cutover) follows naturally from the implementation dependencies. Shipping the load-bearing downgrade-prevention guard (ErrSidecarPresentWithoutFlag) now without waiting for the Stage 7 writer-registry work is the right call.

One concern: the rationale section at line 143–156 of the doc correctly describes the decomposition, but the duplicate 6D row in the table (see BUG above) will cause confusion when someone reads the doc in isolation.


Required before merge

  • Remove duplicate 6D row from the design doc milestone table

Recommended (non-blocking)

  • Sort sc.Keys iteration in guardKEKMatchesSidecar for deterministic error annotation
  • Add test for "sidecar present, keys present but all Wrapped == nil/empty" early-pass path
  • Replace custom contains with strings.Contains
  • Use fmt.Sprintf("kek-%02x.bin", seed) in newTestKEK

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/design/2026_04_29_partial_data_at_rest_encryption.md`:
- Around line 27-28: Duplicate Stage 6D rows (both referencing
`enable-storage-envelope`, §7.1 Phase-1 storage cutover and the Voters ∪
Learners capability gate) conflict because one lists dependencies as `6C-1+6C-2`
and the other as `6C`; remove the outdated/duplicate row that uses `6C-1+6C-2`
and keep the updated row that uses `6C` so only a single Stage 6D entry remains
(preserving the `enable-storage-envelope`, §7.1, and (§6.2 toggle ON) wording).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e0b79fa8-307c-4a8d-a260-1acd1b9bb35e

📥 Commits

Reviewing files that changed from the base of the PR and between e9b015c and 5b55020.

📒 Files selected for processing (5)
  • docs/design/2026_04_29_partial_data_at_rest_encryption.md
  • internal/encryption/errors.go
  • internal/encryption/startup.go
  • internal/encryption/startup_test.go
  • main.go

Comment thread docs/design/2026_04_29_partial_data_at_rest_encryption.md Outdated
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements the first phase of the encryption startup refusal guards (Stage 6C-1) as outlined in the updated design documentation. It introduces several safety checks to prevent silent downgrades to cleartext, ensure a KEK is provided when encryption is enabled, and verify that the configured KEK can successfully unwrap existing Data Encryption Keys (DEKs). These guards are integrated into the main startup sequence to fail fast before the Raft engine or storage state is initialized. Feedback suggests improving error reporting when parsing key IDs in the sidecar to provide better context during troubleshooting.

Comment on lines +180 to +187
id, parseErr := strconv.ParseUint(idStr, 10, 32)
if parseErr != nil {
// validateSidecar would have caught a non-decimal
// key_id, but be defensive in case ReadSidecar's
// invariant changes.
return pkgerrors.Wrapf(ErrKEKMismatch,
"sidecar=%q key_id=%q: %v", cfg.SidecarPath, idStr, err)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

When strconv.ParseUint fails, the error returned should ideally include the parsing error (parseErr) to provide more context about why the key_id could not be processed, rather than just the Unwrap error (err). While validateSidecar is expected to catch malformed IDs, being explicit here aids in troubleshooting manual sidecar edits.

id, parseErr := strconv.ParseUint(idStr, 10, 32)
			if parseErr != nil {
				return pkgerrors.Wrapf(ErrKEKMismatch,
					"sidecar=%q key_id=%q (parse error: %v): %v", cfg.SidecarPath, idStr, parseErr, err)
			}

…icate Stage 6D row)

Claude bot flagged "Required before merge" and coderabbit
flagged Minor on the same finding: the milestone table in
docs/design/2026_04_29_partial_data_at_rest_encryption.md
had two Stage 6D rows with conflicting dependency wording.

  - Line 27 (kept): "depends on 6B AND 6C-1+6C-2 for §9.1
    guards; bundles 6C-3 for membership-view collision check"
    — the updated wording reflecting the 6C-1..6C-4 split.
  - Line 28 (removed): "depends on 6B AND 6C for §9.1 guards"
    — the older monolithic-6C wording, stale after this PR's
    sub-decomposition.

The duplicate would cause rollout-order ambiguity for any
reader using the table as a checklist (Claude's "concern"
in the design-doc review). Deleting the stale row resolves
both findings; the surviving row is the correct one and the
narrative in the "Stage 6 sub-milestone rationale" section
matches it.

Docs-only change. No code paths touched, no caller audit
needed.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 17, 2026

@claude review

Round-1 (commit f69f13f5): addressed Claude r1 "Required before merge" + coderabbit minor on the same finding.

Finding (claude required-before-merge + coderabbit minor)

Remove duplicate/conflicting Stage 6D row.
Line 27 and Line 28 both define Stage 6D with different dependency wording.

Resolution: deleted line 28 (depends on 6B AND 6C — the older monolithic-6C wording, stale after the 6C-1..6C-4 split this PR introduces). Kept line 27 (depends on 6B AND 6C-1+6C-2; bundles 6C-3 — the updated wording).

Docs-only change. Caller audit not applicable.

Non-blocking items deferred

The following Claude r1 "Recommended (non-blocking)" items remain open:

  • Sort sc.Keys iteration in guardKEKMatchesSidecar for deterministic error annotation (test TestCheckStartupGuards_KEKMismatch_RaftOnly is already deterministic per Claude's analysis; the concern is only about which purpose field the error reports when both raft and storage DEKs are wrapped under a wrong KEK).
  • Add test for "sidecar present, keys present but all Wrapped == nil/empty" early-pass path.
  • Replace custom contains helper with strings.Contains.
  • Use fmt.Sprintf("kek-%02x.bin", seed)" in newTestKEK`.

Per the loop's P0/P1/high/critical/major scope these are below threshold. Happy to address in a follow-up if reviewers prefer them in this PR.

@claude
Copy link
Copy Markdown

claude Bot commented May 17, 2026

Claude finished @bootjp's task in 1m 37s —— View job


Stage 6C-1 Review — Round 2

  • Verify duplicate 6D row fix in design doc
  • Verify current state of internal/encryption/startup.go
  • Assess deferred non-blocking items
  • Post r2 findings

Required fix (r1) — confirmed resolved ✓

Commit f69f13f removes the stale row correctly. The design-doc milestone table now has exactly one 6D entry:

| 6D | ... depends on 6B for mutator wiring AND 6C-1+6C-2 for §9.1
     startup-refusal guards; bundles 6C-3 ...

The old "depends on 6B AND 6C" row (monolithic-6C wording) is gone. No structural issues with the rationale section or the sub-decomposition prose. ✓


Deferred non-blocking items — current state

All four items are still open per bootjp's note above. Status:

Item File Status
Sort sc.Keys map iteration for deterministic ErrKEKMismatch annotation startup.go:175 Open (deferred)
Test "sidecar present, all Wrapped == nil/empty" early-pass path startup_test.go Open (deferred)
Replace custom contains helper with strings.Contains startup_test.go:254 Open (deferred)
fmt.Sprintf("kek-%02x.bin", seed) in newTestKEK startup_test.go:53 Open (deferred)

The deferral reasoning holds. The TestCheckStartupGuards_KEKMismatch_RaftOnly test is still deterministic in practice (only the raft DEK is wrapped under wrongKEK, so map randomness cannot change which key fires). The map-iteration concern only surfaces when both DEKs are wrapped under the wrong KEK — a scenario not yet tested. Acceptable to defer.


Gemini's strconv.ParseUint parseErr comment — not yet addressed

startup.go:185:

return pkgerrors.Wrapf(ErrKEKMismatch,
    "sidecar=%q key_id=%q: %v", cfg.SidecarPath, idStr, err)

When parseErr != nil, the format includes err (the Unwrap error) but not parseErr itself. Gemini suggested adding : %v for parseErr. This path is guarded by the comment "validateSidecar would have caught a non-decimal key_id" — the format already includes the raw key_id=%q string so an operator can still root-cause manually. No data loss, no correctness concern. Minor omission in an already-defensive branch.

Whether to fix now or in a follow-up is the author's call. If deferred, add it to the same follow-up batch as the four items above.


Overall assessment

The r1 required fix is clean. Core safety properties remain intact:

  • Downgrade prevention (ErrSidecarPresentWithoutFlag): startup.go:131 — fires when flag is off AND sidecar is present. ✓
  • Fail-fast (ErrKEKRequiredWithFlag): startup.go:144–150 — fires when flag is on AND no KEK source. ✓
  • Wrong-KEK detection (ErrKEKMismatch): startup.go:168 — correct trifecta early-return; per-purpose annotation on error. ✓
  • Guard ordering and main.go integration (before buildShardGroups) unchanged and correct. ✓

Ready to merge. The four deferred non-blocking items + the parseErr omission can land in a follow-up cleanup commit alongside 6C-2.

… test polish

PR #778 round-1 code-side fixes (the docs-side fix for the
duplicate Stage 6D row was already addressed in f69f13f).

## claude r1 [MEDIUM] — Non-deterministic ErrKEKMismatch annotation

guardKEKMatchesSidecar iterated sc.Keys (a Go map) without
sorting, so when multiple wrapped DEKs failed to unwrap the
reported key_id / purpose in the ErrKEKMismatch annotation was
non-deterministic across restarts. The classic operator scenario
(wrong --kekFile entirely) makes EVERY wrapped DEK fail, so the
first failure picked by map iteration was effectively random.

Fix: sortedSidecarKeyIDs() returns the keys in ascending numeric
key_id order. Iteration now always picks the smallest key_id
that fails, which gives reproducible runbook log correlation
across restarts. The Stage 5.1 sidecar invariant
(validateSidecar requires decimal-uint32 keys) lets us sort by
ParseUint; a malformed entry is sorted to the end via the
math.MaxUint32 sentinel so the iteration order stays defined
even if invariant drift lets a bad entry through.

TestCheckStartupGuards_KEKMismatch_DeterministicAnnotation pins
the regression: a sidecar with storage=1 + raft=2 both wrapped
under the wrong KEK MUST always annotate key_id=1 / purpose=storage,
verified across 10 iterations to catch the randomized map order.

## claude r1 [Recommended] — Custom contains helper

Replaced custom `contains` with `strings.Contains`. The original
"avoid cluttering the import diff" reasoning was wrong-headed:
strings is stdlib and the substring assertion is the only place
in the file that needs anything from it.

## claude r1 [Recommended] — newTestKEK filename encoding

Replaced `string(rune(seed))` with `fmt.Sprintf("kek-%02x.bin", seed)`.
The original encoding produced 2-byte UTF-8 path components for
non-ASCII seeds (0xAA → "kek-\xc2\xaa.bin"); the hex encoding
keeps the on-disk test artifact names predictable and readable.

## claude r1 [Recommended] — Empty-Wrapped early-pass test

TestCheckStartupGuards_EmptyWrappedSkipped locks down the godoc
claim that the guard "Returns nil ... when ... the sidecar exists
but has no wrapped DEKs". The test writes a sidecar with two
SidecarKey entries that BOTH have empty Wrapped bytes (one nil,
one empty slice for variation) and asserts the guard returns nil
instead of calling KEK.Unwrap on zero bytes (which would fire a
misleading ErrKEKMismatch).

## Caller audit (semantic changes)

sortedSidecarKeyIDs is new, pure, no external callers yet.
guardKEKMatchesSidecar internal signature unchanged. The
ErrKEKMismatch error annotation gains determinism but the
overall fail-closed semantic is unchanged — callers that only
inspect errors.Is(err, ErrKEKMismatch) see no change in behaviour.

## Verification

- go test -race -timeout=120s ./internal/encryption/... — PASS
- go build ./... — PASS
- golangci-lint run ./internal/encryption/... — 0 issues
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 17, 2026

@claude review

Round-2 (commit a78b6606): round-1 code-side fixes (the design-doc fix landed in f69f13f5).

claude r1 [MEDIUM] addressed

guardKEKMatchesSidecar now iterates sc.Keys in ascending numeric key_id order via the new sortedSidecarKeyIDs helper. Multi-failure ErrKEKMismatch annotations are now reproducible across restarts.

TestCheckStartupGuards_KEKMismatch_DeterministicAnnotation pins the regression: storage=1 + raft=2 both wrapped under the wrong KEK MUST always annotate key_id=1 / purpose=storage, verified across 10 iterations to flush Go's randomized map order.

claude r1 [Recommended] addressed

  • Replaced custom contains with strings.Contains (startup_test.go)
  • newTestKEK filename now fmt.Sprintf("kek-%02x.bin", seed) instead of string(rune(seed)) — keeps non-ASCII seeds from producing 2-byte UTF-8 path components
  • New TestCheckStartupGuards_EmptyWrappedSkipped locks down the "sidecar present, all wrapped fields empty" early-pass behaviour (godoc claim is now test-enforced)

Caller audit

sortedSidecarKeyIDs is new, pure, no external callers. guardKEKMatchesSidecar internal signature unchanged. Fail-closed semantics preserved: callers that only inspect errors.Is(err, ErrKEKMismatch) see no behaviour change.

@claude
Copy link
Copy Markdown

claude Bot commented May 17, 2026

Claude finished @bootjp's task in 1m 8s —— View job


Stage 6C-1 Review — Round 3

  • Gather context
  • Read internal/encryption/startup.go — verify sortedSidecarKeyIDs + deterministic iteration
  • Read internal/encryption/startup_test.go — verify new tests + strings.Contains + filename fix
  • Verify Gemini parseErr finding status
  • Post findings

r1 MEDIUM — confirmed resolved ✓

sortedSidecarKeyIDs (startup.go:213–231) is correctly implemented:

  • Numeric sort via strconv.ParseUint(_, 10, 32) — avoids lexicographic order ("10" < "2" problem with string sort) ✓
  • Malformed key IDs sort to the end via ^uint32(0) sentinel cast to uint64 — defined order even if validateSidecar's invariant drifts ✓
  • guardKEKMatchesSidecar now does k := sc.Keys[idStr] after sorting (line 184), which is correct ✓
  • The comment at line 163–165 cites the PR#778 root cause — good runbook breadcrumb ✓

TestCheckStartupGuards_KEKMismatch_DeterministicAnnotation (startup_test.go:265–306):

  • 10-iteration loop flushes Go's randomized map order reliably ✓
  • Verifies both key_id=1 AND purpose="storage" on every iteration ✓
  • Comment explains the adversarial construction: storage=1 / raft=2 so insertion-order iteration would "accidentally" pass some runs ✓

r1 Recommended — confirmed resolved ✓

Item Location Verified
strings.Contains replaces custom helper startup_test.go:8 (import), :252, :299, :302, :303 ✓ — contains helper is gone
fmt.Sprintf("kek-%02x.bin", seed) in newTestKEK startup_test.go:59 ✓ — non-ASCII seeds are safe
TestCheckStartupGuards_EmptyWrappedSkipped startup_test.go:318–344 ✓ — see note below

TestCheckStartupGuards_EmptyWrappedSkipped note: The test populates both nil (key "3") and []byte{} (key "4") to cover both empty representations, which is more thorough than a single empty-slice case. Active fields stay at zero so validateSidecar accepts the structurally valid-but-unbootstrapped sidecar. The len(k.Wrapped) == 0 guard in guardKEKMatchesSidecar handles both cases identically — the test locks that down correctly.


Gemini parseErr finding — still deferred, still acceptable

startup.go:194–195 — when strconv.ParseUint fails in the defensive branch, the returned error wraps ErrKEKMismatch with key_id=%q (the raw string) but omits parseErr itself. The raw string is included in the annotation, so an operator can still root-cause. The deferral is reasonable given the "validateSidecar would have caught this" invariant comment.

If this ships as a follow-up cleanup, add it to the same batch as the 6C-2 commit to avoid a one-liner PR.


No new issues from r2 changes

The sortedSidecarKeyIDs helper is pure (no side effects, no globals) and the type dance (uint32 constant → uint64 comparison) is correct. Guard semantics, the main.go integration, and the error sentinel chain are unchanged from r1. ✓


Ready to merge. All r1 required + recommended items are resolved. The parseErr annotation omission remains the one open minor item; the remaining follow-up batch from r1 is empty.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant