Skip to content

pos consensus: remove dormant Recovery state machine#3457

Open
peilun-conflux wants to merge 1 commit intoConflux-Chain:masterfrom
peilun-conflux:pos-absorb-recovery-into-normal
Open

pos consensus: remove dormant Recovery state machine#3457
peilun-conflux wants to merge 1 commit intoConflux-Chain:masterfrom
peilun-conflux:pos-absorb-recovery-into-normal

Conversation

@peilun-conflux
Copy link
Copy Markdown
Contributor

@peilun-conflux peilun-conflux commented Apr 24, 2026

RoundProcessor::Recovery + RecoveryManager + LedgerRecoveryData + BlockStore::fast_forward_sync exist to reconstruct consensus state when RecoveryData::new can't locate a root block matching the latest committed ledger info. That scenario has no remaining trigger:

  • Fresh-genesis startup already goes to Normal. maybe_bootstrap writes a genesis LedgerInfo with next_epoch_state = Some(...), so find_root takes its ends_epoch() branch and synthesises a virtual genesis block + QC from the ledger info alone.
  • Normal running-node restarts also go to Normal. The write path is save_tree(block+qc) → execute → commit(ledger) → prune, and prune_tree preserves the root. On any clean crash/restart the root is in ConsensusDB and find_root resolves it directly.
  • The residual "state-sync ran ahead of consensus" case from upstream Diem was neutralised in Migrate parity-secp256k1 to secp256k1 0.30 #3438 (397cf34c). state_computer.sync_to became a no-op, so fast_forward_sync can only write blocks to ConsensusDB — strictly weaker than Normal's sync path, which executes and commits.

What Recovery would technically still fire on is on-disk corruption where ConsensusDB is missing the block PosLedgerDB's latest ledger info points at. That combination isn't reachable through normal operation, and the sync_to-neutered recovery attempt couldn't repair the ledger even if we took it. A panic on that condition is strictly more diagnostic than silently stalling in Recovery.

Follow-on removals fall out once Recovery is gone: StateComputer::sync_to + StateSyncError + LogEvent::StateSync had no remaining callers.


This change is Reviewable

`RoundProcessor::Recovery` and its machinery exist to reconstruct consensus
state when `RecoveryData::new` fails to locate a root block matching the
latest committed ledger info. That scenario has no remaining trigger:

- Fresh-genesis startup already goes to Normal: `maybe_bootstrap` writes a
  genesis `LedgerInfo` with `next_epoch_state = Some(...)`, so
  `find_root` takes its `ends_epoch()` branch and synthesises a virtual
  genesis block + QC from the ledger info alone.
- Normal running-node restarts also go to Normal: the consensus write
  path does `save_tree(block+qc) -> execute -> commit(ledger) -> prune`,
  with `prune_tree` preserving the root. On any clean crash/restart the
  root block is in ConsensusDB, so `find_root` succeeds directly.
- The residual "state-sync ran ahead of consensus" case from upstream
  Diem was neutralised when the state-sync coordinator was deleted in
  397cf34 — `state_computer.sync_to` became a no-op, so
  `fast_forward_sync` can no longer catch a ledger up anyway.

What the Recovery path would actually handle today is on-disk corruption
where `ConsensusDB` is missing the block that `PosLedgerDB`'s latest
ledger info points at. That combination isn't reachable through normal
operation, and the recovery attempt (now `sync_to`-neutered) couldn't
repair the ledger even if we took it. A panic on that condition is
strictly more diagnostic than silently stalling in Recovery.

Collapses:

- `RoundProcessor` enum -> `Option<RoundManager>`. The dispatch
  match in `process_event`, `filter_unverified_event`, `processor_mut`,
  and every `TestCommand` handler goes away.
- `LivenessStorageData` enum and `LedgerRecoveryData` struct merge into
  `RecoveryData`; `PersistentLivenessStorage::start` now returns
  `RecoveryData` directly (panics on the corruption case above).
  `find_root`'s virtual-genesis synthesis moves from `LedgerRecoveryData`
  into a free function taking `&LedgerInfo`.
- `RecoveryManager`, `start_recovery_manager`,
  `expect_recovery_data`, and `BlockStore::fast_forward_sync` are all
  deleted. `StateComputer::sync_to` and `StateSyncError` likewise —
  they had no remaining callers once `fast_forward_sync` was gone.
- `MockStorage::start` / `EmptyStorage::start` now return
  `RecoveryData` directly and panic on inconsistent setups (previously
  returned a `LedgerRecoveryData` variant that tests then had to unwrap).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@peilun-conflux peilun-conflux force-pushed the pos-absorb-recovery-into-normal branch from 2fe7163 to 3e16f64 Compare April 29, 2026 15:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant