Skip to content

fix(qbft_manager): fast-fail expired late callers when no entry remains#3

Open
peter941221 wants to merge 1 commit intodiegomrsantos:test/qbft-late-roundsfrom
peter941221:codex/issue-914-post-deadline-fast-fail
Open

fix(qbft_manager): fast-fail expired late callers when no entry remains#3
peter941221 wants to merge 1 commit intodiegomrsantos:test/qbft-late-roundsfrom
peter941221:codex/issue-914-post-deadline-fast-fail

Conversation

@peter941221
Copy link
Copy Markdown

PR Body 914

Title

qbft_manager: fast-fail expired late callers when no entry remains

Base / Stack Note

  • This PR is intentionally stacked on top of sigp/anchor#719.
  • Reviewer baseline: #719 already keeps decided instances alive until their beacon-chain deadline, so pre-deadline late callers can still receive the cached result.
  • This PR targets the narrower remaining gap from sigp/anchor#914: once cleanup has removed the entry entirely, a post-deadline local caller should not respawn a useless shadow instance.

Summary

Two local duty paths can independently call decide_instance(...) for the same logical committee-slot decision. After #719, a decided instance is intentionally retained until its deadline so late callers can still reuse the result. The remaining hole is the post-deadline case: once the cleaner removes the entry, a later caller can still hit Vacant, spawn a duplicate instance, and then wait through a pointless max-round timeout.

This patch makes that post-deadline path fail immediately when the entry is already gone, while preserving the current reuse path when an entry is still registered.

Changes

  • add a small current_slot_exceeds_deadline(...) helper in qbft_manager
  • in decide_instance(...):
    • if current_slot > deadline and no entry exists, return Ok(Completed::TimedOut) immediately
    • if current_slot > deadline and an entry still exists, preserve the current reuse path
    • otherwise keep the existing spawn-or-join behavior
  • add regression coverage for:
    • expired late caller returns immediate timeout and does not spawn a shadow instance
    • a still-registered decided instance can still serve a late caller after deadline but before cleanup runs

Why This Shape

I considered a blunter rule of "if expired, always fast-fail", but that would regress the good part of #719: a late caller should still be able to reuse a decided result while the entry is intentionally retained before cleanup.

So the narrower invariant here is:

  • expiry should block respawn when the entry is absent
  • expiry should not suppress a still-registered result window that #719 already established

Validation

  • cargo fmt --all --check
  • cargo test -p qbft_manager
  • cargo clippy -p qbft_manager --tests -- -D warnings
  • git diff --check

Residual Scope

This PR only seals the local late-caller path in decide_instance(...).

receive_data(...) -> pass_to_instance(...) still uses get_or_spawn_instance(...) unconditionally, so expired inbound network traffic remains a separate lifecycle-policy concern. I am keeping that out of this patch because it likely belongs with the broader timing-policy consolidation discussed in sigp/anchor#916.

References

  • Issue: https://github.com/sigp/anchor/issues/914
  • Dependency baseline: https://github.com/sigp/anchor/pull/719
  • Related architecture issue: https://github.com/sigp/anchor/issues/916
  • Protocol deadline context: EIP-7045 https://eips.ethereum.org/EIPS/eip-7045
  • Concurrency background: Herlihy and Wing, "Linearizability: A Correctness Condition for Concurrent Objects" https://doi.org/10.1145/78969.78972
  • Timing-model background: Dwork, Lynch, and Stockmeyer, "Consensus in the Presence of Partial Synchrony" https://doi.org/10.1145/42282.42283

@peter941221 peter941221 changed the title qbft_manager: fast-fail expired late callers when no entry remains fix(qbft_manager): fast-fail expired late callers when no entry remains Apr 1, 2026
@peter941221
Copy link
Copy Markdown
Author

Quick follow-up on this narrow stacked fix because linked issue #914 is still open and this PR has been quiet since the initial push.

The branch is still intentionally narrow and still intentionally stacked on sigp/anchor#719 rather than trying to reopen the broader cleanup-policy discussion here.

The claim remains just:

  • once deadline cleanup has already removed the entry, a post-deadline local caller should fast-fail with Completed::TimedOut instead of spawning a useless shadow instance
  • if the entry is still registered, keep the existing reuse path from #719

All visible checks on this PR are still green.

If you would prefer to wait for #719 to settle first, that is completely fine. I can keep this branch stacked, rebase it after #719, or respin the same narrow fix directly on top of whatever base you would rather review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant