feat: add attestation aggregate coverage metrics#386
Conversation
🤖 Kimi Code ReviewThis PR adds attestation aggregate coverage metrics following the existing patterns in the codebase. The implementation is correct and idiomatic. Assessment:
Minor observation:
No security vulnerabilities or consensus risks identified—this is pure instrumentation with no state transition logic. Verdict: LGTM. Clean, well-documented addition that follows established patterns. Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt |
🤖 Codex Code ReviewNo findings. This PR is limited to new Prometheus label constants and gauge registration in metrics.rs and startup seeding in metrics.rs. I don’t see any impact on fork choice, attestation validation, justification/finalization, state transition, XMSS handling, or SSZ behavior. The added code is straightforward, uses the existing One residual note: this patch only registers/seeds the new metrics; there are no writer APIs or call sites in this PR, so the metrics will stay at their seeded values until follow-up instrumentation lands. I also couldn’t complete Automated review by OpenAI Codex · gpt-5.4 · custom prompt |
🤖 Claude Code ReviewHere is the full code review: PR #386 —
|
Greptile SummaryRegisters three new
Confidence Score: 5/5This PR is safe to merge — it is purely additive metric registration with no changes to any logic, state, or existing code paths. The change adds three new IntGaugeVec statics and seeds 18 default zero series in init(). No existing code is modified, no consumers of the new metrics exist yet, and the implementation closely mirrors the patterns already established in the file. No files require special attention. The single changed file is straightforward metric registration.
|
| Filename | Overview |
|---|---|
| crates/blockchain/src/metrics.rs | Adds three IntGaugeVec statics and two pub label-set constants; seeds 18 default zero series in init(). Follows all existing patterns; no logic, no consumers yet. |
Reviews (1): Last reviewed commit: "feat: add attestation aggregate coverage..." | Re-trigger Greptile
MegaRedHand
left a comment
There was a problem hiding this comment.
We still need to emit the metrics
|
Instrumentation added in 855f56d. Five emission sites now write to the 18 series registered in e9a04f7:
Store gets a small
Ready for another look. |
|
@MegaRedHand please re-review the PR |
Port leanSpec PR #735: register three IntGaugeVec metrics describing
attestation aggregate coverage, with default zero-valued series so
dashboards render from a fresh node startup.
- lean_attestation_aggregate_coverage_validators (section, subnet)
- lean_attestation_aggregate_coverage_subnets (section)
- lean_attestation_aggregate_coverage_diff_validators (direction)
ATTESTATION_AGGREGATE_COVERAGE_SECTIONS and
ATTESTATION_AGGREGATE_COVERAGE_DIFF_DIRECTIONS are exported as the
single source of truth for label sets. init() forces the new statics
and seeds combined-subnet, section, and direction series to 0 (18
default series total). Per-subnet (subnet="subnet_N") series appear
lazily when instrumentation writes them.
Registration only. The producer side (per-slot coverage computation,
de-duplication across payloads, and the chain-status log line) ports
zeam #876 and lands in a follow-up PR.
The diff_validators help text diverges from upstreams terse phrasing
to spell out the symmetric-difference semantics (block_only: in block
but not in local timely pool; timely_only: the reverse). Metric name,
labels, and values are unchanged.
b879490 to
6622da5
Compare
Port the producer side of zeam #876 on top of the metrics registered in
the previous commit. After this commit, all 18 coverage series receive
real per-slot updates from chain activity.
Five emission sites:
- accept_new_attestations (store.rs): captures `new_payloads`
participant bits BEFORE promote and stashes them as a
CoverageSnapshot on the Store. Read at the next slot boundary to
populate the `timely` section ("prev_new" in zeam).
- on_block_core (store.rs): mirrors the imported block s per-AttData
aggregation bits into Store::last_block_coverage. Observability-only;
fork choice is unchanged.
- on_tick interval 0 (lib.rs): emits the post-block-merge report for
`slot - 1`. Computes `timely`/`late`/`block`/`combined` from the
stashed snapshots and the current `new_payloads`, then emits the
diff_validators direction counts as the symmetric difference between
`block` and `timely`.
- start_aggregation_session (lib.rs): emits `agg_start_new` from the
current `new_payloads` right before fork-choice aggregation runs at
interval 2.
- propose_block (lib.rs): emits `proposal_payloads`,
`proposal_gossip`, and `proposal_combined` after the block is built.
Each validator set in the block is classified by whether the
AttestationData has a matching known-payload proof.
New module crates/blockchain/src/coverage.rs holds the Coverage type
(seen + has_subnet bitsets, derived subnet via vid % committee_count to
match the gossip subnet assignment) plus the 3 emission helpers and 6
unit tests covering add_bits, merge_from, diff_counts, empty/zero/out-
of-range edge cases.
Storage gets a CoverageSnapshot type and two Arc<Mutex<Option<…>>>
fields on Store. No proofs are duplicated — only AggregationBits are
captured, keeping the per-slot allocation in the tens of bytes per
entry. The pre-merge capture happens inside accept_new_attestations
just before promote_new_aggregated_payloads, so consumer-side timing
concerns stay in the existing tick path.
BlockChain::spawn now takes attestation_committee_count as a
parameter; bin/ethlambda/src/main.rs already resolves the value
(CLI > validator-config.yaml > 1) and passes it through. The number
of attestation committees was previously only known to P2P (for
subnet subscriptions); the coverage emitters need it to derive
subnet ids.
Delete crates/blockchain/src/coverage.rs and move the three coverage emitters (post_block, agg_start_new, proposal) into lib.rs as private free functions. The Coverage struct is replaced with Vec<bool> locals (seen for validators, has_subnet for subnets), plus three small helpers (cov_add, cov_record, or_into). Each emit_* function was called from exactly one site, so the previous abstraction added no DRY benefit. The 6 unit tests in coverage.rs were exercising the wrappers bitset ops, not the emission contract, so they go away with the wrapper. Behavior, metric names, labels, and values are unchanged. Net effect on the PR: roughly -200 lines.
…sip coverage emit_proposal_coverage classified block-included validators as payload- vs gossip-sourced by checking known_aggregated_payloads. But ethlambda builds blocks exclusively from that same pool (build_block -> extend_proofs_greedily), so every included validator is by construction payload-seen: proposal_gossip was always 0 and proposal_payloads always equalled proposal_combined. Keep only proposal_combined (the set of validators in the proposed block). This also removes the per-attestation hash_tree_root lookup on the block- building path and the cross-AttestationData payload_seen contamination the old classification carried. Default-seeded series drop from 18 to 14.
…gate-coverage-metrics # Conflicts: # crates/blockchain/src/store.rs
Key every coverage section by the attestation data.slot (voting round)
and fire the per-slot report at interval 1, matching zeam #876:
- timely: CoverageSnapshot now stores (data.slot, bits) per entry and the
report filters by data.slot, removing the non-deterministic slot picked
from HashMap iteration order.
- block: read from the canonical head block at report time (filtered to
data.slot) instead of an unconditionally-overwritten per-block snapshot,
so fork blocks cant poison it. Removes last_block_coverage from Store.
- trigger: emit at interval 1 (not 0) so the block carrying the reporting
slots votes has been received and processed.
- has_any gate: emit nothing when no section has coverage for the round,
instead of a misleading block_only=0 / timely_only=N reading.
The storage Store carried observability-only state for the attestation aggregate coverage report (the CoverageSnapshot type and a pre_merge_new_coverage field), which has no fork-choice or state-transition role. Move it into the blockchain crate behind a new PreMergeCoverage handle owned by the BlockChainServer actor, threaded through on_tick / accept_new_attestations / get_proposal_head / produce_block_with_signatures as an Option so the snapshot is captured at promotion and read at the next slot boundary exactly as before. Tests and the RPC test-driver pass None. Pure refactor, no metric behavior change.
… module Address review feedback: relocate all attestation aggregate coverage observability into a new crates/blockchain/src/coverage.rs. The pre-merge new_payloads snapshot is now captured by the actor in BlockChainServer::on_tick (when proposing or at the end of the slot) instead of being threaded as a parameter through store::on_tick / accept_new_attestations / get_proposal_head / produce_block_with_signatures. Since the snapshot is owned solely by the actor and only touched from its single-threaded message loop, the Arc<Mutex> handle is replaced by a plain Option<CoverageSnapshot> field. No behavior change - coverage emission is observability-only.
|
@MegaRedHand addressed all the review comments in c751426 — one refactor that pulls the coverage observability out of the storage/STF path:
No behavior change — coverage emission stays observability-only. |
MegaRedHand
left a comment
There was a problem hiding this comment.
Review of the attestation aggregate coverage metrics. All findings are in observability-only code (no fork-choice or state-transition impact), but two of them make the new metrics misleading under specific conditions. Inline comments below. Findings 1 and 2 are the ones worth addressing before merge; 3 is an efficiency cleanup and 5 is a schema/doc gap.
| # | Finding | Severity |
|---|---|---|
| 1 | Proposer interval-0 snapshot overwrites the reported round's timely cohort |
Medium |
| 2 | Emit-gate doesn't suppress the missed-slot case its comment promises | Medium-low |
| 3 | new_aggregated_payloads clones 1 MiB proof blobs to keep only bits (×2/slot) |
Medium (perf) |
| 5 | subnet=subnet_N series never written despite help text |
Low (doc) |
| // the end of the slot (interval 4); skip empty snapshots so a missed | ||
| // round keeps the last set we saw. Pure observability. | ||
| let will_promote = interval == 4 || (interval == 0 && proposer_validator_id.is_some()); | ||
| if will_promote && let Some(snapshot) = coverage::snapshot_new_payloads(&self.store) { |
There was a problem hiding this comment.
Finding 1 (Medium): proposer interval-0 snapshot can corrupt the timely cohort it's about to report.
Walking the timeline for a slot S this node proposes:
- Slot
S-1interval 4:snapshot_new_payloadscaptures roundS-1aggregates intopre_merge_coverage, thenstore::on_tickpromotes (drainsnew_payloads). - Between then and slot
Sinterval 0, gossiped aggregates arrive —store.rsinsert_new_aggregated_payloadpushes intonew_payloadsat arbitrary network-driven times. - Slot
Sinterval 0: since we're proposer,will_promotefires andsnapshot_new_payloadsreturnsSome(non-empty due to those stragglers), overwriting the goodS-1snapshot with a partial set. - Slot
Sinterval 1:emit_post_block_coveragefilters the overwritten snapshot todata.slot == S-1and finds only the stragglers.
Result: timely under-reports and diff_validators{block_only} is inflated, but only on slots this node proposes, so the bias is intermittent and proposer-correlated. The interval-0 snapshot doesn't actually serve the round it's reporting; consider only refreshing pre_merge_coverage when the new snapshot covers the round not yet reported, or skip the interval-0 capture entirely.
| // Emit nothing when there is no coverage for this round, rather than | ||
| // pushing a misleading all-zero report (e.g. `block_only=0, timely_only=N` | ||
| // on a missed slot). Gauges retain their previous value. | ||
| if !combined_v.iter().any(|&b| b) { |
There was a problem hiding this comment.
Finding 2 (Medium-low): this guard doesn't suppress the case the comment promises.
The comment above says it emits nothing to avoid "a misleading all-zero report (e.g. block_only=0, timely_only=N on a missed slot)". But the guard keys on combined, and combined = timely | late | block.
On a missed slot, store.get_block(store.head()) returns an older block whose att.data.slot != reporting_slot, so block_v is all-false — but the timely snapshot for S-1 is populated, so combined is non-empty and the report is emitted, pushing block_only=0, timely_only=N: exactly the case the comment claims to suppress.
To match the stated intent, gate on block_v (or require both block_v and timely_v non-empty) rather than combined_v.
| let buf = self.new_payloads.lock().unwrap(); | ||
| buf.data | ||
| .iter() | ||
| .map(|(root, entry)| (*root, (entry.data.clone(), entry.proofs.clone()))) |
There was a problem hiding this comment.
Finding 3 (Medium, perf): clones full proof blobs only to keep the participant bits.
entry.proofs.clone() deep-clones Vec<AggregatedSignatureProof>, and each proof carries proof_data: ByteList<1_048_576> (block.rs:88). The only consumers — snapshot_new_payloads and the late loop in emit_post_block_coverage — keep just proof.participants and drop the bytes. Worse, emit_post_block_coverage calls new_aggregated_payloads() a second time within the same tick, so a proposer-slot tick deep-clones the entire pending proof pool twice purely to read bitfields.
A participants-only accessor (e.g. returning Vec<(u64 /*data.slot*/, AggregationBits)>) eliminates both clones.
| fn cov_record(section: &str, seen: &[bool], has_subnet: &[bool]) { | ||
| metrics::set_attestation_aggregate_coverage_validators( | ||
| section, | ||
| "combined", |
There was a problem hiding this comment.
Finding 5 (Low, doc/schema): per-subnet subnet=subnet_N series are never written.
cov_record only ever calls set_attestation_aggregate_coverage_validators(section, "combined", ...). The has_subnet bitsets are computed but only feed the subnet-count gauge, never the per-subnet validator series. Yet the lean_attestation_aggregate_coverage_validators help text advertises "subnet=subnet_N is per-subnet coverage", and the gauge carries a subnet label dimension for it.
As written, querying subnet=subnet_N returns nothing — the schema overstates what's emitted. Either wire up per-subnet emission or trim the help text/label so dashboards don't expect a series that never appears.
Address MegaRedHand's review findings on the coverage observability: - Drop the proposer interval-0 pre-merge snapshot. By interval 0 the round's votes are already promoted, so new_payloads holds only stragglers; capturing them overwrote the good interval-4 snapshot and made the timely section under-report on slots this node proposes. Snapshot only at interval 4; stragglers now surface in the late section as intended. - Gate the post-block report on block_v (the round's votes appearing in the canonical head block) instead of combined_v. Gating on combined still fired on a missed slot, pushing the misleading block_only=0, timely_only=N the diff is meant to avoid. - Add Store::new_aggregated_payload_participants() returning only the participant bitfields, replacing new_aggregated_payloads() which deep-cloned the multi-megabyte proof_data blobs on every read just to extract bits. - Correct the validators-gauge help text: only subnet=combined is emitted; the per-subnet subnet_N breakdown is reserved and not yet populated. No fork-choice or state-transition change - coverage stays observability-only.
|
Thanks for the thorough pass @MegaRedHand — all four addressed in 6fcefec (rebased onto the latest Finding 1 (proposer interval-0 snapshot corrupts Finding 2 (emit-gate doesn't suppress the missed-slot case). Right — Finding 3 (1 MiB proof clones to read bits). Replaced Finding 5 (
|
Description / Motivation
Ports two upstream changes that together add attestation aggregate coverage observability:
After this PR, all 14 coverage series receive real per-slot updates from chain activity.
What Changed
Registration (
crates/blockchain/src/metrics.rs)pub const &[&str]label-set constants — single source of truth for sections and directions.IntGaugeVecstatics:lean_attestation_aggregate_coverage_validators{section, subnet}—subnet="combined"is the section total;subnet="subnet_N"is per-subnet coverage (lazy).lean_attestation_aggregate_coverage_subnets{section}— covered-subnet count per section.lean_attestation_aggregate_coverage_diff_validators{direction}— symmetric difference between block-included and locally-aggregated pre-merge aggregates.init()forces the statics and seeds 14 default zero-valued series.Sections:
timely,late,block,combined,agg_start_new,proposal_combined.Directions:
block_only,timely_only.Emission (4 sites)
Every per-slot section is keyed by the attestation
data.slot(the voting round being reported), sotimely/late/block/combineddescribe the same cohort (matching zeam #876).accept_new_attestations(blockchain/store.rs) — before each promote, snapshotsnew_payloadsparticipant bits, tagged per entry with theirdata.slot, stashed on the Store. Read at the next slot boundary to populate thetimelysection ("prev_new" in zeam).on_tickinterval 1 (blockchain/lib.rs) — emits the post-block report forslot - 1. Fired at interval 1, not 0, so the block carrying that round's votes (proposed at interval 0 of this slot) has typically been received and processed, letting theblocksection observe the same round. Computestimely(pre-merge snapshot filtered to the round),late(currentnew_payloadsfor the round),block(attestations in the canonical head block filtered to the round — canonical by construction, so a fork block can't poison it), andcombined(their union); then emitsdiff_validatorsas the symmetric difference betweenblockandtimely. Emits nothing when no section has coverage for the round, so a missed slot doesn't surface as a misleadingblock_only=0, timely_only=N.start_aggregation_session(blockchain/lib.rs) — emitsagg_start_newfrom currentnew_payloadsright before fork-choice aggregation at interval 2.propose_block(blockchain/lib.rs) — emitsproposal_combined(the full set of validators included across the block's aggregated attestations) after the block is built. (A payload-vs-gossip split was considered but dropped: ethlambda builds blocks exclusively fromknown_aggregated_payloads, so every included validator is payload-sourced by construction —proposal_gossipwould always be 0 andproposal_payloadswould always equalproposal_combined.)Supporting changes
fns at the bottom ofblockchain/src/lib.rs. They useVec<bool>locals (seenfor validators,has_subnetfor subnets) plus three small helpers (cov_add,cov_record,or_into). Subnet derivation isvid % committee_count, matching the gossip subnet assignment incrates/net/p2p/src/lib.rs.Storegets a singleCoverageSnapshotfield (pre_merge_new_coverage) holding(data.slot, AggregationBits)entries captured before each promote — only participant bits, no proof duplication. Theblocksection needs no stored state: it is read from the canonical head block at report time.BlockChain::spawnnow takesattestation_committee_count(resolved inmain.rsfrom CLI > validator-config.yaml > 1; previously only known to P2P for subnet subscriptions).Operator interpretation of
diff_validatorsThe aggregation pipeline produces two pools for the same slot:
timely(locally aggregated pre-merge) andblock(what the proposer included). The diff counts validators in the symmetric difference:block_onlypersistently high → this node was slow to receive/aggregate via gossip; proposer had a better view.timely_onlypersistently high → proposer omitted attestations the network had time to gossip.Correctness / Behavior Guarantees
diff_validatorshelp text intentionally diverges from upstream's terse phrasing to spell out the symmetric-difference semantics. Metric name, labels, and values are unchanged; dashboards built against any client's schema work as-is.6 × 65 = 390series.Tests Added / Run
No new unit tests. The earlier draft of this PR included unit tests for an internal
Coveragebitset wrapper; once the wrapper was inlined asVec<bool>locals, those tests were exercising trivial iterator ops rather than the emission contract, so they were dropped.Commands run:
cargo fmt --all -- --check— cleanmake lint(clippy-D warnings) — cleancargo test -p ethlambda-blockchain --release --lib --bins— 21 passedcargo test -p ethlambda-storage --release— 38 passedcargo test -p ethlambda-blockchain --release --test forkchoice_spectests— 62 passed, 8 failed (allAttestationTooFarInFuture, pre-existing fixture flakes onmain, unrelated to this PR).Related Issues / PRs
✅ Verification Checklist
make fmt— cleanmake lint(clippy with-D warnings) — cleancargo test --workspace --release— passing modulo 8 pre-existing forkchoice fixture flakes documented above (unrelated to this PR)