Skip to content

Tune scan alerts#6166

Merged
nicu-da merged 1 commit into
mainfrom
nicu/chore/scan_alerts
Jul 1, 2026
Merged

Tune scan alerts#6166
nicu-da merged 1 commit into
mainfrom
nicu/chore/scan_alerts

Conversation

@nicu-da

@nicu-da nicu-da commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

replace disagreement alert with alert that scan is down actually

[static]

Pull Request Checklist

Cluster Testing

  • If a cluster test is required, comment /cluster_test on this PR to request it, and ping someone with access to the DA-internal system to approve it.
  • If an upgrade test is required, comment /upgrade_test on this PR to request it, and ping someone with access to the DA-internal system to approve it.
  • If a hard-migration test is required (from the latest release), comment /hdm_test on this PR to request it, and ping someone with access to the DA-internal system to approve it.
  • If a logical synchronizer upgrade test is required (from canton-3.5), comment /lsu_test on this PR to request it, and ping someone with access to the DA-internal system to approve it.

PR Guidelines

  • Include any change that might be observable by our partners or affect their deployment in the release notes.
  • Specify fixed issues with Fixes #n, and mention issues worked on using #n
  • Include a screenshot for frontend-related PRs - see README or use your favorite screenshot tool

Merge Guidelines

  • Make the git commit message look sensible when squash-merging on GitHub (most likely: just copy your PR description).

@nicu-da nicu-da force-pushed the nicu/chore/scan_alerts branch 5 times, most recently from 44d227a to f38205c Compare June 30, 2026 12:21
replace disagreement alert with alert that scan is down actually

[static]

Signed-off-by: Nicu Reut <nicu.reut@digitalasset.com>
@nicu-da nicu-da force-pushed the nicu/chore/scan_alerts branch from f38205c to 07cb30d Compare June 30, 2026 12:33
@nicu-da nicu-da marked this pull request as ready for review June 30, 2026 12:47

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR retunes “scan connection disagreement” alerting by replacing the prior generic disagreement warning with two more specific alerts: one for successful (2xx) responses that disagree with BFT consensus and one for failed (non-2xx) responses that disagree (intended to better signal a scan being down). It also simplifies the configuration model by switching to a single fractional threshold.

Changes:

  • Replace the old disagreement alert with separate “successful disagree” (critical) and “failed disagree” (warning) Grafana rules.
  • Update alert substitution/config to use a single alertThreshold (fraction) and add placeholders for connection-level filtering.
  • Update resolved deployment configs and expected rendered observability output accordingly.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
cluster/pulumi/observability/src/observability.ts Updates placeholder substitution logic for the revised scan disagreement alerts and new threshold field.
cluster/pulumi/observability/src/config.ts Updates monitoring config schema to replace prior thresholds with alertThreshold and retains request/connection exclusion lists.
cluster/pulumi/observability/grafana-alerting/scan_connection_disagreement_alerts.yaml Replaces old warning rule with new success/failure disagreement alert rules and updated PromQL/annotations.
cluster/pulumi/observability/grafana-alerting/deleted.yaml Adds deletion entry for the old scan disagreement alert UID.
cluster/expected/observability/expected.json Updates expected rendered Grafana alerting YAML content.
cluster/deployment/scratchnete/config.resolved.yaml Migrates scan disagreement config to alertThreshold.
cluster/deployment/scratchnetd/config.resolved.yaml Migrates scan disagreement config to alertThreshold.
cluster/deployment/scratchnetc/config.resolved.yaml Migrates scan disagreement config to alertThreshold.
cluster/deployment/scratchnetb/config.resolved.yaml Migrates scan disagreement config to alertThreshold.
cluster/deployment/scratchneta/config.resolved.yaml Migrates scan disagreement config to alertThreshold.
cluster/configs/shared/base.yaml Migrates shared base scan disagreement config to alertThreshold.

Comment on lines 108 to +111
// Fraction (0-1) of BFT consensus comparisons on a scan connection that may
// disagree with the consensus result before the warning alert fires.
disagreementRateThreshold: z.number(),
// Number of successful (2xx) responses that disagree with BFT consensus that
// may occur before the critical alert fires.
successfulDisagreementThreshold: z.number(),
// return a response (successful or failed) disagreeing with the consensus
// result before the success/failure alerts fire.
alertThreshold: z.number(),
Comment on lines +18 to +19
- orgId: 1
uid: aescandisagree1
summary: Scan connection {{ $labels.scan_connection }} of job {{ $labels.job }} (node {{ $labels.node_name }}) in namespace {{ $labels.namespace }} disagreed with BFT consensus on request {{ $labels.request }} on {{ index $values "A" }} of comparisons in the last 30m (threshold $SCAN_DISAGREEMENT_RATE_THRESHOLD_PERCENT%).
description: More than $SCAN_DISAGREEMENT_SUCCESS_THRESHOLD_PERCENT% of the BFT consensus comparisons for this request on this scan connection returned a successful (2xx) response that disagreed with the BFT consensus result over the last 30 minutes.
severity: critical
summary: Scan connection {{ $labels.scan_connection }} of job {{ $labels.job }} (node {{ $labels.node_name }}) in namespace {{ $labels.namespace }} returned successful responses disagreeing with BFT consensus on {{ index $values "A" }} of comparisons for request {{ $labels.request }} in the last 30m (threshold $SCAN_DISAGREEMENT_SUCCESS_THRESHOLD_PERCENT%).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nits: this message in general is too long, and a bit hard to follow. Would it make sense to have 2 shorter but clearer sentences?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Always open to suggestions

@nicu-da nicu-da merged commit 002019a into main Jul 1, 2026
49 checks passed
@nicu-da nicu-da deleted the nicu/chore/scan_alerts branch July 1, 2026 10:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants