Tune scan alerts#6166
Conversation
44d227a to
f38205c
Compare
replace disagreement alert with alert that scan is down actually [static] Signed-off-by: Nicu Reut <nicu.reut@digitalasset.com>
f38205c to
07cb30d
Compare
There was a problem hiding this comment.
Pull request overview
This PR retunes “scan connection disagreement” alerting by replacing the prior generic disagreement warning with two more specific alerts: one for successful (2xx) responses that disagree with BFT consensus and one for failed (non-2xx) responses that disagree (intended to better signal a scan being down). It also simplifies the configuration model by switching to a single fractional threshold.
Changes:
- Replace the old disagreement alert with separate “successful disagree” (critical) and “failed disagree” (warning) Grafana rules.
- Update alert substitution/config to use a single
alertThreshold(fraction) and add placeholders for connection-level filtering. - Update resolved deployment configs and expected rendered observability output accordingly.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| cluster/pulumi/observability/src/observability.ts | Updates placeholder substitution logic for the revised scan disagreement alerts and new threshold field. |
| cluster/pulumi/observability/src/config.ts | Updates monitoring config schema to replace prior thresholds with alertThreshold and retains request/connection exclusion lists. |
| cluster/pulumi/observability/grafana-alerting/scan_connection_disagreement_alerts.yaml | Replaces old warning rule with new success/failure disagreement alert rules and updated PromQL/annotations. |
| cluster/pulumi/observability/grafana-alerting/deleted.yaml | Adds deletion entry for the old scan disagreement alert UID. |
| cluster/expected/observability/expected.json | Updates expected rendered Grafana alerting YAML content. |
| cluster/deployment/scratchnete/config.resolved.yaml | Migrates scan disagreement config to alertThreshold. |
| cluster/deployment/scratchnetd/config.resolved.yaml | Migrates scan disagreement config to alertThreshold. |
| cluster/deployment/scratchnetc/config.resolved.yaml | Migrates scan disagreement config to alertThreshold. |
| cluster/deployment/scratchnetb/config.resolved.yaml | Migrates scan disagreement config to alertThreshold. |
| cluster/deployment/scratchneta/config.resolved.yaml | Migrates scan disagreement config to alertThreshold. |
| cluster/configs/shared/base.yaml | Migrates shared base scan disagreement config to alertThreshold. |
| // Fraction (0-1) of BFT consensus comparisons on a scan connection that may | ||
| // disagree with the consensus result before the warning alert fires. | ||
| disagreementRateThreshold: z.number(), | ||
| // Number of successful (2xx) responses that disagree with BFT consensus that | ||
| // may occur before the critical alert fires. | ||
| successfulDisagreementThreshold: z.number(), | ||
| // return a response (successful or failed) disagreeing with the consensus | ||
| // result before the success/failure alerts fire. | ||
| alertThreshold: z.number(), |
| - orgId: 1 | ||
| uid: aescandisagree1 |
| summary: Scan connection {{ $labels.scan_connection }} of job {{ $labels.job }} (node {{ $labels.node_name }}) in namespace {{ $labels.namespace }} disagreed with BFT consensus on request {{ $labels.request }} on {{ index $values "A" }} of comparisons in the last 30m (threshold $SCAN_DISAGREEMENT_RATE_THRESHOLD_PERCENT%). | ||
| description: More than $SCAN_DISAGREEMENT_SUCCESS_THRESHOLD_PERCENT% of the BFT consensus comparisons for this request on this scan connection returned a successful (2xx) response that disagreed with the BFT consensus result over the last 30 minutes. | ||
| severity: critical | ||
| summary: Scan connection {{ $labels.scan_connection }} of job {{ $labels.job }} (node {{ $labels.node_name }}) in namespace {{ $labels.namespace }} returned successful responses disagreeing with BFT consensus on {{ index $values "A" }} of comparisons for request {{ $labels.request }} in the last 30m (threshold $SCAN_DISAGREEMENT_SUCCESS_THRESHOLD_PERCENT%). |
There was a problem hiding this comment.
Nits: this message in general is too long, and a bit hard to follow. Would it make sense to have 2 shorter but clearer sentences?
There was a problem hiding this comment.
Always open to suggestions
replace disagreement alert with alert that scan is down actually
[static]
Pull Request Checklist
Cluster Testing
/cluster_teston this PR to request it, and ping someone with access to the DA-internal system to approve it./upgrade_teston this PR to request it, and ping someone with access to the DA-internal system to approve it./hdm_teston this PR to request it, and ping someone with access to the DA-internal system to approve it./lsu_teston this PR to request it, and ping someone with access to the DA-internal system to approve it.PR Guidelines
Fixes #n, and mention issues worked on using#nMerge Guidelines