HDDS-14954. Add logging and metric for SCM safemode duration. #10018

Draft
sadanand48 wants to merge 3 commits into apache:master from sadanand48:HDDS-14954

@sadanand48
Contributor

What changes were proposed in this pull request?

Add metrics for:

  1. The time SCM took to exit safemode.
  2. The total time taken by the refresh call in ContainerSafemodeRule.
     This is needed because refresh has been observed to take a long time in large clusters. The metric can be used for monitoring.
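
A minimal sketch of how the two durations could be captured, assuming a monotonic clock and plain `AtomicLong` values standing in for the Hadoop gauges. The class name `SafeModeDurationSketch` and its methods are hypothetical and are not the code in this PR:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical stand-in for the safemode metrics; not the actual Ozone class. */
final class SafeModeDurationSketch {
  // AtomicLong stands in for a gauge such as MutableGaugeLong (assumption).
  private final AtomicLong scmSafeModeExitDurationMs = new AtomicLong();
  private final AtomicLong lastRefreshDurationMs = new AtomicLong();
  // Injectable nanosecond clock so the timing logic can be tested deterministically.
  private final LongSupplier nanoClock;
  private final long enteredSafeModeNanos;

  SafeModeDurationSketch(LongSupplier nanoClock) {
    this.nanoClock = nanoClock;
    // Safemode is assumed to start when this object is created.
    this.enteredSafeModeNanos = nanoClock.getAsLong();
  }

  /** Records the time elapsed since safemode was entered. */
  void recordSafeModeExit() {
    long elapsedNanos = nanoClock.getAsLong() - enteredSafeModeNanos;
    scmSafeModeExitDurationMs.set(TimeUnit.NANOSECONDS.toMillis(elapsedNanos));
  }

  /** Runs one refresh and records how long it took, even if it throws. */
  void timeRefresh(Runnable refresh) {
    long startNanos = nanoClock.getAsLong();
    try {
      refresh.run();
    } finally {
      long elapsedNanos = nanoClock.getAsLong() - startNanos;
      lastRefreshDurationMs.set(TimeUnit.NANOSECONDS.toMillis(elapsedNanos));
    }
  }

  long getScmSafeModeExitDurationMs() {
    return scmSafeModeExitDurationMs.get();
  }

  long getLastRefreshDurationMs() {
    return lastRefreshDurationMs.get();
  }

  public static void main(String[] args) {
    SafeModeDurationSketch sketch = new SafeModeDurationSketch(System::nanoTime);
    sketch.timeRefresh(() -> { /* placeholder for the rule's refresh work */ });
    sketch.recordSafeModeExit();
    System.out.println("refresh ms = " + sketch.getLastRefreshDurationMs());
    System.out.println("safemode exit ms = " + sketch.getScmSafeModeExitDurationMs());
  }
}
```

Using `System.nanoTime` rather than wall-clock time in the sketch avoids skew from system clock adjustments. Whether the PR measures it this way is not stated here.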

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-14954

@errose28
Contributor

errose28 commented Apr 6, 2026

Thanks for adding this @sadanand48. Let's add the new safemode metrics to the safemode dashboard as well.

Contributor

@sreejasahithi sreejasahithi left a comment

Thanks @sadanand48 for working on this.
Could you please add the Grafana dashboard screenshots to the PR description?

Comment on lines +62 to +68
@Metric("Wall-clock time (ms) SCM spent in safe mode for the last exit")
private MutableGaugeLong scmSafeModeExitDurationMs;
@Metric("Duration (ms) of the last Ratis container safe mode rule incremental refresh")
private MutableGaugeLong lastRatisContainerSafeModeRuleRefreshDurationMs;
@Metric("Duration (ms) of the last EC container safe mode rule incremental refresh")
private MutableGaugeLong lastEcContainerSafeModeRuleRefreshDurationMs;

Contributor

Could you please add tests for these metrics? Since the existing test infrastructure already sets up everything needed, coverage can be added with a line or two in the existing tests rather than by writing new ones.
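
As a rough illustration of the kind of check meant here, the sketch below is plain Java and does not use the Ozone test infrastructure. The class `RefreshGaugeCheckSketch`, the enum `ReplicationKind`, and the `-1` "never refreshed" sentinel are all assumptions made for the example, not names or behavior from the PR:

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical holder for per-rule refresh durations; not the Ozone metrics class. */
final class RefreshGaugeCheckSketch {
  /** Hypothetical enum for the two container safemode rules named in the diff. */
  enum ReplicationKind { RATIS, EC }

  private final Map<ReplicationKind, AtomicLong> lastRefreshDurationMs =
      new EnumMap<>(ReplicationKind.class);

  RefreshGaugeCheckSketch() {
    for (ReplicationKind kind : ReplicationKind.values()) {
      // -1 marks "never refreshed" in this sketch so a test can tell it apart from 0 ms.
      lastRefreshDurationMs.put(kind, new AtomicLong(-1L));
    }
  }

  void setLastRefreshDurationMs(ReplicationKind kind, long durationMs) {
    lastRefreshDurationMs.get(kind).set(durationMs);
  }

  long getLastRefreshDurationMs(ReplicationKind kind) {
    return lastRefreshDurationMs.get(kind).get();
  }

  public static void main(String[] args) {
    RefreshGaugeCheckSketch metrics = new RefreshGaugeCheckSketch();
    // Simulate a Ratis rule refresh that took 12 ms.
    metrics.setLastRefreshDurationMs(ReplicationKind.RATIS, 12L);

    // The one-line checks an existing test could add after triggering a refresh.
    if (metrics.getLastRefreshDurationMs(ReplicationKind.RATIS) < 0) {
      throw new AssertionError("Ratis refresh duration was not recorded");
    }
    if (metrics.getLastRefreshDurationMs(ReplicationKind.EC) != -1L) {
      throw new AssertionError("EC gauge should be untouched by a Ratis refresh");
    }
    System.out.println("refresh gauge checks passed");
  }
}
```

In a real test the assertion would read the gauge from the metrics object after the existing setup triggers a refresh. The second check guards against the Ratis and EC gauges being wired to the wrong rule.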
