Commit 9a96af6

Merge branch 'development' into alex_readme_nodestatus
2 parents 5fe4fb5 + 4dc5981 commit 9a96af6

9 files changed

Lines changed: 313 additions & 48 deletions

docs/PLUGIN_DOC.md

Lines changed: 2 additions & 2 deletions
@@ -4,7 +4,7 @@
 
 | Plugin | Collection | Analyzer Args | Collection Args | DataModel | Collector | Analyzer |
 | --- | --- | --- | --- | --- | --- | --- |
-| AmdSmiPlugin | bad-pages<br>firmware --json<br>list --json<br>metric -g all<br>partition --json<br>process --json<br>ras --cper --folder={folder}<br>ras --afid --cper-file {cper_file}<br>static -g all --json<br>static -g {gpu_id} --json<br>topology<br>version --json<br>xgmi -l<br>xgmi -m | **Analyzer Args:**<br>- `check_static_data`: bool — If True, run static data checks (e.g. driver version, partition mode).<br>- `expected_gpu_processes`: Optional[int] — Expected number of GPU processes.<br>- `expected_max_power`: Optional[int] — Expected maximum power value (e.g. watts).<br>- `expected_driver_version`: Optional[str] — Expected AMD driver version string.<br>- `expected_memory_partition_mode`: Optional[str] — Expected memory partition mode (e.g. sp3, dp).<br>- `expected_compute_partition_mode`: Optional[str] — Expected compute partition mode.<br>- `expected_pldm_version`: Optional[str] — Expected PLDM version string.<br>- `l0_to_recovery_count_error_threshold`: Optional[int] — L0-to-recovery count above which an error is raised.<br>- `l0_to_recovery_count_warning_threshold`: Optional[int] — L0-to-recovery count above which a warning is raised.<br>- `vendorid_ep`: Optional[str] — Expected endpoint vendor ID (e.g. for PCIe).<br>- `vendorid_ep_vf`: Optional[str] — Expected endpoint VF vendor ID.<br>- `devid_ep`: Optional[str] — Expected endpoint device ID.<br>- `devid_ep_vf`: Optional[str] — Expected endpoint VF device ID.<br>- `sku_name`: Optional[str] — Expected SKU name string for GPU.<br>- `expected_xgmi_speed`: Optional[list[float]] — Expected xGMI speed value(s) (e.g. link rate).<br>- `analysis_range_start`: Optional[datetime.datetime] — Start of time range for time-windowed analysis.<br>- `analysis_range_end`: Optional[datetime.datetime] — End of time range for time-windowed analysis. | **Collection Args:**<br>- `cper_file_path`: Optional[str] — Path to CPER folder or file for RAS AFID collection (ras --afid --cper-file). | [AmdSmiDataModel](#AmdSmiDataModel-Model) | [AmdSmiCollector](#Collector-Class-AmdSmiCollector) | [AmdSmiAnalyzer](#Data-Analyzer-Class-AmdSmiAnalyzer) |
+| AmdSmiPlugin | bad-pages<br>firmware --json<br>list --json<br>metric -g all<br>partition --json<br>process --json<br>ras --cper --folder={folder}<br>ras --afid --cper-file {cper_file}<br>static -g all --json<br>static -g {gpu_id} --json<br>topology<br>version --json<br>xgmi -l<br>xgmi -m | **Analyzer Args:**<br>- `check_static_data`: bool — If True, run static data checks (e.g. driver version, partition mode).<br>- `expected_gpu_processes`: Optional[int] — Expected number of GPU processes.<br>- `expected_max_power`: Optional[int] — Expected maximum power value (e.g. watts).<br>- `expected_driver_version`: Optional[str] — Expected AMD driver version string.<br>- `expected_memory_partition_mode`: Optional[str] — Expected memory partition mode (e.g. sp3, dp).<br>- `expected_compute_partition_mode`: Optional[str] — Expected compute partition mode.<br>- `expected_firmware_versions`: Optional[dict[str, str]] — Expected firmware versions keyed by amd-smi fw_id (e.g. PLDM_BUNDLE).<br>- `l0_to_recovery_count_error_threshold`: Optional[int] — L0-to-recovery count above which an error is raised.<br>- `l0_to_recovery_count_warning_threshold`: Optional[int] — L0-to-recovery count above which a warning is raised.<br>- `vendorid_ep`: Optional[str] — Expected endpoint vendor ID (e.g. for PCIe).<br>- `vendorid_ep_vf`: Optional[str] — Expected endpoint VF vendor ID.<br>- `devid_ep`: Optional[str] — Expected endpoint device ID.<br>- `devid_ep_vf`: Optional[str] — Expected endpoint VF device ID.<br>- `sku_name`: Optional[str] — Expected SKU name string for GPU.<br>- `expected_xgmi_speed`: Optional[list[float]] — Expected xGMI speed value(s) (e.g. link rate).<br>- `analysis_range_start`: Optional[datetime.datetime] — Start of time range for time-windowed analysis.<br>- `analysis_range_end`: Optional[datetime.datetime] — End of time range for time-windowed analysis. | **Collection Args:**<br>- `cper_file_path`: Optional[str] — Path to CPER folder or file for RAS AFID collection (ras --afid --cper-file). | [AmdSmiDataModel](#AmdSmiDataModel-Model) | [AmdSmiCollector](#Collector-Class-AmdSmiCollector) | [AmdSmiAnalyzer](#Data-Analyzer-Class-AmdSmiAnalyzer) |
 | BiosPlugin | sh -c 'cat /sys/devices/virtual/dmi/id/bios_version'<br>wmic bios get SMBIOSBIOSVersion /Value | **Analyzer Args:**<br>- `exp_bios_version`: list[str] — Expected BIOS version(s) to match against collected value (str or list).<br>- `regex_match`: bool — If True, match exp_bios_version as regex; otherwise exact match. | - | [BiosDataModel](#BiosDataModel-Model) | [BiosCollector](#Collector-Class-BiosCollector) | [BiosAnalyzer](#Data-Analyzer-Class-BiosAnalyzer) |
 | CmdlinePlugin | cat /proc/cmdline | **Analyzer Args:**<br>- `required_cmdline`: Union[str, List] — Command-line parameters that must be present (e.g. 'pci=bfsort').<br>- `banned_cmdline`: Union[str, List] — Command-line parameters that must not be present.<br>- `os_overrides`: Dict[str, nodescraper.plugins.inband.cmdline.cmdlineconfig.OverrideConfig] — Per-OS overrides for required_cmdline and banned_cmdline (keyed by OS identifier).<br>- `platform_overrides`: Dict[str, nodescraper.plugins.inband.cmdline.cmdlineconfig.OverrideConfig] — Per-platform overrides for required_cmdline and banned_cmdline (keyed by platform). | - | [CmdlineDataModel](#CmdlineDataModel-Model) | [CmdlineCollector](#Collector-Class-CmdlineCollector) | [CmdlineAnalyzer](#Data-Analyzer-Class-CmdlineAnalyzer) |
 | DeviceEnumerationPlugin | powershell -Command "(Get-WmiObject -Class Win32_Processor &#124; Measure-Object).Count"<br>lspci -d {vendorid_ep}: &#124; grep -i 'VGA\&#124;Display\&#124;3D' &#124; wc -l<br>powershell -Command "(wmic path win32_VideoController get name &#124; findstr AMD &#124; Measure-Object).Count"<br>lscpu<br>lshw<br>lspci -d {vendorid_ep}: &#124; grep -i 'Virtual Function' &#124; wc -l<br>powershell -Command "(Get-VMHostPartitionableGpu &#124; Measure-Object).Count" | **Analyzer Args:**<br>- `cpu_count`: Optional[list[int]] — Expected CPU count(s); pass as int or list of ints. Analysis passes if actual is in list.<br>- `gpu_count`: Optional[list[int]] — Expected GPU count(s); pass as int or list of ints. Analysis passes if actual is in list.<br>- `vf_count`: Optional[list[int]] — Expected virtual function count(s); pass as int or list of ints. Analysis passes if actual is in list. | - | [DeviceEnumerationDataModel](#DeviceEnumerationDataModel-Model) | [DeviceEnumerationCollector](#Collector-Class-DeviceEnumerationCollector) | [DeviceEnumerationAnalyzer](#Data-Analyzer-Class-DeviceEnumerationAnalyzer) |
@@ -1843,7 +1843,7 @@ Check sysctl matches expected sysctl details
 - **expected_driver_version**: `Optional[str]` — Expected AMD driver version string.
 - **expected_memory_partition_mode**: `Optional[str]` — Expected memory partition mode (e.g. sp3, dp).
 - **expected_compute_partition_mode**: `Optional[str]` — Expected compute partition mode.
-- **expected_pldm_version**: `Optional[str]` — Expected PLDM version string.
+- **expected_firmware_versions**: `Optional[dict[str, str]]` — Expected firmware versions keyed by amd-smi fw_id (e.g. PLDM_BUNDLE).
 - **l0_to_recovery_count_error_threshold**: `Optional[int]` — L0-to-recovery count above which an error is raised.
 - **l0_to_recovery_count_warning_threshold**: `Optional[int]` — L0-to-recovery count above which a warning is raised.
 - **vendorid_ep**: `Optional[str]` — Expected endpoint vendor ID (e.g. for PCIe).
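
The documentation change replaces the single-string `expected_pldm_version` with a dict keyed by `fw_id`. For existing configurations, the old value would presumably become a one-entry dict under `PLDM_BUNDLE`, the ID the removed check compared against. The following is a minimal sketch of that translation, assuming analyzer args are held in a plain dict; the helper name `migrate_pldm_arg` and the version string `"1.2.3"` are hypothetical and not part of nodescraper.

```python
from typing import Any


def migrate_pldm_arg(analysis_args: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of analysis_args with the old PLDM key folded into the new dict.

    Hypothetical helper for illustration; nodescraper does not ship it.
    """
    migrated = dict(analysis_args)
    old_value = migrated.pop("expected_pldm_version", None)
    if old_value is not None:
        versions = dict(migrated.get("expected_firmware_versions") or {})
        # Keep an explicit new-style entry if one is already present.
        versions.setdefault("PLDM_BUNDLE", old_value)
        migrated["expected_firmware_versions"] = versions
    return migrated


# Illustrative values only.
old_args = {"check_static_data": True, "expected_pldm_version": "1.2.3"}
new_args = migrate_pldm_arg(old_args)
```

The input dict is left untouched, so the old and new forms can be compared side by side during a migration.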

nodescraper/plugins/inband/amdsmi/amdsmi_analyzer.py

Lines changed: 33 additions & 29 deletions
@@ -534,18 +534,14 @@ def _format_static_mismatch_payload(
             "per_gpu": per_gpu_list,
         }
 
-    def check_pldm_version(
+    def check_firmware_versions(
         self,
         amdsmi_fw_data: Optional[list[Fw]],
-        expected_pldm_version: Optional[str],
-    ):
-        """Check expected pldm version
-
-        Args:
-            amdsmi_fw_data (Optional[list[Fw]]): data model
-            expected_pldm_version (Optional[str]): expected pldm version
-        """
-        PLDM_STRING = "PLDM_BUNDLE"
+        expected_firmware_versions: dict[str, str],
+    ) -> None:
+        """Check that each GPU reports the expected version for each ``fw_id``."""
+        if not expected_firmware_versions:
+            return
         if amdsmi_fw_data is None or len(amdsmi_fw_data) == 0:
             self._log_event(
                 category=EventCategory.PLATFORM,
@@ -554,30 +550,37 @@ def check_pldm_version(
                 data={"amdsmi_fw_data": amdsmi_fw_data},
             )
             return
-        mismatched_gpus: list[int] = []
-        pldm_missing_gpus: list[int] = []
+        mismatches: list[dict[str, object]] = []
+        missing: list[dict[str, object]] = []
         for fw_data in amdsmi_fw_data:
             gpu = fw_data.gpu
             if isinstance(fw_data.fw_list, str):
-                pldm_missing_gpus.append(gpu)
+                for fw_id in expected_firmware_versions:
+                    missing.append({"gpu": gpu, "fw_id": fw_id})
                 continue
-            for fw_info in fw_data.fw_list:
-                if PLDM_STRING == fw_info.fw_id and expected_pldm_version != fw_info.fw_version:
-                    mismatched_gpus.append(gpu)
-                if PLDM_STRING == fw_info.fw_id:
-                    break
-            else:
-                pldm_missing_gpus.append(gpu)
+            actual_by_id = {item.fw_id: item.fw_version for item in fw_data.fw_list}
+            for fw_id, expected_ver in expected_firmware_versions.items():
+                if fw_id not in actual_by_id:
+                    missing.append({"gpu": gpu, "fw_id": fw_id})
+                elif actual_by_id[fw_id] != expected_ver:
+                    mismatches.append(
+                        {
+                            "gpu": gpu,
+                            "fw_id": fw_id,
+                            "expected": expected_ver,
+                            "actual": actual_by_id[fw_id],
+                        }
+                    )
 
-        if mismatched_gpus or pldm_missing_gpus:
+        if mismatches or missing:
             self._log_event(
                 category=EventCategory.FW,
-                description="PLDM Version Mismatch",
+                description="Firmware version mismatch",
                 priority=EventPriority.ERROR,
                 data={
-                    "mismatched_gpus": mismatched_gpus,
-                    "pldm_missing_gpus": pldm_missing_gpus,
-                    "expected_pldm_version": expected_pldm_version,
+                    "expected_firmware_versions": expected_firmware_versions,
+                    "mismatches": mismatches,
+                    "missing": missing,
                 },
             )
 
@@ -661,8 +664,9 @@ def check_expected_xgmi_link_speed(
         if expected_xgmi_speed is None or len(expected_xgmi_speed) == 0:
             self._log_event(
                 category=EventCategory.IO,
-                description="Expected XGMI speed not configured, skipping XGMI link speed check",
-                priority=EventPriority.WARNING,
+                description=("Expected XGMI link speed not set; skipping XGMI link speed analysis"),
+                priority=EventPriority.INFO,
+                console_log=True,
             )
             return
 
@@ -778,8 +782,8 @@ def analyze_data(
                 args.expected_compute_partition_mode,
             )
 
-        if args.expected_pldm_version:
-            self.check_pldm_version(data.firmware, args.expected_pldm_version)
+        if args.expected_firmware_versions:
+            self.check_firmware_versions(data.firmware, args.expected_firmware_versions)
 
         if data.cper_data:
             self.analyzer_cpers(
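
The comparison that `check_firmware_versions` performs can be tried in isolation. The sketch below mirrors the loop from the diff, under the assumption that plain dataclasses can stand in for the `Fw` and `FwListItem` models; the names `FwItem`, `GpuFw`, and `compare_firmware`, and all version strings, are hypothetical. It returns the two lists instead of logging an event.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class FwItem:  # hypothetical stand-in for FwListItem
    fw_id: str
    fw_version: str


@dataclass
class GpuFw:  # hypothetical stand-in for Fw
    gpu: int
    fw_list: Union[list[FwItem], str]  # a str means no parsed firmware list


def compare_firmware(
    fw_data: list[GpuFw], expected: dict[str, str]
) -> tuple[list[dict[str, object]], list[dict[str, object]]]:
    """Return (mismatches, missing) for every GPU and expected fw_id."""
    mismatches: list[dict[str, object]] = []
    missing: list[dict[str, object]] = []
    for entry in fw_data:
        if isinstance(entry.fw_list, str):
            # No parsed firmware list: every expected ID counts as missing.
            missing.extend({"gpu": entry.gpu, "fw_id": fw_id} for fw_id in expected)
            continue
        actual_by_id = {item.fw_id: item.fw_version for item in entry.fw_list}
        for fw_id, expected_ver in expected.items():
            if fw_id not in actual_by_id:
                missing.append({"gpu": entry.gpu, "fw_id": fw_id})
            elif actual_by_id[fw_id] != expected_ver:
                mismatches.append(
                    {
                        "gpu": entry.gpu,
                        "fw_id": fw_id,
                        "expected": expected_ver,
                        "actual": actual_by_id[fw_id],
                    }
                )
    return mismatches, missing


# Illustrative data only; IDs and versions are made up for the example.
expected = {"PLDM_BUNDLE": "1.0", "SMU": "2.0"}
fw_data = [
    GpuFw(gpu=0, fw_list=[FwItem("PLDM_BUNDLE", "1.0"), FwItem("SMU", "2.1")]),
    GpuFw(gpu=1, fw_list="N/A"),
    GpuFw(gpu=2, fw_list=[FwItem("SMU", "2.0")]),
]
mismatches, missing = compare_firmware(fw_data, expected)
```

Unlike the removed PLDM-only check, which recorded only GPU indices, each entry here names both the GPU and the firmware ID, so one GPU can appear several times when multiple IDs are expected.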

nodescraper/plugins/inband/amdsmi/amdsmi_collector.py

Lines changed: 8 additions & 2 deletions
@@ -475,7 +475,8 @@ def _get_amdsmi_data(
             return None
 
         try:
-            return AmdSmiDataModel(
+            fw_ids = args.analysis_firmware_ids if args and args.analysis_firmware_ids else None
+            base = AmdSmiDataModel(
                 version=version,
                 gpu_list=gpu_list,
                 process=processes,
@@ -489,7 +490,10 @@ def _get_amdsmi_data(
                 xgmi_link=xgmi_link or [],
                 cper_data=cper_data,
                 cper_afids=cper_afids,
+                analysis_firmware_ids=fw_ids,
+                analysis_ref=None,
             )
+            return base.model_copy(update={"analysis_ref": base.build_analysis_ref()})
         except ValidationError as err:
             self.logger.warning("Validation err: %s", err)
             self._log_event(
@@ -763,7 +767,9 @@ def get_firmware(self) -> Optional[list[Fw]]:
             normalized: list[FwListItem] = []
             for e in fw_list_raw:
                 if isinstance(e, dict):
-                    fid = e.get("fw_name")
+                    fid = e.get("fw_id")
+                    if fid is None:
+                        fid = e.get("fw_name")
                     ver = e.get("fw_version")
                     normalized.append(
                         FwListItem(