Commit 9e838c5

Added S.M.A.R.T. inventory
This module collects S.M.A.R.T. data from storage devices and exposes it as inventory attributes in Mission Portal. It monitors drive health status, temperature, power-on hours, and NVMe-specific metrics.

Ticket: CFE-4653
1 parent 9eff4e3 commit 9e838c5

4 files changed

Lines changed: 366 additions & 0 deletions

cfbs.json

Lines changed: 10 additions & 0 deletions
@@ -196,6 +196,16 @@
         "bundles inventory_fde:main"
       ]
     },
+    "inventory-smartctl": {
+      "description": "Inventory SMART drive health, temperature, and wear data.",
+      "tags": ["inventory", "monitoring", "hardware", "storage"],
+      "subdirectory": "inventory/inventory-smartctl",
+      "steps": [
+        "copy policy.cf services/cfbs/modules/inventory-smartctl/policy.cf",
+        "policy_files services/cfbs/modules/inventory-smartctl/policy.cf",
+        "bundles inventory_smartctl:main"
+      ]
+    },
     "library-for-promise-types-in-bash": {
       "description": "Library enabling promise types implemented in bash.",
       "subdirectory": "libraries/bash",
Lines changed: 129 additions & 0 deletions
@@ -0,0 +1,129 @@
Inventory module for collecting SMART drive health, temperature, and wear data via smartctl.

## Description

This module collects S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology) data from storage devices and exposes it as inventory attributes in CFEngine Mission Portal. It monitors drive health status, temperature, power-on hours, and NVMe-specific metrics.

SMART data helps predict drive failures before they occur and provides visibility into storage device health across your infrastructure.

## Requirements

- **Platform:** Linux only (currently)
- **Binary:** `smartctl` from smartmontools package (version 7.0+ for JSON support)
- **Permissions:** Requires root to read SMART data from devices

### Installation

Add to your policy via cfbs:

```bash
cfbs add inventory-smartctl
cfbs install
```

Or include directly in your policy:

```cfengine
bundle agent main
{
  methods:
    "smartctl" usebundle => inventory_smartctl:main;
}
```

## Inventory Attributes

The following attributes are exposed in Mission Portal:

### Universal Attributes (all drive types)

- **SMART drive health** - Per-drive health status
  - Values: `PASSED`, `FAILED`, `SMARTCTL_MISSING`
  - Example: `/dev/sda: PASSED`, `/dev/nvme0: FAILED`
  - `SMARTCTL_MISSING`: Indicates smartctl is not installed on the system
  - Critical: A FAILED status indicates the drive is predicting imminent failure

- **SMART drive model** - Drive model identifier
  - Example: `/dev/sda: Samsung SSD 870 EVO`

- **SMART drive temperatures (C)** - Current temperature in Celsius
  - Example: `/dev/sda: 35 C`
  - Note: Not available for virtual disks

- **SMART drive power-on hours** - Cumulative runtime in hours
  - Example: `/dev/sda: 8742 h`
  - Useful for tracking drive age and warranty coverage

### NVMe-Specific Attributes

- **SMART NVMe available spare** - Remaining spare blocks (%)
  - Example: `/dev/nvme0: 100%`
  - Low values (<10%) indicate wear approaching end of life

- **SMART NVMe percentage used** - Drive life consumed (%)
  - Example: `/dev/nvme0: 5%`
  - Based on manufacturer's endurance rating

- **SMART NVMe media errors** - Uncorrectable media errors count
  - Example: `/dev/nvme0: 0`
  - Any non-zero value indicates data integrity issues

### Alert Attributes

- **SMART failed drives** - List of drives with FAILED health status
  - Only present when one or more drives are failing
  - Use for alerting and automated response
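
The universal attributes above are read from the `smart_status.passed`, `model_name`, `temperature.current`, and `power_on_time.hours` keys of the smartctl JSON output (the NVMe attributes come from `nvme_smart_health_information_log`). As a rough sketch for inspecting a JSON file by hand, assuming `jq` is installed (the module itself does not use it) and with `smart_summary` being a hypothetical helper name:

```sh
# Hypothetical helper, not part of the module: print health, model,
# temperature and power-on hours from one smartctl JSON file, tab-separated.
# Assumes jq is available; keys that are absent are shown as "-".
smart_summary() {
    jq -r '
      [ (if .smart_status.passed == true then "PASSED"
         elif .smart_status.passed == false then "FAILED"
         else "-" end),
        (.model_name // "-"),
        ((.temperature.current // "-") | tostring),
        ((.power_on_time.hours // "-") | tostring)
      ] | join("\t")' "$1"
}
```

For example, `smart_summary /var/cfengine/state/inventory_smartctl__dev_sda.json` would summarize the cached data for `/dev/sda`, provided that cache file exists on the host.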

## Troubleshooting

### SMARTCTL_MISSING appears in inventory

The module reports `SMARTCTL_MISSING` when it finds no executable `smartctl` at `/usr/sbin/smartctl` or `/sbin/smartctl`. To resolve:

**Install smartmontools package:**

```sh
# Debian/Ubuntu
apt-get install smartmontools

# RHEL/CentOS/Fedora
yum install smartmontools

# SUSE
zypper install smartmontools
```

**Verify installation** (the policy only checks the two paths above, so a binary elsewhere on `PATH` is not picked up):

```sh
command -v smartctl
smartctl --version
```
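
Because JSON output needs smartmontools 7.0 or later, it can help to check the reported version as well. The following is a minimal sketch, assuming the first line of `smartctl --version` has the form `smartctl 7.2 2020-12-30 r5155 ...`; `supports_json` is a hypothetical helper name:

```sh
# Hypothetical helper, not part of the module: reads `smartctl --version`
# output on stdin and succeeds when the major version on line 1 is >= 7.
# Fails on empty input, since no version could be read.
supports_json() {
    awk 'NR == 1 { split($2, v, "."); ok = (v[1] + 0 >= 7) } END { exit !ok }'
}
```

Usage: `smartctl --version | supports_json && echo "JSON output supported"`.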

### No inventory data appears

If smartctl is installed but no data appears:

**Check if drives are detected:**

```sh
smartctl --scan
```
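
The policy takes the first field of each scan line as the device path and builds the cache file name from it with `canonify()`. The sketch below imitates both steps so you can predict which cache file belongs to which drive. The sample scan line and the `tr`-based stand-in for `canonify()` are assumptions for illustration, and both function names are hypothetical:

```sh
# Print the device path (first whitespace-separated field) of each
# non-empty scan line read from stdin.
scan_devices() {
    awk 'NF { print $1 }'
}

# Rough stand-in for CFEngine's canonify(): every character outside
# [A-Za-z0-9_] becomes "_", then the module's file name pattern is applied.
cache_name() {
    printf 'inventory_smartctl_%s.json\n' \
        "$(printf '%s' "$1" | tr -c 'A-Za-z0-9_' '_')"
}

# Sample input instead of real hardware:
printf '%s\n' '/dev/sda -d scsi # /dev/sda, SCSI device' | scan_devices
# prints /dev/sda
cache_name /dev/sda
# prints inventory_smartctl__dev_sda.json
```

On a real host, `smartctl --scan | scan_devices` lists the devices the policy will try to query.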

**Check cache files:**

```sh
ls -lh /var/cfengine/state/inventory_smartctl_*.json
```
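
The policy treats a cache file as stale once it is older than its `_cache_ttl` of 3600 seconds. As a rough sketch, assuming the state directory shown above and a `find` that supports `-maxdepth` and `-mmin` (GNU findutils), the hypothetical helper below lists the cache files that the next agent run should regenerate:

```sh
# Hypothetical helper, not part of the module: print cache files in a state
# directory whose modification time is more than ttl_minutes in the past.
stale_caches() {
    state_dir="${1:-/var/cfengine/state}"   # default matches the ls example
    ttl_minutes="${2:-60}"                  # mirrors the 3600-second TTL
    find "$state_dir" -maxdepth 1 -type f \
        -name 'inventory_smartctl_*.json' -mmin "+$ttl_minutes"
}
```

If `stale_caches` keeps listing the same files after several agent runs, the refresh step is probably not running for those drives.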

**Run with verbose mode:**

```sh
cf-agent -Kvf ./policy.cf
```

## See Also

- [CFEngine inventory tutorial](https://docs.cfengine.com/docs/lts/examples/tutorials/custom_inventory/)
- [CFEngine Masterfiles inventory policy](https://docs.cfengine.com/docs/lts/reference/masterfiles-policy-framework/inventory/)
- [smartmontools documentation](https://www.smartmontools.org/)
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
{
  "name": "inventory-smartctl",
  "description": "Inventory SMART drive health, temperature, and wear data",
  "tags": ["inventory", "monitoring", "hardware", "storage", "smartctl"],
  "version": "0.1.0",
  "steps": [
    "copy ./policy.cf services/inventory/smartctl.cf"
  ],
  "dependencies": [],
  "subdirectory": "inventory/inventory-smartctl"
}
Lines changed: 216 additions & 0 deletions
@@ -0,0 +1,216 @@
body file control
{
  namespace => "inventory_smartctl";
}

bundle agent main
# @brief Inventory SMART drive health, temperature, and wear data via smartctl JSON
#
# Requires smartmontools >= 7.0 (for JSON output support).
# Runs on Linux only; on other platforms it defines no inventory and only
# prints a notice in verbose mode.
#
# Each drive's JSON output is cached on disk and parsed by the parse bundle below.
#
# Attributes exposed in Mission Portal:
# @inventory SMART drive health - Per-drive PASSED/FAILED, or SMARTCTL_MISSING
# @inventory SMART drive model - Drive model per device
# @inventory SMART drive temperatures (C) - Current temperature in Celsius
# @inventory SMART drive power-on hours - Cumulative runtime in hours
# @inventory SMART NVMe available spare - Remaining spare blocks (%), NVMe only
# @inventory SMART NVMe percentage used - Drive life consumed (%), NVMe only
# @inventory SMART NVMe media errors - Uncorrectable media errors, NVMe only
# @inventory SMART failed drives - Only present on hosts with a failing drive
{
  vars:
    linux::
      "_smartctl" string => ifelse(
        fileexists("/usr/sbin/smartctl"), "/usr/sbin/smartctl",
        fileexists("/sbin/smartctl"), "/sbin/smartctl",
        "/usr/sbin/smartctl" # default fallback
      );
      "_sdir" string => "$(sys.statedir)";
      "_cache_ttl" string => "3600"; # 1 hour

      # Enumerate drives - extract first field from each line of smartctl --scan
      "_scan_lines"
        slist => splitstring(
          execresult("$(_smartctl) --scan 2>/dev/null", "useshell"),
          "\n", 32);

      "_drives"
        slist => maplist(regex_replace("$(this)", "^(\S+).*", "\1", ""), "_scan_lines");

      "_id[${_drives}]" string => canonify("${_drives}");
      "_cache[${_drives}]" string => "$(_sdir)/inventory_smartctl_${_id[${_drives}]}.json";

      # Modification time of each existing cache file, read into a variable
      # so that eval() below receives two plain numbers.
      "_mtime[${_drives}]"
        string => filestat("${_cache[${_drives}]}", "mtime"),
        if => fileexists("${_cache[${_drives}]}");

  classes:
    linux::
      "_have_smartctl" expression => isexecutable("$(_smartctl)");

      # Cache file is missing - needs refresh
      "_cache_missing_${_id[${_drives}]}"
        not => fileexists("${_cache[${_drives}]}");

      # Cache file is stale (older than _cache_ttl seconds) - needs refresh
      "_cache_stale_${_id[${_drives}]}"
        expression => isgreaterthan(
          eval("$(sys.systime) - $(_mtime[${_drives}])", "math", "infix"),
          "$(_cache_ttl)"),
        if => fileexists("${_cache[${_drives}]}");

      # Refresh if missing or stale
      "_refresh_${_id[${_drives}]}"
        or => {
          "_cache_missing_${_id[${_drives}]}",
          "_cache_stale_${_id[${_drives}]}"
        };

  files:
    linux._have_smartctl::
      "${_cache[${_drives}]}"
        content => execresult("$(_smartctl) -j -a ${_drives}", "noshell", "stdout"),
        if => "_refresh_${_id[${_drives}]}";

  methods:
    linux._have_smartctl::
      # Call parsing bundle for each drive (only when cache exists)
      "parse_${_id[${_drives}]}"
        usebundle => parse("${_drives}", "${_cache[${_drives}]}"),
        useresult => "_d_${_id[${_drives}]}",
        if => fileexists("${_cache[${_drives}]}");

  vars:
    linux._have_smartctl::
      # Collect results from sub-bundles into formatted entries
      "_health_entries[${_drives}]"
        string => "${_drives}: ${_d_${_id[${_drives}]}[health]}",
        if => isvariable("_d_${_id[${_drives}]}[health]");

      "_model_entries[${_drives}]"
        string => "${_drives}: ${_d_${_id[${_drives}]}[model]}",
        if => isvariable("_d_${_id[${_drives}]}[model]");

      "_temp_entries[${_drives}]"
        string => "${_drives}: ${_d_${_id[${_drives}]}[temp]} C",
        if => isvariable("_d_${_id[${_drives}]}[temp]");

      "_hours_entries[${_drives}]"
        string => "${_drives}: ${_d_${_id[${_drives}]}[hours]} h",
        if => isvariable("_d_${_id[${_drives}]}[hours]");

      "_nvme_spare_entries[${_drives}]"
        string => "${_drives}: ${_d_${_id[${_drives}]}[nvme_spare]}%",
        if => isvariable("_d_${_id[${_drives}]}[nvme_spare]");

      "_nvme_pct_used_entries[${_drives}]"
        string => "${_drives}: ${_d_${_id[${_drives}]}[nvme_pct_used]}%",
        if => isvariable("_d_${_id[${_drives}]}[nvme_pct_used]");

      "_nvme_media_errors_entries[${_drives}]"
        string => "${_drives}: ${_d_${_id[${_drives}]}[nvme_media_errors]}",
        if => isvariable("_d_${_id[${_drives}]}[nvme_media_errors]");

      "_failed_entries[${_drives}]"
        string => "${_drives}",
        if => strcmp("${_d_${_id[${_drives}]}[health]}", "FAILED");

      # Inventory attributes (visible in Mission Portal)
      "drive_health"
        slist => getvalues(_health_entries),
        meta => { "inventory", "attribute_name=SMART drive health" };

      "drive_model"
        slist => getvalues(_model_entries),
        meta => { "inventory", "attribute_name=SMART drive model" };

      "drive_temperatures"
        slist => getvalues(_temp_entries),
        meta => { "inventory", "attribute_name=SMART drive temperatures (C)" };

      "drive_power_on_hours"
        slist => getvalues(_hours_entries),
        meta => { "inventory", "attribute_name=SMART drive power-on hours" };

      "nvme_available_spare"
        slist => getvalues(_nvme_spare_entries),
        meta => { "inventory", "attribute_name=SMART NVMe available spare" };

      "nvme_percentage_used"
        slist => getvalues(_nvme_pct_used_entries),
        meta => { "inventory", "attribute_name=SMART NVMe percentage used" };

      "nvme_media_errors"
        slist => getvalues(_nvme_media_errors_entries),
        meta => { "inventory", "attribute_name=SMART NVMe media errors" };

      "failed_drives"
        slist => getvalues(_failed_entries),
        meta => { "inventory", "attribute_name=SMART failed drives" };

    linux.!_have_smartctl::
      "drive_health"
        string => "SMARTCTL_MISSING",
        meta => { "inventory", "attribute_name=SMART drive health" };

  reports:
    linux._have_smartctl.verbose_mode::
      "inventory_smartctl: monitoring ${_drives}";
      "inventory_smartctl: ${_drives} health=${_d_${_id[${_drives}]}[health]}"
        if => isvariable("_d_${_id[${_drives}]}[health]");

    !linux.verbose_mode::
      "$(this.promise_filename): inventory_smartctl is Linux-only.";
}

bundle agent parse(drive, cache_file)
# @brief Parse smartctl JSON and return key metrics via bundle_return_value_index
{
  vars:
      "_json" data => readjson("$(cache_file)");

      # Extract metrics directly from JSON
      "_health"
        string => ifelse(strcmp("${_json[smart_status][passed]}", "true"), "PASSED", "FAILED"),
        if => isvariable("_json[smart_status][passed]");

      "_model"
        string => "${_json[model_name]}",
        if => isvariable("_json[model_name]");

      "_temp"
        string => "${_json[temperature][current]}",
        if => isvariable("_json[temperature][current]");

      "_hours"
        string => "${_json[power_on_time][hours]}",
        if => isvariable("_json[power_on_time][hours]");

      "_nvme_spare"
        string => "${_json[nvme_smart_health_information_log][available_spare]}",
        if => isvariable("_json[nvme_smart_health_information_log][available_spare]");

      "_nvme_pct_used"
        string => "${_json[nvme_smart_health_information_log][percentage_used]}",
        if => isvariable("_json[nvme_smart_health_information_log][percentage_used]");

      "_nvme_media_errors"
        string => "${_json[nvme_smart_health_information_log][media_errors]}",
        if => isvariable("_json[nvme_smart_health_information_log][media_errors]");

  reports:
      "$(_health)" bundle_return_value_index => "health";
      "$(_model)" bundle_return_value_index => "model";
      "$(_temp)" bundle_return_value_index => "temp";
      "$(_hours)" bundle_return_value_index => "hours";
      "$(_nvme_spare)" bundle_return_value_index => "nvme_spare";
      "$(_nvme_pct_used)" bundle_return_value_index => "nvme_pct_used";
      "$(_nvme_media_errors)" bundle_return_value_index => "nvme_media_errors";
}

body file control { namespace => "default"; }

bundle agent __main__
{
  methods:
      "inventory_smartctl:main";
}
