Commit 4414bd0

Added S.M.A.R.T. inventory
This module collects S.M.A.R.T. data from storage devices and exposes it as inventory attributes in Mission Portal. It monitors drive health status, temperature, power-on hours, and NVMe-specific metrics. Ticket: CFE-4653
1 parent 9eff4e3 commit 4414bd0

4 files changed

Lines changed: 368 additions & 0 deletions

cfbs.json

Lines changed: 10 additions & 0 deletions
@@ -196,6 +196,16 @@
         "bundles inventory_fde:main"
       ]
     },
+    "inventory-smartctl": {
+      "description": "Inventory SMART drive health, temperature, and wear data.",
+      "tags": ["inventory", "monitoring", "hardware", "storage"],
+      "subdirectory": "inventory/inventory-smartctl",
+      "steps": [
+        "copy policy.cf services/cfbs/modules/inventory-smartctl/policy.cf",
+        "policy_files services/cfbs/modules/inventory-smartctl/policy.cf",
+        "bundles inventory_smartctl:main"
+      ]
+    },
     "library-for-promise-types-in-bash": {
       "description": "Library enabling promise types implemented in bash.",
       "subdirectory": "libraries/bash",
Lines changed: 131 additions & 0 deletions
@@ -0,0 +1,131 @@
# inventory-smartctl

Inventory module for collecting SMART drive health, temperature, and wear data via smartctl.

## Description

This module collects S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology) data from storage devices and exposes it as inventory attributes in CFEngine Mission Portal. It monitors drive health status, temperature, power-on hours, and NVMe-specific metrics.

SMART data helps predict drive failures before they occur and provides visibility into storage device health across your infrastructure.

## Requirements

- **Platform:** Linux only (currently)
- **Binary:** `smartctl` from smartmontools package (version 7.0+ for JSON support)
- **Permissions:** Requires root to read SMART data from devices
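
The version requirement can be checked before relying on JSON output. The following is a minimal sketch and not part of the module: it assumes the banner printed by `smartctl --version` begins with `smartctl <major>.<minor>`, and the helper name `supports_json` is hypothetical.

```python
import re


def supports_json(version_output):
    """Return True if the reported smartmontools version is 7.0 or newer."""
    # Assumption: the banner begins with "smartctl <major>.<minor>".
    match = re.match(r"smartctl (\d+)\.(\d+)", version_output)
    if match is None:
        return False
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (7, 0)


# Hypothetical banner lines, used only for illustration.
print(supports_json("smartctl 7.2 2020-12-30 r5155 [x86_64-linux] (local build)"))
print(supports_json("smartctl 6.6 2017-11-05 r4594 [x86_64-linux] (local build)"))
```

In practice the banner text would come from running `smartctl --version` as root on the host being checked.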

### Installation

Add to your policy via cfbs:

```bash
cfbs add inventory-smartctl
cfbs install
```

Or include directly in your policy:

```cfengine
bundle agent main
{
  methods:
    "smartctl" usebundle => inventory_smartctl:main;
}
```

## Inventory Attributes

The following attributes are exposed in Mission Portal:

### Universal Attributes (all drive types)

- **SMART drive health** - Per-drive health status
  - Values: `PASSED`, `FAILED`, `SMARTCTL_MISSING`
  - Example: `/dev/sda: PASSED`, `/dev/nvme0: FAILED`
  - `SMARTCTL_MISSING`: Indicates smartctl is not installed on the system
  - Critical: A FAILED status indicates the drive is predicting imminent failure

- **SMART drive model** - Drive model identifier
  - Example: `/dev/sda: Samsung SSD 870 EVO`

- **SMART drive temperatures (C)** - Current temperature in Celsius
  - Example: `/dev/sda: 35 C`
  - Note: Not available for virtual disks

- **SMART drive power-on hours** - Cumulative runtime in hours
  - Example: `/dev/sda: 8742 h`
  - Useful for tracking drive age and warranty coverage

### NVMe-Specific Attributes

- **SMART NVMe available spare** - Remaining spare blocks (%)
  - Example: `/dev/nvme0: 100%`
  - Low values (<10%) indicate wear approaching end of life

- **SMART NVMe percentage used** - Drive life consumed (%)
  - Example: `/dev/nvme0: 5%`
  - Based on manufacturer's endurance rating

- **SMART NVMe media errors** - Uncorrectable media errors count
  - Example: `/dev/nvme0: 0`
  - Any non-zero value indicates data integrity issues
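
The two thresholds mentioned in the NVMe list (available spare below 10%, any non-zero media error count) could be turned into a simple check. This is a sketch under those stated thresholds only; the function name `nvme_warnings` and the message wording are hypothetical, and the module itself does not emit such warnings.

```python
SPARE_WARN_PERCENT = 10  # threshold quoted in the attribute list above


def nvme_warnings(available_spare, media_errors):
    """Return human-readable warnings for one NVMe drive."""
    warnings = []
    if available_spare < SPARE_WARN_PERCENT:
        warnings.append(f"available spare low: {available_spare}%")
    if media_errors != 0:
        warnings.append(f"media errors reported: {media_errors}")
    return warnings


# Hypothetical readings, used only for illustration.
print(nvme_warnings(available_spare=100, media_errors=0))
print(nvme_warnings(available_spare=5, media_errors=2))
```

No threshold is applied to the percentage-used attribute because the text above does not state one.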

### Alert Attributes

- **SMART failed drives** - List of drives with FAILED health status
  - Only present when one or more drives are failing
  - Use for alerting and automated response
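
To make the mapping from smartctl's JSON report to these attributes concrete, here is a minimal sketch. The JSON field names (`smart_status.passed`, `model_name`, `temperature.current`, `power_on_time.hours`, `nvme_smart_health_information_log`) are the ones the module's policy reads; the function names, dictionary keys, and sample values are hypothetical and only illustrate the idea.

```python
import json


def extract_metrics(report):
    """Pick the fields behind the inventory attributes out of one JSON report."""
    metrics = {}
    passed = report.get("smart_status", {}).get("passed")
    if passed is not None:
        metrics["health"] = "PASSED" if passed else "FAILED"
    if "model_name" in report:
        metrics["model"] = report["model_name"]
    if "current" in report.get("temperature", {}):
        metrics["temp"] = report["temperature"]["current"]
    if "hours" in report.get("power_on_time", {}):
        metrics["hours"] = report["power_on_time"]["hours"]
    nvme = report.get("nvme_smart_health_information_log", {})
    for key in ("available_spare", "percentage_used", "media_errors"):
        if key in nvme:
            metrics["nvme_" + key] = nvme[key]
    return metrics


def failed_drives(reports):
    """List the drives whose health is FAILED (the alert attribute)."""
    return [drive for drive, report in reports.items()
            if extract_metrics(report).get("health") == "FAILED"]


# Hypothetical, heavily trimmed reports for two drives.
reports = {
    "/dev/sda": json.loads(
        '{"model_name": "Samsung SSD 870 EVO", "smart_status": {"passed": true},'
        ' "temperature": {"current": 35}, "power_on_time": {"hours": 8742}}'
    ),
    "/dev/nvme0": json.loads(
        '{"smart_status": {"passed": false},'
        ' "nvme_smart_health_information_log": {"available_spare": 100,'
        ' "percentage_used": 5, "media_errors": 0}}'
    ),
}

for drive, report in reports.items():
    print(f"{drive}: {extract_metrics(report)['health']}")
print(failed_drives(reports))
```

Fields absent from a report are skipped rather than defaulted, which matches the attribute list: NVMe metrics appear only for NVMe drives, and the failed-drives list is empty when nothing is failing.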

## Troubleshooting

### SMARTCTL_MISSING appears in inventory

The module reports `SMARTCTL_MISSING` when smartctl is not installed. To resolve:

**Install smartmontools package:**

```sh
# Debian/Ubuntu
apt-get install smartmontools

# RHEL/CentOS/Fedora
yum install smartmontools

# SUSE
zypper install smartmontools
```

**Verify installation:**

```sh
command -v smartctl
smartctl --version
```

### No inventory data appears

If smartctl is installed but no data appears:

**Check if drives are detected:**

```sh
smartctl --scan
```
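
The module takes the first whitespace-separated field of each `smartctl --scan` line as the device name. The sketch below mirrors that step so the scan output can be checked by hand; the function name `drives_from_scan` and the sample lines are hypothetical, and unlike the policy this version skips blank lines.

```python
def drives_from_scan(scan_output):
    """Return the device path (first field) from each non-empty scan line."""
    drives = []
    for line in scan_output.splitlines():
        fields = line.split()
        if fields:
            drives.append(fields[0])
    return drives


# Hypothetical scan output, used only for illustration.
sample = (
    "/dev/sda -d scsi # /dev/sda, SCSI device\n"
    "/dev/nvme0 -d nvme # /dev/nvme0, NVMe device\n"
)
print(drives_from_scan(sample))
```

An empty result here means the module has no drives to inventory, which is one reason no data reaches Mission Portal.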

**Check cache files:**

```sh
ls -lh /var/cfengine/state/inventory_smartctl_*.json
```
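
Each drive gets one cache file named after its canonified device path, and the policy refreshes it when it is missing or older than its one-hour TTL. The sketch below illustrates that naming and staleness rule; the `canonify` helper only approximates CFEngine's `canonify()` by replacing every character other than letters, digits, and underscore, and all function names are hypothetical.

```python
import os
import re
import time

STATE_DIR = "/var/cfengine/state"  # directory used in the ls command above
CACHE_TTL_SECONDS = 3600           # one hour, the TTL set in the module's policy


def canonify(text):
    """Approximate CFEngine canonify(): non-alphanumeric characters become '_'."""
    return re.sub(r"[^A-Za-z0-9_]", "_", text)


def cache_path(drive, state_dir=STATE_DIR):
    """Build the per-drive cache file path."""
    return os.path.join(state_dir, f"inventory_smartctl_{canonify(drive)}.json")


def needs_refresh(path, now=None, ttl=CACHE_TTL_SECONDS):
    """Return True when the cache file is missing or older than the TTL."""
    if not os.path.exists(path):
        return True
    now = time.time() if now is None else now
    return now - os.path.getmtime(path) > ttl


print(cache_path("/dev/sda"))
```

With these assumptions, `/dev/sda` maps to `inventory_smartctl__dev_sda.json` (note the double underscore from the leading slash), which is the kind of name the `ls` glob should match.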

**Run with verbose mode:**

```sh
cf-agent -Kvf ./policy.cf
```

## See Also

- [CFEngine inventory tutorial](https://docs.cfengine.com/docs/lts/examples/tutorials/custom_inventory/)
- [CFEngine Masterfiles inventory policy](https://docs.cfengine.com/docs/lts/reference/masterfiles-policy-framework/inventory/)
- [smartmontools documentation](https://www.smartmontools.org/)
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
{
  "name": "inventory-smartctl",
  "description": "Inventory SMART drive health, temperature, and wear data",
  "tags": ["inventory", "monitoring", "hardware", "storage", "smartctl"],
  "version": "0.1.0",
  "steps": [
    "copy ./policy.cf services/inventory/smartctl.cf"
  ],
  "dependencies": [],
  "subdirectory": "inventory/inventory-smartctl"
}
Lines changed: 216 additions & 0 deletions
@@ -0,0 +1,216 @@
body file control
{
  namespace => "inventory_smartctl";
}

bundle agent main
# @brief Inventory SMART drive health, temperature, and wear data via smartctl JSON
#
# Requires smartmontools >= 7.0 (for JSON output support).
# Runs on Linux only; silently no-ops on other platforms.
#
# Caches each drive's smartctl JSON in the state directory; the parse() sub-bundle reads it.
#
# Attributes exposed in Mission Portal:
# @inventory SMART drive health - Per-drive PASSED/FAILED
# @inventory SMART drive model - Drive model per device
# @inventory SMART drive temperatures - Current temperature in Celsius
# @inventory SMART drive power-on hours - Cumulative runtime in hours
# @inventory SMART NVMe available spare - Remaining spare blocks (%), NVMe only
# @inventory SMART NVMe percentage used - Drive life consumed (%), NVMe only
# @inventory SMART NVMe media errors - Uncorrectable media errors, NVMe only
# @inventory SMART failed drives - Only present on hosts with a failing drive
{
  vars:
    linux::
      "_smartctl" string => ifelse(
        fileexists("/usr/sbin/smartctl"), "/usr/sbin/smartctl",
        fileexists("/sbin/smartctl"), "/sbin/smartctl",
        "/usr/sbin/smartctl" # default fallback
      );
      "_sdir" string => "$(sys.statedir)";
      "_cache_ttl" string => "3600"; # 1 hour

      # Enumerate drives - extract first field from each line of smartctl --scan
      "_scan_lines"
        slist => splitstring(
          execresult("$(_smartctl) --scan 2>/dev/null", "useshell"),
          "\n", 32);

      "_drives"
        slist => maplist(regex_replace("$(this)", "^(\S+).*", "\1", ""), "_scan_lines");

      "_id[${_drives}]" string => canonify("${_drives}");
      "_cache[${_drives}]" string => "$(_sdir)/inventory_smartctl_${_id[${_drives}]}.json";

  classes:
    linux::
      "_have_smartctl" expression => isexecutable("$(_smartctl)");

      # Cache file is missing - needs refresh
      "_cache_missing_${_id[${_drives}]}"
        not => fileexists("${_cache[${_drives}]}");

      # Cache file is stale - needs refresh
      "_cache_stale_${_id[${_drives}]}"
        expression => isgreaterthan(
          eval("$(sys.systime) - $(filestat(${_cache[${_drives}]}, mtime))"),
          "$(_cache_ttl)"),
        if => fileexists("${_cache[${_drives}]}");

      # Refresh if missing or stale
      "_refresh_${_id[${_drives}]}"
        or => {
          "_cache_missing_${_id[${_drives}]}",
          "_cache_stale_${_id[${_drives}]}"
        };

  files:
    linux._have_smartctl::
      "${_cache[${_drives}]}"
        content => execresult("$(_smartctl) -j -a ${_drives}", "noshell", "stdout"),
        if => "_refresh_${_id[${_drives}]}";

  methods:
    linux._have_smartctl::
      # Call parsing bundle for each drive (only when cache exists)
      "parse_${_id[${_drives}]}"
        usebundle => parse("${_drives}", "${_cache[${_drives}]}"),
        useresult => "_d_${_id[${_drives}]}",
        if => fileexists("${_cache[${_drives}]}");

  vars:
    linux._have_smartctl::
      # Collect results from sub-bundles into formatted entries
      "_health_entries[${_drives}]"
        string => "${_drives}: ${_d_${_id[${_drives}]}[health]}",
        if => isvariable("_d_${_id[${_drives}]}[health]");

      "_model_entries[${_drives}]"
        string => "${_drives}: ${_d_${_id[${_drives}]}[model]}",
        if => isvariable("_d_${_id[${_drives}]}[model]");

      "_temp_entries[${_drives}]"
        string => "${_drives}: ${_d_${_id[${_drives}]}[temp]} C",
        if => isvariable("_d_${_id[${_drives}]}[temp]");

      "_hours_entries[${_drives}]"
        string => "${_drives}: ${_d_${_id[${_drives}]}[hours]} h",
        if => isvariable("_d_${_id[${_drives}]}[hours]");

      "_nvme_spare_entries[${_drives}]"
        string => "${_drives}: ${_d_${_id[${_drives}]}[nvme_spare]}%",
        if => isvariable("_d_${_id[${_drives}]}[nvme_spare]");

      "_nvme_pct_used_entries[${_drives}]"
        string => "${_drives}: ${_d_${_id[${_drives}]}[nvme_pct_used]}%",
        if => isvariable("_d_${_id[${_drives}]}[nvme_pct_used]");

      "_nvme_media_errors_entries[${_drives}]"
        string => "${_drives}: ${_d_${_id[${_drives}]}[nvme_media_errors]}",
        if => isvariable("_d_${_id[${_drives}]}[nvme_media_errors]");

      "_failed_entries[${_drives}]"
        string => "${_drives}",
        if => strcmp("${_d_${_id[${_drives}]}[health]}", "FAILED");

      # Inventory attributes (visible in Mission Portal)
      "drive_health"
        slist => getvalues(_health_entries),
        meta => { "inventory", "attribute_name=SMART drive health" };

      "drive_model"
        slist => getvalues(_model_entries),
        meta => { "inventory", "attribute_name=SMART drive model" };

      "drive_temperatures"
        slist => getvalues(_temp_entries),
        meta => { "inventory", "attribute_name=SMART drive temperatures (C)" };

      "drive_power_on_hours"
        slist => getvalues(_hours_entries),
        meta => { "inventory", "attribute_name=SMART drive power-on hours" };

      "nvme_available_spare"
        slist => getvalues(_nvme_spare_entries),
        meta => { "inventory", "attribute_name=SMART NVMe available spare" };

      "nvme_percentage_used"
        slist => getvalues(_nvme_pct_used_entries),
        meta => { "inventory", "attribute_name=SMART NVMe percentage used" };

      "nvme_media_errors"
        slist => getvalues(_nvme_media_errors_entries),
        meta => { "inventory", "attribute_name=SMART NVMe media errors" };

      "failed_drives"
        slist => getvalues(_failed_entries),
        meta => { "inventory", "attribute_name=SMART failed drives" };

    linux.!_have_smartctl::
      "drive_health"
        string => "SMARTCTL_MISSING",
        meta => { "inventory", "attribute_name=SMART drive health" };

  reports:
    linux._have_smartctl.verbose_mode::
      "inventory_smartctl: monitoring ${_drives}";
      "inventory_smartctl: ${_drives} health=${_d_${_id[${_drives}]}[health]}"
        if => isvariable("_d_${_id[${_drives}]}[health]");

    !linux.verbose_mode::
      "$(this.promise_filename): inventory_smartctl is Linux-only.";
}

bundle agent parse(drive, cache_file)
# @brief Parse smartctl JSON and return key metrics via bundle_return_value_index
{
  vars:
      "_json" data => readjson("$(cache_file)");

      # Extract metrics directly from JSON
      "_health"
        string => ifelse(strcmp("${_json[smart_status][passed]}", "true"), "PASSED", "FAILED"),
        if => isvariable("_json[smart_status][passed]");

      "_model"
        string => "${_json[model_name]}",
        if => isvariable("_json[model_name]");

      "_temp"
        string => "${_json[temperature][current]}",
        if => isvariable("_json[temperature][current]");

      "_hours"
        string => "${_json[power_on_time][hours]}",
        if => isvariable("_json[power_on_time][hours]");

      "_nvme_spare"
        string => "${_json[nvme_smart_health_information_log][available_spare]}",
        if => isvariable("_json[nvme_smart_health_information_log][available_spare]");

      "_nvme_pct_used"
        string => "${_json[nvme_smart_health_information_log][percentage_used]}",
        if => isvariable("_json[nvme_smart_health_information_log][percentage_used]");

      "_nvme_media_errors"
        string => "${_json[nvme_smart_health_information_log][media_errors]}",
        if => isvariable("_json[nvme_smart_health_information_log][media_errors]");

  reports:
      "$(_health)" bundle_return_value_index => "health";
      "$(_model)" bundle_return_value_index => "model";
      "$(_temp)" bundle_return_value_index => "temp";
      "$(_hours)" bundle_return_value_index => "hours";
      "$(_nvme_spare)" bundle_return_value_index => "nvme_spare";
      "$(_nvme_pct_used)" bundle_return_value_index => "nvme_pct_used";
      "$(_nvme_media_errors)" bundle_return_value_index => "nvme_media_errors";
}

body file control { namespace => "default"; }

bundle agent __main__
{
  methods:
      "inventory_smartctl:main";
}
