Glen Pitt-Pladdy :: Blog
SMART stats on Cacti (via SNMP)
Update: This was originally one of my first articles on Cacti stats via SNMP, and subsequently I have built an ever growing collection of templates and extension scripts based on the same approaches. Originally this was done as 2-disk templates which where fine for the machines I was working with - my server here is a basic 2-disk setup, so why would I need more. Since that I've worked with all sorts of different disk arrangements and had to fudge things to make useful templates. This update fixes that by bringing things down to two basic templates and switching to indexed SNMP allowing an arbitrary number of disks to be used. As of version 20121214 you can also easily edit the script to index by devices or serial numbers.
This follows on from the basics of SNMP I did previously, this article adds a set of SNMP extension scripts, config, and Cacti templates to monitor hard drives.
Self Monitoring, Analysis, and Reporting Technology is contained in most hard drives these days. It provides a number of built in tests to evaluate the health of a drive and hopefully predict many failures.
Linux has a suite of tools called "smartmontools" which provides a comprehensive set of utilities and a monitoring daemon for checking drives. Configuration of regular testing and monitoring (smartd) is beyond this article and there are plenty of docs around for that already, but what is often useful is to graph key parameters to spot anomalies with parameters which would otherwise go unnoticed.
After installing smartmontools, you can check the basic parameters that drives have with the command:
smartctl -a DEVICE
Where DEVICE is the device for the drive (not a partition). Typically this would be something like /dev/sda (first drive), /dev/sdb (second drive) etc. or /dev/hda (first drive), /dev/hdb (second drive), or some combination of both.
If a drive does not have SMART enabled it will say that in the output of the above. To enable SMART on the drive:
smartctl -s on DEVICE
Note that USB drives do not currently allow SMART data, even though the physical drives inside the boxes are SMART capable. I have no idea why this is the case, and USB drives are the ones I would really like to monitor as they get bashed about more and have poor cooling compared to fixed drives in a system.
Getting SMART over SNMP
Like discussed previously, SMART data requires root privilege to access, and snmpd runs as a low privilege user. What I do is have a CRON job that reads this data and stores it in files for snmpd to access via extension scripts.
If you are using the same config I described previously, then simply add the lines to your /etc/snmp/local-snmp-cronjob file to make it look something like this (may have other content for other tasks):
# where to keep the files
This code simply runs through devices matching /dev/sd? (ie. /dev/sda, /dev/sdb etc.) and dumps their SMART data to a file in /var/local/snmp as described previously. From here extension scripts for snmpd can pick it up without requiring privilege.
At this point STOP. Wait for the smart-* files to be created in /var/local/snmp. I suspect a lot of problems reported relate to not getting an early part of the chain working fully before moving on. Don't move on until the files exists.
SMART parameters are numbered and it made sense to me to exploit the numbering in a universal script instead of having to treat each parameter on it's own.
I place this script (make it executable first: chmod +x smart-generic) in /etc/snmp
This script takes one argument of the SMART parameter number and outputs the difference (remaining life) between the current value and the threshold for that parameter. It is worth noting that different manufacturers (and even different models and revisions of drives) create these values differently so the value is of little interest on it's own, but unusual fluctuations or downward trends are worth taking note of. For temperatures it is normally necessary to take the raw data which can be done by prefixing the parameter ID with a 'R'.
This is another good time to STOP. Test that the smart-generic script is actually picking up the data:
# /etc/snmp/smart-generic 1
In /etc/snmp/snmpd.conf add the following lines (or others if you want to monitor them):
extend smartdevices /etc/snmp/smart-generic devices
There are many other parameters which you could also monitor and as can be seen, they are easily added by simply referencing the parameter ID and updating templates to match.
Note that the config presented here only looks at /dev/sd? devices. If your system has /dev/hd? devices then you will need to modify the scripts accordingly.
Once you have added all this restart snmpd.
This is another good point to STOP. You can test smart-generic by running it from the command line with appropriate parameters, and via SNMP by appending the appropriate SNMP OID to the "snmpwalk" commands shown in previous articles. Ensure that you get valid output on "snmp" related fields when you walk the extended OID: NET-SNMP-EXTEND-MIB::nsExtendOutLine
I have generated some basic Cacti Templates for these SMART parameters with one graph for temperatures and another for health parameters. They are easily extended for more parameters.
For indexed SNMP, Cacti requires an XML file describing how to map the SNMP data to each drive. As this is a local (unpackaged) version I have done my configuration around putting this file in /usr/local/share/cacti/resource/snmp_queries/ and you will need to alter the templates if you put the file elsewhere.
Download: Disk SMART Cacti SNMP Query (XML)
Put this in /usr/local/share/cacti/resource/snmp_queries/ or wherever appropriate for your system. Note that if you change the location then you will also need to update the path to this file in the Cacti Data Query for this template.
Download: Cacti Templates for SMART over SNMP
Simply import this template, and add the data query to the hosts you want to monitor then you should see disks available to monitor and be able to add graphs you want in Cacti. It should just work if your SNMP is working correctly for that device (ensure other SNMP parameters are working for that device).
The big improvement as of version 20121214 is that there are a load of new parameters and templates to support for SSDs. While HDDs have mostly the same stuff from model to model and make to make, every SSD manufacturer has their own ideas on what SMART parameters matter, and that's not surprising since the chipsets are also very different.
I have provided a generic SSD template with everything that seemed to matter on the SSDs I've encournterded, but if you use this template directly then you will end up with loads of nan values. The idea is that you can duplicate this template and then prune that down for the specific model of SSD you are monitoring. I have done this for OCZ Agility-3 and Samsung 830 series devices, but you are free to do this for whatever model you have.
Scaling / normalisation
In many cases different manufacturers (and sometimes even models) have different starting values and thresholds for their parameters. As of version 20121214, the smart-generic script assumes that all parameters start at 100 (mostly the case) and scales them to an graph value of zero at the threshold.
This is however not always the case. Typically a few parameters on most drives may have a different starting or normal value so there is now a pair of hashes (%SCALEBYFAMILY and %SCALEBYMODEL) which provide custom overrides on a device family or device model basis.
Additionally occasionally parameters need ignoring as they are there for informational purposes (via raw values) rather than indicating the drive health and will typically have all zero health values. You can set a scaling of 'U' to hide these values.
Graph Screen Shots
Generic temperature template:
Generic HDD template:
Samsung 830 series SSD template:
OCZ Agility-3 template:
Copyright Glen Pitt-Pladdy 2008-2014