Glen Pitt-Pladdy :: Blog
SMART stats on Cacti (via SNMP)
Update: This was originally one of my first articles on Cacti stats via SNMP, and subsequently I have built an ever growing collection of templates and extension scripts based on the same approaches. Originally this was done as 2-disk templates which where fine for the machines I was working with - my server here is a basic 2-disk setup, so why would I need more. Since that I've worked with all sorts of different disk arrangements and had to fudge things to make useful templates. This update fixes that by bringing things down to two basic templates and switching to indexed SNMP allowing an arbitrary number of disks to be used. As of version 20121214 you can also easily edit the script to index by devices or serial numbers.
This follows on from the basics of SNMP I did previously, this article adds a set of SNMP extension scripts, config, and Cacti templates to monitor hard drives.
Self Monitoring, Analysis, and Reporting Technology is contained in most hard drives these days. It provides a number of built in tests to evaluate the health of a drive and hopefully predict many failures.
Linux has a suite of tools called "smartmontools" which provides a comprehensive set of utilities and a monitoring daemon for checking drives. Configuration of regular testing and monitoring (smartd) is beyond this article and there are plenty of docs around for that already, but what is often useful is to graph key parameters to spot anomalies with parameters which would otherwise go unnoticed.
After installing smartmontools, you can check the basic parameters that drives have with the command:
smartctl -a DEVICE
Where DEVICE is the device for the drive (not a partition). Typically this would be something like /dev/sda (first drive), /dev/sdb (second drive) etc. or /dev/hda (first drive), /dev/hdb (second drive), or some combination of both.
If a drive does not have SMART enabled it will say that in the output of the above. To enable SMART on the drive:
smartctl -s on DEVICE
Note that USB drives do not currently allow SMART data, even though the physical drives inside the boxes are SMART capable. I have no idea why this is the case, and USB drives are the ones I would really like to monitor as they get bashed about more and have poor cooling compared to fixed drives in a system.
Getting SMART over SNMP
Like discussed previously, SMART data requires root privilege to access, and snmpd runs as a low privilege user. What I do is have a CRON job that reads this data and stores it in files for snmpd to access via extension scripts.
If you are using the same config I described previously, then simply take the cron script smart-cron, make it executable and add the lines to your /etc/snmp/local-snmp-cronjob file to make it run this as one of the tasks:
# collect SMART DATA
This code in smart-cron simply runs through devices matching /dev/sd? (ie. /dev/sda, /dev/sdb etc.) and dumps their SMART data to a file in /var/local/snmp as described previously. From here extension scripts for snmpd can pick it up without requiring privilege.
A special case here is if you have a situation where devices get re-ordered on boot (eg. two controllers which may be detected in different order) and you need to force an fixed ordering. Edit smart-cron and uncomment the SMARTLIST variable, and add in all the device paths that will not change (suggest using /dev/disk/by-id/*). You should go clear out any smart-* files in /var/local/snmp if they have already been created or if you change config to avoid invalid data being left behind.
At this point STOP. Wait for the smart-* files to be created in /var/local/snmp. I suspect a lot of problems reported relate to not getting an early part of the chain working fully before moving on. Don't move on until the files exists.
SMART parameters are numbered and it made sense to me to exploit the numbering in a universal script instead of having to treat each parameter on it's own.
I place the parameter parsing script smart-generic (make it executable first: chmod +x smart-generic) in /etc/snmp
This script takes one argument of the SMART parameter number and outputs the difference (remaining life) between the current value and the threshold for that parameter. It is worth noting that different manufacturers (and even different models and revisions of drives) create these values differently so the value is of little interest on it's own, but unusual fluctuations or downward trends are worth taking note of. For temperatures it is normally necessary to take the raw data which can be done by prefixing the parameter ID with a 'R'.
This is another good time to STOP. Test that the smart-generic script is actually picking up the data:
# /etc/snmp/smart-generic 1
In /etc/snmp/snmpd.conf add the following lines (or others if you want to monitor them):
extend smartdevices /etc/snmp/smart-generic devices
There are many other parameters which you could also monitor and as can be seen, they are easily added by simply referencing the parameter ID and updating templates to match.
Note that the config presented here only looks at /dev/sd? devices. If your system has /dev/hd? devices then you will need to modify the scripts accordingly.
Once you have added all this restart snmpd.
This is another good point to STOP. You can test smart-generic by running it from the command line with appropriate parameters, and via SNMP by appending the appropriate SNMP OID to the "snmpwalk" commands shown in previous articles. Ensure that you get valid output on "snmp" related fields when you walk the extended OID: NET-SNMP-EXTEND-MIB::nsExtendOutLine
I have generated some basic Cacti Templates for these SMART parameters with one graph for temperatures and another for health parameters. They are easily extended for more parameters.
For indexed SNMP, Cacti requires an XML file describing how to map the SNMP data to each drive. As this is a local (unpackaged) version I have done my configuration around putting this file in /usr/local/share/cacti/resource/snmp_queries/ and you will need to alter the templates if you put the file elsewhere.
Put the data query XML disk_smart.xml in /usr/local/share/cacti/resource/snmp_queries/ or wherever appropriate for your system. Note that if you change the location then you will also need to update the path to this file in the Cacti Data Query for this template.
Simply import the Cacti template cacti_host_template_smart_parameters.xml, and add the data query to the hosts you want to monitor then you should see disks available to monitor and be able to add graphs you want in Cacti. It should just work if your SNMP is working correctly for that device (ensure other SNMP parameters are working for that device).
The big improvement as of version 20121214 is that there are a load of new parameters and templates to support for SSDs. While HDDs have mostly the same stuff from model to model and make to make, every SSD manufacturer has their own ideas on what SMART parameters matter, and that's not surprising since the chipsets are also very different.
I have provided a generic SSD template with everything that seemed to matter on the SSDs I've encournterded, but if you use this template directly then you will end up with loads of nan values. The idea is that you can duplicate this template and then prune that down for the specific model of SSD you are monitoring. I have done this for OCZ Agility-3 and Samsung 830 series devices, but you are free to do this for whatever model you have.
Scaling / normalisation
In many cases different manufacturers (and sometimes even models) have different starting values and thresholds for their parameters. As of version 20121214, the smart-generic script assumes that all parameters start at 100 (mostly the case) and scales them to an graph value of zero at the threshold.
This is however not always the case. Typically a few parameters on most drives may have a different starting or normal value so there is now a pair of hashes (%SCALEBYFAMILY and %SCALEBYMODEL) which provide custom overrides on a device family or device model basis.
Additionally occasionally parameters need ignoring as they are there for informational purposes (via raw values) rather than indicating the drive health and will typically have all zero health values. You can set a scaling of 'U' to hide these values.
Graph Screen Shots
Generic temperature template:
Generic HDD template:
Samsung 830 series SSD template:
OCZ Agility-3 template:
Copyright Glen Pitt-Pladdy 2008-2016