Glen Pitt-Pladdy :: Blog
SMART stats on Cacti (via SNMP)
Update: This was originally one of my first articles on Cacti stats via SNMP, and subsequently I have built an ever-growing collection of templates and extension scripts based on the same approaches. Originally this was done as 2-disk templates, which were fine for the machines I was working with - my server here is a basic 2-disk setup, so why would I need more? Since then I've worked with all sorts of different disk arrangements and had to fudge things to make useful templates. This update fixes that by bringing things down to two basic templates and switching to indexed SNMP, allowing an arbitrary number of disks to be used.
Following on from the basics of SNMP I covered previously, this article adds a set of SNMP extension scripts, config, and Cacti templates to monitor hard drives.
Being SMART
Self Monitoring, Analysis, and Reporting Technology (SMART) is built into most hard drives these days. It provides a number of built-in tests to evaluate the health of a drive and hopefully predict many failures. Linux has a suite of tools called "smartmontools" which provides a comprehensive set of utilities and a monitoring daemon for checking drives.
Configuration of regular testing and monitoring (smartd) is beyond this article and there are plenty of docs around for that already, but it is often useful to graph key parameters to spot anomalies which would otherwise go unnoticed.
After installing smartmontools, you can check the basic parameters that drives have with the command:
smartctl -a DEVICE
Where DEVICE is the device for the drive (not a partition). Typically this would be something like /dev/sda (first drive), /dev/sdb (second drive) etc. or /dev/hda (first drive), /dev/hdb (second drive), or some combination of both. If a drive does not have SMART enabled, the output of the above command will say so.
To enable SMART on the drive:
smartctl -s on DEVICE
Note that USB drives do not currently allow SMART data, even though the physical drives inside the boxes are SMART capable. I have no idea why this is the case, and USB drives are the ones I would really like to monitor as they get bashed about more and have poor cooling compared to fixed drives in a system.
Getting SMART over SNMP
As discussed previously, SMART data requires root privilege to access, and snmpd runs as a low privilege user. What I do is have a cron job that reads this data and stores it in files for snmpd to access via extension scripts. If you are using the same config I described previously, then simply add the lines to your /etc/snmp/local-snmp-cronjob file to make it look something like this (may have other content for other tasks):
#!/bin/sh
This code simply runs through devices matching /dev/sd? (i.e. /dev/sda, /dev/sdb etc.) and dumps their SMART data to a file in /var/local/snmp as described previously. From here, extension scripts for snmpd can pick it up without requiring privilege.
SMART parameters are numbered, and it made sense to me to exploit the numbering in a universal script instead of having to treat each parameter on its own.
Download: Perl script to extract SMART parameters for SNMP
I place this script in /etc/snmp (make it executable first: chmod +x smart-generic).
This script takes one argument, the SMART parameter number, and outputs the difference (remaining life) between the current value and the threshold for that parameter. It is worth noting that different manufacturers (and even different models and revisions of drives) create these values differently, so the value is of little interest on its own, but unusual fluctuations or downward trends are worth taking note of. For temperatures it is normally necessary to take the raw data, which can be done by prefixing the parameter ID with an 'R'.
In /etc/snmp/snmpd.conf add the following lines (or others if you want to monitor them):
extend smartdevices /etc/snmp/smart-generic devices
These are respectively:
There are many other parameters which you could also monitor and, as can be seen, they are easily added by simply referencing the parameter ID and updating the templates to match.
Note that the config presented here only looks at /dev/sd? devices. If your system has /dev/hd? devices then you will need to modify the scripts accordingly.
Once you have added all this you can test smart-generic by running it from the command line with appropriate parameters, and via SNMP by appending the appropriate SNMP OID to the "snmpwalk" commands shown previously.
Cacti Templates
I have generated some basic Cacti Templates for these SMART parameters, with one graph for temperatures and another for health parameters. They are easily extended for more parameters.
For indexed SNMP, Cacti requires an XML file describing how to map the SNMP data to each drive. As this is a local (unpackaged) version I have built my configuration around putting this file in /usr/local/share/cacti/resource/snmp_queries/ and you will need to alter the templates if you put the file elsewhere.
Download: Disk SMART Cacti SNMP Query (XML)
Put this in /usr/local/share/cacti/resource/snmp_queries/ or wherever appropriate for your system. Note that if you change the location then you will also need to update the path to this file in the Cacti Data Query for this template.
Download: Cacti Templates for SMART over SNMP
Simply import this template and add the data query to the hosts you want to monitor. You should then see disks available to monitor and be able to add the graphs you want in Cacti. It should just work if your SNMP is working correctly for that device (ensure other SNMP parameters are working for that device).
Graph Screen Shots
If you have more disks then you can add a pair of these graphs for every disk.
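As a rough illustration of the caching step described under "Getting SMART over SNMP" (loop over /dev/sd? and dump each drive's SMART data into /var/local/snmp), something along these lines would do the job. It is a sketch, not the script from this article: the function name cache_smart and the SMART_CMD, DEV_GLOB and CACHE_DIR variables are invented for the example, and the smart-sda style file naming is assumed from the cache path mentioned in the comments below.

```shell
#!/bin/sh
# Sketch only: cache SMART data as root so that snmpd can read it unprivileged.
# cache_smart, SMART_CMD, DEV_GLOB and CACHE_DIR are illustrative names,
# not taken from the original cron job.
SMART_CMD="${SMART_CMD:-smartctl}"
DEV_GLOB="${DEV_GLOB:-/dev/sd?}"
CACHE_DIR="${CACHE_DIR:-/var/local/snmp}"

cache_smart() {
    mkdir -p "$CACHE_DIR" || return 1
    for dev in $DEV_GLOB; do
        # When nothing matches, the glob is left as a literal string: skip it
        [ -e "$dev" ] || continue
        name=${dev##*/}
        # Write to a temporary file and rename it, so a reader never sees
        # a half-written cache. smartctl's exit status is a bit mask that
        # can be non-zero even when the output is usable, so it is ignored.
        "$SMART_CMD" -a "$dev" > "$CACHE_DIR/smart-$name.tmp" 2>/dev/null
        mv "$CACHE_DIR/smart-$name.tmp" "$CACHE_DIR/smart-$name"
    done
}
```

Calling cache_smart from the root cron job would then refresh one cache file per drive on every run. Systems with /dev/hd? devices would need a different DEV_GLOB.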
Disclaimer: This is a load of random thoughts, ideas and other nonsense and is not intended to be taken seriously. I have no idea what I am doing with most of this so if you are stupid and naive enough to believe any of it, it is your own fault and you can live with the consequences. More importantly this blog may contain substances such as humor which have not yet been approved for human (or machine) consumption and could seriously damage your health if taken seriously. If you still feel the need to litigate (or whatever other legal nonsense people have dreamed up now), then please address all complaints and other stupidity to yourself as you clearly "don't get it".
Copyright Glen Pitt-Pladdy
Comments:
Hi Glen,
great work you've done. I had never got my head around populating the SNMP MIB with an external process until now. It always looks easy when it's in front of you!
I modified your Perl script and created an SNMP Query XML which will handle any numbers of drives. It uses only two graph templates, one for errors and the other for temperatures. Basically you get 2 graphs per drive. I have a server with 10 drives and did not want to do all the work to add 8 more data sources! And I thought it would be better to make it handle any number of drives. If you're interested I can send you my XML templates.
Regards
Scott
Hi Scott
Sorry for the delay getting back to you about your post on my blog. It did take a while to pull all the fragments of info together to get SNMP working neatly.
Your approach sounds interesting and I'm sure would help lots of people who have servers with more than 2 disks to monitor.
I am happy for you to post a URL on my blog linking to your code & templates, else I am happy to host them off my server.
Either way, please make sure you give yourself credit for the extra work you have put into the template and scripts - I would suggest a comment in the code about the enhancements with your name and a URL.
Thanks
Glen
Could Scott send me the templates? Thanks a lot.
Pitt-Pladdy, thanks for the template :D
Thanks for these. I've also altered your templates to support 4 drives; if you'd like a copy, let me know.
Hello,
First, thank you very much for this. I was struggling to get this working for the past few weeks and even started writing a SMART MIB and a net-snmp SMART extension...
If I may make two points: first, I had to rewrite the path inside the big xml file cact.... to point at my directory for the disk_smart.xml file. Second, this line posted above:
extend smartreaderr /etc/snmp/smart-generic devices
has to be rewritten as follows:
extend smartdevices /etc/snmp/smart-generic devices
Yes, you are quite right about the snmpd.conf line - have updated this. You should only need to change paths if you are putting the file in a different place.
I am trying to use your tutorial to get SMART info via SNMP.
I am getting:
NET-SNMP-EXTEND-MIB::nsExtendOutLine."smarttemp".1 = STRING: NA
NET-SNMP-EXTEND-MIB::nsExtendOutLine."smarteccrec".1 = STRING: NA
NET-SNMP-EXTEND-MIB::nsExtendOutLine."smartairflow".1 = STRING: NA
NET-SNMP-EXTEND-MIB::nsExtendOutLine."smartdevices".1 = STRING: /dev/sda
NET-SNMP-EXTEND-MIB::nsExtendOutLine."smartreaderr".1 = STRING: NA
NET-SNMP-EXTEND-MIB::nsExtendOutLine."smartrealloc".1 = STRING: NA
NET-SNMP-EXTEND-MIB::nsExtendOutLine."smartseekerr".1 = STRING: NA
Would you please tell me how I can fix this, so that I do not get NA any more?
Looking at the script, NA is output when it can't get the data about the drive. I would suggest starting by looking at the data file: /var/local/snmp/smart-sda
Check that the cron job is creating it properly with valid data in it.
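A quick way to run that check from a shell is sketched below. The check_smart_cache name is made up for this example, and the test for a line starting with "ID# ATTRIBUTE_NAME" assumes the cache file holds plain smartctl -a output in its default format.

```shell
#!/bin/sh
# Sketch only: sanity-check one cache file written by the cron job.
# check_smart_cache is an illustrative name, not part of the article's scripts.
check_smart_cache() {
    file=$1
    # -s is true only for a file that exists and is not empty
    if [ ! -s "$file" ]; then
        echo "missing or empty: $file"
        return 1
    fi
    # The attribute table in default smartctl output starts with this header
    if ! grep -q '^ID# *ATTRIBUTE_NAME' "$file"; then
        echo "no attribute table: $file"
        return 1
    fi
    echo "ok: $file"
}
```

For the case above that would be check_smart_cache /var/local/snmp/smart-sda. A "missing or empty" result points at the cron job, while "no attribute table" suggests smartctl ran but could not read SMART data from the drive.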
Hello. I have got this error in Console->Devices->Edit "Data Query Debug Information"
+ Running data query [13].
+ Found type = '3' [snmp query].
+ Found data query XML file at '/usr/share/cacti/site/resource/snmp_queries/disk_smart.xml'
+ XML file parsed ok.
+ Executing SNMP walk for list of indexes @ 'NET-SNMP-EXTEND-MIB::nsExtendOutLine."smartdevices"'
+ No SNMP data returned
+ Found data query XML file at '/usr/share/cacti/site/resource/snmp_queries/disk_smart.xml'
+ Found data query XML file at '/usr/share/cacti/site/resource/snmp_queries/disk_smart.xml'
+ Found data query XML file at '/usr/share/cacti/site/resource/snmp_queries/disk_smart.xml'
I tried running it on a Windows machine but I think these instructions are for Linux, because it doesn't work. Or have I made a mistake somewhere in the steps above?
These instructions are for Debian or Ubuntu, though they should work on other Linux distros with a few changes. The important thing to note is that we are using snmpd on the Linux/Unix host we are monitoring, and are adding extensions to snmpd to collect the data using smartmontools.
To monitor Windows hosts you would have to extend the SNMP service in a similar way to provide the same data; then you may be able to use this template or adapt it for Windows use.
I am not sure if there are any implications to running Cacti on Windows and monitoring a Linux/Unix host with the extensions. That may well be possible without too much trouble.
From the error logs you give it looks like Cacti found the data query OK and is trying to check the SNMP where it is failing. My guess is that the host you are trying to monitor is not running the snmpd extensions given here (or Windows equivalent) correctly and that is why it is failing.
I hope that is useful to you.
No no :), I have a virtual machine running Debian. Cacti, SNMP, PHP and Apache are installed at the newest versions from stable. SNMP works, and Cacti draws some graphs with other default settings plus a template with processes (on a Windows host). This error log is from creating a graph for a Windows host. My question was: are these instructions only for Linux hosts/servers, or can this script 'catch' Windows hosts too?
Ah! So if I understand right:
* running Cacti on a Debian VM
* monitoring disks on a Windows host via SNMP
If that's the case then this is not going to work without a whole lot of extra stuff on the Windows host.
The way that this monitoring works is that Cacti acts as a SNMP client (no need to install anything extra on the Cacti box/VM), collecting remote data and graphing. The remote host you are monitoring has to be an SNMP server (in the case of Linux running snmpd).
Stock SNMP servers (Linux or Windows) will not give the SMART data needed for this. In order to send the SMART data over SNMP, we have to extend the SNMP server on the host we are monitoring with a load of scripts. That is described here for monitoring Linux hosts running snmpd only.
It may be possible to extend the Windows SNMP service in a similar way, but that's beyond my Windows knowledge so I can't help with that.
I hope I understood right this time :)
Yes, you're right. Cacti is installed and configured on a Debian virtual machine under VMware (created on vsphere 1.0, running on vserver 2.0). I monitor the Debian VM, some printers, a router, 2 servers (SLES [SNMP cannot connect to it yet, but I am working on that] + W2K8R2) and Windows XP hosts. smartmontools for Windows is installed on the Windows hosts. smartctl -a /dev/sda recognizes the HDD and prints the SMART values in the cmd CLI, so it works directly on the host. Maybe it is possible to rewrite the script to work with smartctl on a Windows host?
It does appear it is possible to extend the Windows SNMP service but it requires writing a .dll - see: http://stackoverflow.com/questions/136206/how-can-i-write-an-snmp-agent-or-snmp-extension-agent-dll-in-c-sharp
Another possibility is that NetSNMP (what we use on Linux) is also available for Windows: http://www.net-snmp.org/docs/README.win32.html
The extension scripts are all designed around snmpd on Linux/Unix and will likely need modifying to work with Windows Net-SNMP.
That's not something I can help with - I mainly do Linux/Unix work.
OK thank u very much for tips, i will look at those links and figure something out;)You helped me:) Cheers:)
Is there any possibility that I can get a template that can be imported by cacti Version 0.8.7d
I would like to get this smartmon work desperately... If you can get me a cacti 0.8.7d template I would be thankful...
Thanks again!
This question comes up regularly enough that I have just created an article about it. See: http://www.pitt-pladdy.com/blog/_20120305-102839_0000_Cacti_hack_for_forward_compatibility/
okay. I tried this- updated the global_arrays.php.... now I get a different error:
"Error: XML: Generated with a newer version of Cacti."
Earlier i got this one- "Error: XML: Hash version does not exist." Can you please help me matey!
That's a very old version of Cacti so it's possible it's simply too old. It might well be a whole lot less work just to upgrade Cacti to a more recent version.
Thanks mannn...
Anyway, if anybody wants to try importing, the trick is to change the hash numbers: the digits in the 3rd to 6th positions represent the minimum Cacti version supported. If you change them manually the template will import, but whether it works or not will depend on the features used in that template!
with respect to this template it works with 0.8.7d...
But soon after that I just got NaN everywhere, not something I would expect! :(
Reading thru the comments, I found out that
1. the local-snmp-cronjob file had to be modified - for some reason 'sed' used in the way mentioned doesn't work on mine (both Red Hat and SUSE)
2. smart-generic must have a valid path, i.e. if you have not used the default path /var/local/snmp then you need to change $FILE in smart-generic
After these changes were made, it works.... partially!
This is where I need your help... again!
- The first graph looks fine: the temp is 25-30 and the airflow is 100 (always, is this OK?)
- The second graph has only 3 values, the rest are NaN (i.e. Reallocated Sectors, Seek Errors and ECC Recovered have values; the other fields, Raw errors, Poweron Hours and High Fly Writes, are NaN).
Is smart-generic not parsing properly? Maybe smartctl has a different output, maybe because of a different version? I have my manual unix thing graphing power-on hours, so the data is there; it is just some text parsing issue, right matey?
SMART does vary between different drives. Check in your snmpd.conf which parameter airflow is referencing. If there is an "R" in front of the parameter number then it takes the RAW_VALUE column rather than the VALUE column. For temperatures I would normally take the RAW_VALUE, which on all my drives is the actual measured temperature, hence airflow is R190.
It sounds like you may have a drive that is putting a processed number into the RAW_VALUE field for airflow. Have a look at the cache files generated by the cron job and see what is in those parameters for your drives. On almost all my drives temp and airflow are very close.
For "nan" fields, your drives may have fewer and/or different parameters. Again, check the cache file for the drive. Since all the parameters are referenced by their number, you can add more to snmpd.conf, update the .xml with them, and then in Cacti update the Data Template, Graph Template and the Data Query accordingly, then re-add the graphs.
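To make the VALUE versus RAW_VALUE distinction concrete, here is a small shell and awk sketch of the lookup as described: a plain parameter number gives VALUE minus THRESH (the remaining life), an "R" prefix gives the RAW_VALUE column, and "NA" is printed when the data cannot be found. This is not the Perl smart-generic script itself. The smart_param name is invented, and the column positions assume the default smartctl attribute table (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE).

```shell
#!/bin/sh
# Sketch only: look up one SMART parameter in one cache file.
# Usage: smart_param SPEC FILE   (SPEC is e.g. 5 or R190)
# smart_param is an illustrative name, not the article's Perl script.
smart_param() {
    spec=$1
    file=$2
    # No readable cache file means no data for this drive
    [ -r "$file" ] || { echo NA; return 0; }
    case $spec in
        R*) id=${spec#R}; raw=1 ;;   # "R" prefix: report RAW_VALUE
        *)  id=$spec; raw=0 ;;       # otherwise: VALUE minus THRESH
    esac
    awk -v id="$id" -v raw="$raw" '
        # Attribute rows start with the ID and carry a hex FLAG in column 3
        $1 == id && $3 ~ /^0x/ && NF >= 10 {
            if (raw == 1) print $10; else print $4 - $6
            found = 1
            exit
        }
        END { if (!found) print "NA" }
    ' "$file"
}
```

Running it over every cache file, for example with a loop such as for f in /var/local/snmp/smart-sd?; do smart_param R190 "$f"; done, gives one line per drive, which is the shape of output an indexed SNMP query expects.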
thanks for the reply... will look at the cache files tonight and try to get some positive reply mate!
i am pretty sure I will have some q's for you!! thanks again
had a quick look at the cache file and the data template... why is the airflow min max set to 0-1000 and others 0-100? any idea? could it be the reason why airflow is always 1000? the cache shows identical airflow and temp readings.
the other values are pretty high in the cache file.. however the data template is set to expect values between 0-100. could it be why i get NaN?
please ignore the post above...
had a quick look at the cache file and the data template... why is the airflow min max set to 0-1000 and others 0-100? any idea? could it be the reason why airflow is always 100? the cache shows identical airflow and temp readings around 25-30.
the other values are pretty high in the cache file.. for eg. Power on hrs=382. Seekerror Rate=21639493. However the data template is set to expect values between 0-100. could it be why i get NaN?
Regarding using R as a prefix in snmpd.conf, are you saying that R190 gets the RAW_VALUE and using just 190 gets me the VALUE? I will try a few things tonight and get back to you with details.
There is a chance that the data is not being picked up correctly by smart-generic. Try running it manually and see what you get:
$ /etc/snmp/smart-generic R190
35
34
34
25
The limits could be responsible for the NaNs but again smart-generic should be picking up the VALUE (remaining life / health) field for that and that is normally in the range 0-100 on my drives. Again different drives may report different numbers.
You can also try running smart-generic for the parameters you are getting NaN on and see what happens, but make sure you run it exactly as in snmpd.conf so that it picks up RAW_VALUE or VALUE as snmpd would. Running an snmpwalk may also be useful to know what Cacti is receiving via snmp.
extend smartdevices /etc/snmp/smart-generic devices
extend smartreaderr /etc/snmp/smart-generic 1
extend smartrealloc /etc/snmp/smart-generic 5
extend smartseekerr /etc/snmp/smart-generic 7
extend smartseekerr /etc/snmp/smart-generic 9
extend smartairflow /etc/snmp/smart-generic 189
extend smartairflow /etc/snmp/smart-generic R190
extend smarttemp /etc/snmp/smart-generic R194
extend smarteccrec /etc/snmp/smart-generic 195
are respectively:
The device list used as the index
Raw_Read_Error_Rate
Reallocated_Sector_Ct
Seek_Error_Rate
Power_On_Hours
High_Fly_Writes
Airflow_Temperature_Cel (RAW)
Temperature_Celsius (RAW)
Hardware_ECC_Recovered
Are you sure there are two smartairflow and 2 smartseekerr
extend smartseekerr /etc/snmp/smart-generic 7
extend smartseekerr /etc/snmp/smart-generic 9
extend smartairflow /etc/snmp/smart-generic 189
extend smartairflow /etc/snmp/smart-generic R190
Wow! Well spotted! I must have messed that up when I updated this article with the new indexed template.
I have updated the article with the correct entries for snmpd.conf (pasted directly from the server I develop the template on).
okay..
I ran smart-generic for all the parameters needed by graph.. they are right.
I highly doubt that there is some typo in the extend lines in snmpd.conf. Shouldn't there be some connection between snmpd.conf and the data template? Can you please recheck?
How should I read these extends using snmpwalk? Can you please show me an example mate!
For an Indexed template the entries in snmpd.conf are connected to the Data Sources via the .xml file and the associated Data Query in the template. As some of your graphs are working I think it's likely the .xml and the Data Query are working.
I definitely think it's worth running an snmpwalk against NET-SNMP-EXTEND-MIB::nsExtendOutLine (as described in my "SNMP basics" article) to see if all the SMART parameters are being reported correctly - at least then we know if the problems are SNMP or Cacti. You should get stuff like:
...
NET-SNMP-EXTEND-MIB::nsExtendOutLine."smartpwrcyc".2 = STRING: 77
NET-SNMP-EXTEND-MIB::nsExtendOutLine."smartairflow".1 = STRING: 35
NET-SNMP-EXTEND-MIB::nsExtendOutLine."smartairflow".2 = STRING: 34
NET-SNMP-EXTEND-MIB::nsExtendOutLine."smartdevices".1 = STRING: /dev/sda
...
Check all the problem parameters are being reported correctly. You may want to pipe the command through "grep smart" to extract only the SMART lines.
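If there are a lot of lines to read through, a small filter can pick out just the SMART extends that report NA. This is only a sketch: smart_na_report is a made-up name, and the snmpwalk invocation in the comment assumes SNMP v2c with the community string "public" against localhost, which will differ on most real setups.

```shell
#!/bin/sh
# Sketch only: list "extend-name index" for every SMART extend reporting NA.
# Reads snmpwalk output on stdin; smart_na_report is an illustrative name.
#
# Assumed invocation (adjust version, community and host to your setup):
#   snmpwalk -v 2c -c public localhost \
#       'NET-SNMP-EXTEND-MIB::nsExtendOutLine' | smart_na_report
smart_na_report() {
    # Print only the lines that match, reduced to the extend name and index
    sed -n 's/.*nsExtendOutLine\."\(smart[^"]*\)"\.\([0-9][0-9]*\) = STRING: NA$/\1 \2/p'
}
```

Each line of output names an extend from snmpd.conf and the drive index that returned no data, so a result like "smartreaderr 1" says parameter lookups are failing for the first drive in the smartdevices list.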
All good matey! thanks for all the help...