Debugging Cacti Problems

Having created quite a few Cacti templates over the years, I get a lot of requests for help when they don't work.

I think one of the big problems is that there's a lack of a "helicopter view" of how it all works, and what sort of things go wrong. This posting is an attempt to start changing that.

Ingesting Data

Cacti has two basic ways of getting data in, and which to use depends on the type of data. I'm describing these in reverse order since that starts with the least detail and drills down, and it's also the order in which graphs are created in Cacti.

Basic

This is where you have a fixed data set like Load Average. There is only one value, or one fixed set of values, per device, so it's very simple.

3 - Graph

Graphs are normally created from a template, for each device. The template will have associated Data Source(s) which it draws data from. Each data source is an RRD file and this is where the graphs are generated from.

When a Graph is created from a template, the Data Source(s) the Graph needs are generated from their own templates. Note that removing the Graph doesn't remove the Data Sources unless you select that option, and even then it doesn't remove the RRD files - they just stop being updated.

2 - Data Source

Data Sources are again normally created from a template, and then one or more may be used in a Graph. A Data Source can have one or more sources of data, and each Data Source has an RRD file associated with it. Updating things in the Data Source settings doesn't update the corresponding settings in the RRD file. You can either manually change them with rrdtool or delete the RRD file and it will be created again on the next sample with the new settings, but obviously you lose the data that was in it.

Each Data Source has a Data Input Method associated with it which defines how to get the data. Along with that go the settings to get the particular bit of data (eg. the OID for SNMP).

1 - Data Input Method

This could be by SNMP or running a script that could collect local data or retrieve it remotely via some protocol (eg. ssh). My preference is to use SNMP since it's widely supported and provides the only viable "one size fits all" option.

Scripts generally output white-space separated data that can then be parsed for each value.

SNMP sticks with one data item per OID.
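
As a quick illustration (the community string here is just a placeholder), fetching a single fixed value like the 1 minute Load Average from UCD-SNMP-MIB returns exactly one data item for one OID:

$ snmpget -v 2c -c public 127.0.0.1 .1.3.6.1.4.1.2021.10.1.3.1
UCD-SNMP-MIB::laLoad.1 = STRING: 0.15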

Indexed

When there could be multiple of something and you want to be able to discover and identify the things separately, then this comes into play. It could be NICs, Drives, Mountpoints, or any number of things where each device could have a different number or configuration of them. In this case things are slightly different to accommodate this.

4 - Graph

Like with the Basic case, Indexed Graphs have associated Data Source(s). Where things differ is in how the Data Source(s) get their data.

3 - Data Source

Similarly, Data Sources define how to retrieve data and have an RRD file associated with them, but with a small difference: they are set to have an Indexed Data Input Method and don't have specifics about retrieving the data - eg. there is no OID defined for SNMP Input Methods. The actual specifics come from a Data Query.

2 - Data Query

This uses an XML file to define how to retrieve and group the data. It provides a bunch of named sources and associates them with different Data Source names. For example, a NIC might be grouped by device name (eg. eth0, eth1, eth2 ...), and then for each of those there could be sources like Packets In, Packets Out, Data In, Data Out, Discards etc., each mapped onto a corresponding Data Source.

The XML file is where details like the OID pattern (rather than a fixed OID) to use for SNMP and the like are defined.
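
To give a feel for the format, here's a cut-down sketch along the lines of the standard interface query shipped with Cacti (the OIDs are the usual IF-MIB ones; see the files under the Cacti resource/ directory for real, complete examples):

<interface>
    <name>Get SNMP Interfaces</name>
    <oid_index>.1.3.6.1.2.1.2.2.1.1</oid_index>
    <fields>
        <ifName>
            <name>Name</name>
            <method>walk</method>
            <source>value</source>
            <direction>input</direction>
            <oid>.1.3.6.1.2.1.31.1.1.1.1</oid>
        </ifName>
        <ifInOctets>
            <name>Bytes In</name>
            <method>walk</method>
            <source>value</source>
            <direction>output</direction>
            <oid>.1.3.6.1.2.1.2.2.1.10</oid>
        </ifInOctets>
    </fields>
</interface>

Here "input" fields identify the items (the index - NIC names and the like), while "output" fields are the data that gets mapped onto Data Sources.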

The important thing to note is that when a Data Query is added to a device, it will update to show the list of available groups / sub-devices:

Cacti Data Queries on Device Page

Or on the New Graph Page:

Cacti Data Queries on New Graph Page

An important thing to note here is the Green Rings on the right. These can be clicked to refresh the Index, which is useful if any of the devices have changed or there was a problem picking the data up.

1 - Data Input Method

Like with the Basic case, this provides a means to collect the data, be it a script, SNMP or whatever. The important thing is that this is formatted in a way that the Data Query can group the data as you need.

Poller Runs

What happens when the poller runs is that it goes off and collects the necessary data using the methods configured and caches it. Then the cached data is committed to the RRD files (or on the first poll the RRD files are created). Once there is data in the RRD files, graphs can be drawn from them.

Generally this is where problems start to be seen.
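
If you want to watch a polling cycle as it happens, you can also run the poller by hand. On my Debian install that looks like the following, but the path, user and available flags vary between versions and packages:

$ sudo -u www-data php /usr/share/cacti/site/poller.php --force --debug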

Debugging

Since there are a number of defined steps that Cacti goes through, we can check that things are working at each stage.

Device Configuration

It's not uncommon that devices don't pick up data from the start. For SNMP configured hosts this often means you won't see valid info at the top of the Device page:

Cacti SNMP Information on Device Page

Checks:

  • Device is Up (pinging from Cacti)
  • Valid configuration
  • For SNMP check the SNMP Information is being displayed

Logging

When the Poller kicks off it will log to the Cacti log file, assuming you haven't disabled this. On Debian based systems this is /var/log/cacti/cacti.log and will have something like:

11/08/2015 08:25:15 PM - CMDPHP: Poller[0] Host[1] DS[2366] WARNING: Result from SNMP not valid.  Partial Result: U
11/08/2015 08:25:15 PM - CMDPHP: Poller[0] Host[1] DS[2367] WARNING: Result from SNMP not valid.  Partial Result: U
11/08/2015 08:25:15 PM - CMDPHP: Poller[0] WARNING: SNMP Get Timeout for Host:'127.0.0.1', and OID:'NET-SNMP-EXTEND-MIB::nsExtendOutLine."postfixsmtpstatus".11'
11/08/2015 08:25:15 PM - CMDPHP: Poller[0] Host[1] DS[2477] WARNING: Result from SNMP not valid.  Partial Result: U
11/08/2015 08:25:17 PM - CMDPHP: Poller[0] Host[1] DS[2548] WARNING: Result from SNMP not valid.  Partial Result: U
11/08/2015 08:25:17 PM - CMDPHP: Poller[0] Host[1] DS[2562] WARNING: Result from SNMP not valid.  Partial Result: U
11/08/2015 08:25:18 PM - SYSTEM STATS: Time:16.6126 Method:cmd.php Processes:8 Threads:N/A Hosts:8 HostsPerProcess:1 DataSources:1198 RRDsProcessed:1168
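
A handy trick is to watch the log live while a poll runs, optionally filtering on one of the Data Source numbers from the errors (DS[2366] here is just taken from the log above):

$ tail -f /var/log/cacti/cacti.log | grep 'DS\[2366\]'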

So here right away you will see some interesting information. In this case we get a load of stats, but we also see problems with some Data Sources and their associated numbers, and in one case the SNMP OID is displayed.

In cases where we see "U", this is deliberate in the collection scripts to signify invalid data. It could be, for example, that a particular drive type doesn't have some SMART parameters being polled.

Now, one thing you can do is actually try and manually collect data. If it's a script then try running the script command used to retrieve the data. If it's SNMP then run some snmpwalk commands on the appropriate OIDs for the data as in my SNMP Basics article.
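
Taking the SNMP Get Timeout from the log above as an example (the community string is just a placeholder, and this assumes NET-SNMP-EXTEND-MIB is available to your snmp tools), that would be something like:

$ snmpwalk -v 2c -c public 127.0.0.1 'NET-SNMP-EXTEND-MIB::nsExtendOutLine."postfixsmtpstatus"'

If that hangs or returns nothing then the problem is with the SNMP agent or extension script rather than Cacti itself.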

Increased Logging

Logging can be increased in the settings:

Cacti Log Level on Settings Page

Start increasing the Log Level and check what you get. DEBUG is likely the highest you will need to go, but you can push it further if you need to.

11/08/2015 08:35:19 PM - POLLER: Poller[0] CACTI2RRD: /usr/bin/rrdtool update /var/lib/cacti/rra/####_lmsensor_temp5_2667.rrd --template lmsensor_temp5 1447014901:33.0
11/08/2015 08:35:19 PM - CMDPHP: Poller[0] Host[1] DS[2668] SNMP: v3: 127.0.0.1, dsname: lmsensor_temp6, oid: NET-SNMP-EXTEND-MIB::nsExtendOutLine."sensortemps".6, output: 40.5
11/08/2015 08:35:19 PM - PHPSVR: Poller[0] DEBUG: PHP Script Server Shutdown request received, exiting
11/08/2015 08:35:19 PM - POLLER: Poller[0] CACTI2RRD: /usr/bin/rrdtool update /var/lib/cacti/rra/####_lmsensor_temp6_2668.rrd --template lmsensor_temp6 1447014901:40.5
11/08/2015 08:35:19 PM - CMDPHP: Poller[0] Time: 18.8423 s, Theads: N/A, Hosts: 1
11/08/2015 08:35:19 PM - SYSTEM STATS: Time:18.8916 Method:cmd.php Processes:8 Threads:N/A Hosts:8 HostsPerProcess:1 DataSources:1198 RRDsProcessed:1168

Now you are actually getting details like the actual OID being polled and the values returned. This allows you to validate the particular data being returned, the file it's going into, and the command being run to put the data into the RRD file.

RRD File

We should really check the RRD file. To do this run:

$ rrdtool info somefile.rrd
filename = "somefile.rrd"
rrd_version = "0003"
step = 300
last_update = 1437836402
header_size = 4536
ds[smart_10].index = 0
ds[smart_10].type = "GAUGE"
ds[smart_10].minimal_heartbeat = 600
ds[smart_10].min = 0.0000000000e+00
ds[smart_10].max = 1.0200000000e+02
ds[smart_10].last_ds = "100"
ds[smart_10].value = 2.0000000000e+02
ds[smart_10].unknown_sec = 0
rra[0].cf = "AVERAGE"
rra[0].rows = 500

Key things to look out for are last_ds, which shows the last value, and last_update, which is the time of the last update as a Unix Epoch timestamp. You can convert the time at Epoch Converter.
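
If you have GNU date to hand you don't even need to leave the shell - converting the last_update above:

$ date -u -d @1437836402
Sat Jul 25 15:00:02 UTC 2015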

Another thing to look at is min and max: if the data being added falls outside of those, the value will be recorded as NaN (Not a Number).

One thing I've seen recently is a case where the RRD files ended up created with invalid step and minimal_heartbeat settings. These relate to the polling rate that Cacti is set to, which in most cases will be 300 seconds (5 minutes). In this case step should be set to 300, and typically minimal_heartbeat is set to double this. If polling occurs at a longer interval than minimal_heartbeat then data will not be considered valid in the RRD file (NaN).
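
Where the min/max limits or minimal_heartbeat are wrong, "rrdtool tune" can fix them in place without losing data ("U" removes a limit). Note that tune can't change step - for that, deleting the file (or a dump/restore with rrdtool) is needed. Using the Data Source from the info output above:

$ rrdtool tune somefile.rrd --maximum smart_10:U --heartbeat smart_10:600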

You can also use the "dump" function to examine all the data in the file and see if it's being added consistently.
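
For example, a quick way of gauging how much of the file is unknown data is counting the NaN entries in the dump:

$ rrdtool dump somefile.rrd | grep -c NaN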

If data is not being added then you may want to check permissions of the file, directory etc. to verify it's all right. If it is, then try a manual update based on the data in the log. You will want to advance the Epoch time so that your sample matches the next expected data sample for the file. Remember to also "su - CactiUser" to ensure that you are adding data exactly the same way as it would normally be added. One small difference to the command though: use "updatev" instead of "update", which essentially means Verbose:

$ rrdtool updatev somefile.rrd --template smart_10 Epoch:90
return_value = 0
[1437836700]RRA[AVERAGE][1]DS[smart_10] = 9.0066666667e+01
[1437836700]RRA[AVERAGE][1]DS[smart_10] = 9.0066666667e+01
[1437836700]RRA[MIN][1]DS[smart_10] = 9.0066666667e+01
[1437836700]RRA[MIN][1]DS[smart_10] = 9.0066666667e+01
[1437836700]RRA[MAX][1]DS[smart_10] = 9.0066666667e+01
[1437836700]RRA[MAX][1]DS[smart_10] = 9.0066666667e+01
[1437836700]RRA[LAST][1]DS[smart_10] = 9.0066666667e+01
[1437836700]RRA[LAST][1]DS[smart_10] = 9.0066666667e+01

Try adding more than one sample, incrementing the Epoch appropriately each time. If you have a sample out of range (it may take more than one) you could get something like:

$ rrdtool updatev somefile.rrd --template smart_10 Epoch:999
return_value = 0
[1437837300]RRA[AVERAGE][1]DS[smart_10] = NaN
[1437837300]RRA[AVERAGE][1]DS[smart_10] = NaN
[1437837300]RRA[MIN][1]DS[smart_10] = NaN
[1437837300]RRA[MIN][1]DS[smart_10] = NaN
[1437837300]RRA[MAX][1]DS[smart_10] = NaN
[1437837300]RRA[MAX][1]DS[smart_10] = NaN
[1437837300]RRA[LAST][1]DS[smart_10] = NaN
[1437837300]RRA[LAST][1]DS[smart_10] = NaN

Graphs

Once you have valid data in your RRD file (remember it may take more than one sample), if there are still problems with graphs then you can go and find the individual graph that is giving problems and turn on Graph Debug Mode, which will then print a load of additional data on the command being used to generate the graph and any errors it encounters:

Cacti Graph Debug Mode

Graph But No Data Shown

With Indexed sources, a problem I've seen is that the graphs get out of sync with the device (eg. a NIC was added, or devices were detected in a different order after a reboot). In this case clicking the Green Rings for the Data Query (see above) should do the trick.

Another check to make on Data Queries is that all the Data Sources are correctly associated. This can be found in Console->Data Queries->select query (at this stage check for "Successfully located XML file" in Green)->each template:

Cacti Data Query Association

One thing to check is that the checkbox on the right is checked.

Changing Data Source Settings

A common problem I see with Cacti is that when an issue is found, changes are made to correct it but nothing appears to happen. This is often down to the fact that Cacti (at least the versions I've seen) does not change already-existing RRD files. The RRD files are created on the first polling cycle from the settings in the Data Source, but after that no changes are made to the RRD configuration.

I've seen plugins for manipulating RRD files, but in many cases the most practical approach is to use "rrdtool tune" to manually change RRD file settings where possible. In cases where the Data Source has never worked, simply deleting the RRD file and letting Cacti recreate it with the correct settings on the next polling cycle is probably the most practical thing to do.

Other Stuff

One thing that I've noticed is that some errors are due to timeouts (eg. SNMP), which can result in some of the Data Sources for a device not updating while others update, and a small group may be intermittent. In this case you will need to tune the timeouts and number of poller threads to avoid this.
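
To get a feel for which agents or subtrees are slow, you can time a manual walk (the host, community and OID here are placeholders) and compare against the SNMP timeout configured for the device in Cacti:

$ time snmpwalk -v 2c -c public 192.0.2.10 .1.3.6.1.4.1.2021 > /dev/null

If that gets anywhere near the configured timeout then it's no surprise some Data Sources are missing samples.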