Glen Pitt-Pladdy :: Blog

Pinger improved (with Cacti)

This theme may be getting a little tired, and although I've written many different variants of this pinger script over the years, I've somehow managed to avoid publishing most of the previous versions (wrote the articles, just never put them live), so here goes...

Well, actually, I've stopped a few more times while writing this for various enhancements like threading, which turned out to be a right pain due to a Perl bug I've had to work around.

Back to the scripts - all this amounts to is a wrapper for fping (which pings lots of hosts in parallel) that logs, analyses and can act upon the results. For example, it can trigger events (emails and external commands) when nodes go up or down, which is really quite useful for warning admins about problems, as well as for automatically doing something about them or running diagnostics on intermittent faults while they are happening.

Previous versions started off trivial, but with time they got so complicated that they were virtually unmaintainable, certainly with the time I have available to spend on this. I have bitten the bullet and started again with a clean slate. No complex config files or anything - a back-to-basics, comprehensive ping monitor. This time the config file is just Perl, so you can use loops to populate the config etc.

Much more sensible!

Why?

It may seem a rather bizarre thing to do, given the number of ping tools out there, to create yet another. The problem I found when trying to use existing tools was that they just didn't quite manage to build a comprehensive enough picture of what was going on during network failures, and most importantly during partial failures (intermittent or high packet loss conditions).

Initially I wrote this for my own monitoring, and then added features to it as events occurred that I needed more information on. Most notably, during a period of brief but severe routing failures at a company I was working for, I could deploy this tool on several hosts on different networks and in different locations, collecting detailed information about connectivity to different nodes. When failures occur it logs not only the loss of ping but any ICMP returned by routers (eg. network unreachable), and it can also trigger traceroutes and other checking scripts.

This often meant I had detailed information about intermittent outages, including logs of the ICMP returned from failing routers, when the network providers I was dealing with were unable to even detect the problems using checks from their conventional monitoring tools like Nagios.

How it works

The basic script is quite simple - it runs fping and examines the output. It then counts, and also averages, lost and good pings for each configured node, decides from the thresholds whether a node is up or down, and then gets on and does the required action.
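To make the approach concrete, here is a heavily simplified sketch of the idea (this is not the actual fping_logger code - the fping output format in the regex, the node list and the threshold of 4 are all assumptions for illustration): pipe from fping in loop mode, record the replies, and infer losses when the next cycle starts:

#!/usr/bin/perl
# Simplified illustration only - the real script adds averaging, logging,
# emails, external commands, Nagios reporting, threading and fping restarts.
use strict;
use warnings;

my %nodes = ( '127.0.0.1' => 'Ourselves', '10.167.223.10' => 'Router' );
my %consecutive = map { $_ => 0 } keys %nodes;    # consecutive losses per node
my %seen;                                         # replies seen in the current cycle
my $cycle = -1;

open( my $fping, '-|', 'fping', '-l', '-p', '1000', keys %nodes )
    or die "can't run fping: $!";
while ( my $line = <$fping> ) {
    # reply lines look roughly like: "10.167.223.10 : [3], 84 bytes, 0.53 ms (...)"
    next unless $line =~ /^(\S+)\s+:\s+\[(\d+)\],.*?([\d.]+)\s+ms/;
    my ( $node, $seq, $ms ) = ( $1, $2, $3 );
    if ( $seq != $cycle ) {
        # a new cycle has started - anything missing from the last cycle was a loss
        if ( $cycle >= 0 ) {
            foreach my $n ( keys %nodes ) {
                $consecutive{$n} = exists $seen{$n} ? 0 : $consecutive{$n} + 1;
                print "$nodes{$n} ($n) DOWN\n" if $consecutive{$n} == 4;
            }
        }
        %seen = ();
        $cycle = $seq;
    }
    $seen{$node} = $ms;    # latency for this node this cycle (would go to the .csv log)
}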

The basic features list:

  • Uses fping to monitor lots of nodes simultaneously
  • Runs as a daemon (background process) so can be started as a normal startup process from an init script
  • Provides both "consecutive pings" and "averaged loss" methods of determining the status of a node. This allows fast detection of downed hosts, but also allows detection of periods of high packet loss affecting service.
  • Logs ping times (or NULL on losses) to daily .csv files that you can pull into a spreadsheet for further analysis
  • Logs time averaged losses to daily .csv files that you can pull into a spreadsheet for further analysis
  • Automatically cleans up old .csv files
  • Logs any anomalies reported by fping - eg. "no route to host" for more in-depth investigation of losses
  • Can automatically run external commands and scripts (eg. for automatic diagnostics, to page someone or even to automatically compensate for the loss) and logs the output of the command
    • Commands get environment variables set for node name, address, seconds in last state, and current state (UP or DOWN)
    • Output of commands is logged in the .csv files
  • Can automatically email reports which include the output of any commands run and the anomalies reported by fping
    • Output of commands run is included in emails
  • Uses threading when running commands and sending emails to ensure that if multiple hosts are affected, diagnostic commands/scripts all run simultaneously rather than serially, where the later ones may end up running after a brief fault has already cleared
  • Config file in Perl which allows a high level of flexibility
  • Re-reads previous time-averages to continue from the same point when restarted (rather than resetting the averages)
  • Cacti templates and scripts provided to graph latency as well as time averaged losses
  • Cacti scripts try to determine the position of the most recent data in the file efficiently and seek to that position rather than reading the whole file every time - this should help scalability when you have a lot of nodes and/or a rapid ping rate
  • Nagios Passive Checks integration so reporting and status can be consolidated through your Nagios server

You will need the following:

  • Perl - hey, it is written in Perl!
  • fping - not much point otherwise as this is what does the main work
  • A directory to put the logs (default is /var/local/fping_logger)
  • Cacti... if you want to graph things
  • Nagios... if you want to report via Nagios

Note on fping reliability

I am using the fping from Debian Squeeze, which has been patched against reusing sequence numbers. Stock fping is (or at least was) using the same sequence numbers each time per host. The result was that some hosts/firewalls decided they had already seen those pings and threw the packets away, hence generating false losses.

Another problem with fping is that it can get pings out of order. Leave it running long enough and you start getting pings arriving for the next cycle before all the ones from the last cycle have been processed. That rather confuses things, as there is no indication that a ping has not arrived other than it not being there when the next cycle starts. To avoid that problem I have a config option for restarting fping ($FPINGRESTART) which says how many cycles before fping gets restarted (pings resynced). The default is 100, but if you are on particularly wobbly networks and/or using rapid ping rates, you may find you start getting losses due to the script getting confused by the odd ping arriving out of order, in which case you may want to reduce this. The symptom will often be bursts of losses to most nodes (all but the one or two that have gone out of order) that end every 100 pings when fping restarts.
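For example, halving the restart interval is just one line in the config file (described in the next section):

$FPINGRESTART = 50;    # resync fping every 50 cycles instead of the default 100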

Basic Config

The config goes in /etc/fping_logger.pl by default and is simply a lump of Perl that is included (eval'd) into the script (yeah - I know "including" config like this is frowned upon, but it's a very simple way to create a really powerful config file).

Everything above the line with the comment "# Read in the config" may be altered by the config. The main thing we need to put in our config is the %NODES - this is what we actually monitor:

%NODES = (
        '127.0.0.1' => 'Ourselves',
        '10.167.223.10' => 'Router',
        '192.168.37.206' => 'That flaky server',
);

It's a hash of node addresses and descriptions. That should be sufficient to do some basic logging, but if we want emails sent on status changes then we can add those:

%UPDOWNEMAIL = (
        '192.168.37.206' => 'yetagain@dumbfoolhosing.not',
);

And if you want to kick off a command to do something (wake up the sysadmin, run some diagnostics etc.) then you can put that in as well:

%UPDOWNCOMMAND = (
        '192.168.37.206' => 'traceroute -n -I 192.168.37.206',
);

The output of the command will be included in both the email reports and in the log files, with the first column set to "#COMMAND#". It's also worth noting that error messages (eg. Host unreachable) from fping are put in the email reports and in the log files too, with the first column beginning with "#ERROR#".

Thresholds for up/down decisions which are not set are automatically filled in with the default values in the script, so really you want to set sane defaults, and then if certain nodes need specific thresholds, specify just those and the script will fill in the rest for you.

For Nagios integration you need to set $NAGIOSREPORT to the path of the Nagios command file (FIFO):

$NAGIOSREPORT = '/var/lib/nagios3/rw/nagios.cmd';

By default all nodes will be reported to Nagios via their hostnames/IPs, however you can alter that by setting values in %NODENAGIOSREPORT - either set a hostname to match those used in Nagios, or set them to 0 (zero) to disable reporting of that host to Nagios:

%NODENAGIOSREPORT = (
        '127.0.0.1' => 'localhost',   # known as "localhost" to Nagios
        '192.168.37.206' => 0,    # no reporting
);

Assuming your Nagios is already configured to accept passive checks, the corresponding service config will be needed on the Nagios side for hosts that are reported:

define service{
        use                     generic-service
        host_name               localhost
        service_description     fping_logger Connectivity
        check_command           return-unknown
        active_checks_enabled   0
        check_freshness         1
        freshness_threshold     60
    }
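The "return-unknown" check_command referenced above is not something Nagios provides out of the box; if you don't already have an equivalent, a definition along these lines (using the stock check_dummy plugin - the path shown is the usual Debian location) should do, so that a stale passive result shows up as UNKNOWN:

define command{
        command_name    return-unknown
        command_line    /usr/lib/nagios/plugins/check_dummy 3 "No fresh passive result from fping_logger"
}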

Any other things can be overridden too - see the beginning of the script for details. For example, to change how often we ping to, say, 5 seconds:

$PINGTIME = 5000;

.... it's that easy.

And, being Perl, you can use loops to mop up loads of repetitive config if you have a lot of nodes:

foreach (keys %NODES) {
    $UPDOWNCOMMAND{$_} = "/usr/bin/traceroute -n -I $_";
}

Putting it to use

Download: fping_logger Perl script 20140830

The basics are just to run it - ensure that the user you run it as has sufficient privilege (eg. if you are going to run "traceroute -I" then you will generally need to be root), and sit back and watch the stats going into the log files.

Log files should appear in the log directory (default /var/local/fping_logger) in .csv format so you can easily pull them into a spreadsheet for further analysis.

The main file is named in the format pinglog-DATESTAMP.csv and contains the ping times for nodes or NULL for failures. It will also contain any errors from fping with "#ERROR#" prefixing the first column, and any commands run on node status changes with the first column of "#COMMAND#". This should mean that where the interesting stuff is (NULLs) the error messages and analysis from external commands should be right below once the thresholds have been passed.

There are also running loss averages used by the Cacti scripts, which are in files named in the format pingavgAVGTIME-DATESTAMP.csv. These are just running averages of the connectivity - ie. 0 means none and 1 means perfect.

The script generates a PID file (/var/run/fping_logger.pid by default) so to stop it again you can just kill it with:

# kill `cat /var/run/fping_logger.pid`

If you want to do this via an init.d script (ie. start/stop automatically at boot/shutdown) then you can use this as the starting point:

Download: fping_logger init.d script

It should be enough to add this (making it executable) to /etc/init.d/, renaming as appropriate, and in Debian run: update-rc.d <script name> defaults

Adding Cacti

Often by graphing appropriate things, many problems start to show up before they start to affect a service badly enough for people to start complaining, so the thing here is to pull the stats into Cacti to graph them. A typical scenario you may find is that there are regularly low levels of loss at certain times of the day or week which may correspond to usage patterns or even things like some nasty bit of electrical kit causing interference, something overheating or any number of possibilities.

This setup assumes you are running Cacti on the same node (or have exported the filesystems) so that the Cacti scripts can read the log files. The Cacti Data Input Methods for these include the path to the logs so if you are not using the same ones as me then you had better edit the first argument in the Data Input Methods.

There are 2 scripts for Cacti and I have put them in my /usr/local/share/cacti/scripts directory and created the Data Input Methods to match. If you put them in a different location (or your logs are in a different place to the default) then you will need to alter the Data Input Methods so they match.

You will need to import the fping_logger Cacti Template bundle.

These have really taken a bit of time to get right - I keep on finding strange bugs that seem to affect them at times I'm not around to work out what they are doing. The early hours of the morning is their favourite time to fail, but I think I've finally got something dependable enough to release. If it fails on you (gaps in the graphs) then it would be good to hear if you can figure out why and what in the data was causing the problems.

Now it's all fairly usual stuff, and all you need to do when adding the graphs is to add in the address that you have used (first line of header of the log file) to reference the particular host. After a couple of polls your graphs should start to fill and look something like these:

fping_logger latency graph

fping_logger 1 day average loss graph

fping_logger 5 minute average loss graph

Loss Logic Revisited (20110619)

I've been having a long hard think about the best way to handle the fact that a node can be down due to consecutive losses (eg. it was turned off) or due to high average packet loss (eg. it has a flaky/bad connection), and how to combine the two. This comes down to the old saying: "it's simple to make things complex, but complex to make things simple".

There are a number of ugly interactions between the two methods - eg. if there is a hard loss (node completely unreachable) then the loss average is also quickly affected, but when the node returns the consecutive pings may say it's up while the loss average may take much longer to recover. Likewise, if it is flaky/lossy on its return and we have disabled the loss averaging while down due to the hard loss, then it may take the loss average some time to reach the threshold where it determines that the node still does not have an acceptable level of connection quality.

After a lot of consideration I have decided that the best thing is that we follow different logic depending on the cause of the loss:

Initial hard loss

If the initial loss was a hard (consecutive pings) loss then we should limit the loss average to its threshold (ie. not allow it past the threshold).

This ensures not only that we do not return from a hard loss directly into an average loss situation due to the average being skewed, but also that when we do return, if the connection is lossy/flaky, we quickly go into an average loss situation again. This provides what I believe is a good compromise between sensitivity to faults and the risk of false loss situations being generated.

Initial average losses

Where the current loss was caused by the long term average, we should only return the node to an up state when we know that the average loss is acceptable again. We should also ignore the consecutive losses, as the random (OK - I know nothing like this is truly random!) or at least unpredictable behaviour of high packet loss faults may mean that there are periods of hard losses due to the same fault. There may be periods when we get sufficient consecutive pings to say the node is up, but if, say, we repeatedly get 4 good pings (the default consecutive threshold) followed by 3 losses, the connection will still be unusable, while based on consecutive pings alone it would be considered good despite the >40% packet loss.

The consecutive loss counter can continue running as it reacts quickly and is reset to, and counts from, zero on any change. This means that if during the average loss situation the connection is lost completely, and then returns completely (eg. a misbehaving router is rebooted), the consecutive loss counters will be in a neutral state anyway, while if connectivity returns in a lossy state then the average losses continue to hold the node down.

Limiting loss averages

The only thing I haven't been able to come to a definite decision on is whether limits should be placed on the loss averages, so that if they do fall a long way they don't take a long time to recover. If I do decide to limit the average then the next question becomes: "To what?"

I could try to base it off the thresholds. The "up" threshold naturally has a limit of zero packet loss, and as the thresholds will always be close to the low-loss end of the range, the natural high-loss limit of 100% loss means slow recovery. Making the "down" limit the "down" threshold plus the "up" threshold makes some sense in that we put symmetrical limits around the hysteresis band.

The other limiting option is to pass the problem over to the user. I hate doing this as it's really saying "I couldn't figure out a good solution therefore I am now passing the problem over to the end user", and in reality the end user (especially in consumer software) probably has less of a chance of working out a good solution to the problem.

For now I'm going with the best compromise I can think of, which is to use the symmetric limit described above but allow the end user to override it in the config. This adds $LOSSLIMIT as an optional default setting for the limit, and %NODELOSSLIMIT for per-node limits. If no per-node limit is set then $LOSSLIMIT is used if set, otherwise a symmetric limit around the hysteresis band is used.
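As a rough illustration of that precedence (only $LOSSLIMIT and %NODELOSSLIMIT are real config options here - the thresholds are passed in purely for the example), the limit for a node is effectively chosen like this:

# illustrative only - per-node limit first, then the global default, then
# the symmetric limit round the hysteresis band ("down" + "up" thresholds)
sub loss_limit {
    my ( $node, $downthreshold, $upthreshold ) = @_;
    return $NODELOSSLIMIT{$node} if defined $NODELOSSLIMIT{$node};
    return $LOSSLIMIT if defined $LOSSLIMIT;    # note: 0 disables limiting (see below)
    return $downthreshold + $upthreshold;
}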

Limiting does have one significant downside - if the losses are periodic (ie. the node goes hard up or hard down for sustained periods rather than showing a more regular packet loss pattern) then the limit could cause the average to recover excessively quickly, resulting in loads of up and down reports. If you need to protect against this then set "$LOSSLIMIT = 0;" in your config, which effectively disables the limiting, though there may be a partial compromise: a limit sufficiently low to cover common scenarios while not impacting recovery too badly.

No perfect solution

The catch with all this is that in retrospect a failure can look very different. When you get a series of lost pings, is the node down? Have you got high packet loss? Then you get a successful ping - is the node up again? Is it just a lucky packet that got through the high packet loss?

Retrospectively you may see that the problem had cleared and there was no further packet loss, or that the next 2 pings were lost, showing that it was actually still in high packet loss.

Random events are by definition not predictable.

The fundamental thing here is that we only really know what happened well after the incident - there simply isn't enough information available at the time to work these things out immediately so what we are doing here is really just taking our best guess based on a few simple rules.

Update 20110622

I have to confess I released a badly buggy version that didn't initialise the state of nodes on the first cycle, so one down/up cycle was needed before things started working properly. It all seems to be fixed now.

Also, I have removed the use of unbuffer as it turns out it introduces a nasty problem - when pingers are recycled, the file descriptor is simply closed, which causes a SIGPIPE to fping the next time it outputs and it bails out on that. Unfortunately unbuffer carries on despite the SIGPIPE, which means that when it is used the pingers never recycle. I could do a whole load of messing about to find the PIDs of all the child processes - finding the PID of the shell it runs in is easy (it's returned by open), but the PIDs beyond that require dealing with process control, and although Perl has a module for that, I can't be bothered, so I'm just ditching unbuffer.

Another refinement is the availability of a config option for hot file handles - when hot, data is written to the files immediately, so the log files always have the very latest results in them, which is mostly good for a script like this. Where disk writes are better avoided (eg. a filesystem running on FLASH memory), buffering the output reduces writes massively and turning off hot file handles is a benefit.
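For reference, a "hot" file handle in Perl is simply one with autoflush switched on - something like this (the file name is just an example):

open( my $log, '>>', '/var/local/fping_logger/pinglog-DATESTAMP.csv' ) or die $!;
my $previous = select( $log );    # make the log handle the currently selected handle
$| = 1;                           # $| = 1 turns on autoflush - the handle is now "hot"
select( $previous );              # restore the previously selected handle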

Update 20110701

This is more of a refinement update. I have altered the order in which things are done to accommodate suppression of email where an upstream node has failed. For example, if our immediate gateway router has failed, all the hosts we are monitoring beyond it will also be marked as down. The result may be 20 emails arriving suddenly when we really only know that one node has failed.

By using %NODESUPRESSMAILIFDOWN to specify the immediate dependency of a node, the script will now check the dependencies that relate to each node and suppress email sending if any of them are down. That means, in the example above, that when the gateway router fails we get one mail about the gateway router only, and the 20 emails for the nodes beyond it are suppressed.
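Using the example nodes from earlier, a config along these lines (I'm assuming the same node-to-value hash layout as the other per-node options) would suppress mail about the flaky server whenever the Router is down:

%NODESUPRESSMAILIFDOWN = (
        '192.168.37.206' => '10.167.223.10',    # no mail about this node while the Router is down
);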

The existing stuff like commands and logging continues unaffected. This does have the disadvantage that if other nodes fail while their dependency is down, and the dependency then recovers, no email will be sent about the other nodes that have failed, as the failure email will have been suppressed.

Update 20110710

This is mainly a tidy-up release:

  • Switched the individual node monitoring to an Object Orientated approach, which removes many of the global hashes that have accumulated as the script has grown.
  • Mail suppression is now far more intelligent about suppressing mail and can cope with the following scenarios:
    • Dependency dies before/with the Dependent nodes and recovers with/after - only the Dependency will be reported
    • Dependent node(s) die while the Dependency remains up for a while longer and then dies - previously this resulted in the UP mail for the dependent nodes being suppressed; now mail for all nodes is sent. If there was a DOWN mail there will be a matching UP mail.
    • Dependency dies before/with the Dependent nodes (so the DOWN mail gets suppressed) but the Dependency recovers while the Dependent node(s) do not. This implies that both failed with or during the outage and the Dependent nodes have not yet recovered. As the DOWN mail was suppressed we now retrospectively send it with a warning that it was previously suppressed, and the UP mail is sent as usual.
  • Addition of the $CONTINUEAVERAGES config option to disable continuation (reading back of previous averages) on startup. When set to 0 (zero) this will clear (reset to zero) all the packet loss averages on start, as shown after this list. This is useful if you forget to stop the scripts when doing planned maintenance which would cause a prolonged outage and skew the averages.
  • Detection (hopefully) of out-of-order pings and automatic restart of fping to correct the situation
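The $CONTINUEAVERAGES option mentioned above is just another config variable, eg.:

$CONTINUEAVERAGES = 0;    # don't read the previous averages back in - start them from zero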

 
