
Nagios debate

Nagios is a well established monitoring tool, widely deployed and well supported. It does failed/working monitoring and has plugins to do more. It's often deployed alongside Cacti, which provides graphing for trend monitoring and capacity planning data, but which has also evolved to handle failed/working monitoring and much more via a wide range of plugins.

What is surprising is that many people I talk to in the industry think that Nagios has its warts yet serves its purpose well, while others think it should have been "aborted ages ago" and prefer other monitoring tools.

So what can be improved?

Execution overhead

Nagios provides a core scheduler and coordination of tasks, and relies on external executables (plugins) to perform checks. While this makes it extremely flexible, it also creates significant overhead: a new executable (and potentially an interpreter such as sh or Perl) has to be spun up for every check, and a shell script plugin may in turn spin up a number of additional processes.

I've seen even moderate sized operations with several thousand parameters that need to be checked, and by default Nagios will check every minute. That's several thousand, maybe tens or even hundreds of thousands, of processes to spin up every minute, or several hundred to thousands per second. Where caches get cleared during the check interval this also generates a load of disk activity, and overall it proves a huge overhead.
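As a rough illustration of where that overhead comes from, here is a small sketch in Python comparing a trivial "check" run as a freshly spawned shell process (roughly what Nagios does per plugin) with the same no-op done in-process. The numbers are purely illustrative and depend entirely on the host:

    #!/usr/bin/env python3
    # Illustrative only: compare per-check process spin-up with an
    # in-process call. Absolute timings depend entirely on the host.
    import subprocess
    import time

    ITERATIONS = 100

    # spawn a shell for each "check" (roughly what Nagios does per plugin)
    start = time.time()
    for _ in range(ITERATIONS):
        subprocess.run(['/bin/sh', '-c', 'exit 0'])
    spawned = time.time() - start

    # the same trivial "check" done in-process
    start = time.time()
    for _ in range(ITERATIONS):
        result = 0  # placeholder for an in-process check
    inprocess = time.time() - start

    print('spawned: %.3f ms/check' % (1000.0 * spawned / ITERATIONS))
    print('in-process: %.6f ms/check' % (1000.0 * inprocess / ITERATIONS))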

The result is that for all but the smallest infrastructures Nagios is often configured to check much less frequently than every minute, say every 15 minutes, which in turn means that breakages may go undetected for as long as 15 minutes (the average being half that). By that time users will often be complaining about the failure before the monitoring catches it.

Rapid checking

Certain things benefit from checks more rapid than once a minute. For example, catching intermittent network problems or low level packet loss may require pings every few seconds to see the brief disruption.

While this may be possible with Nagios, it's far from ideal. In many cases it's necessary to run a separate monitoring daemon and use passive checks with Nagios to raise the alert. This is in fact what my ping monitoring tool does, and it has proven very useful on a number of occasions where suppliers monitoring purely with Nagios failed to catch brief routing failures and the like.
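For the Nagios side of that arrangement, the daemon only needs to write results into the Nagios external command file as passive check results. A minimal sketch in Python, where the command file path and the host/service names are assumptions that vary by installation:

    #!/usr/bin/env python3
    # Minimal sketch of a daemon submitting a passive check result to
    # Nagios. The command file path and the host/service names are
    # assumptions and vary by installation.
    import time

    NAGIOS_CMD = '/var/lib/nagios3/rw/nagios.cmd'  # external command file

    def submit_passive(host, service, return_code, output):
        # PROCESS_SERVICE_CHECK_RESULT is the standard Nagios external
        # command for submitting passive service check results
        line = '[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n' % (
            int(time.time()), host, service, return_code, output)
        with open(NAGIOS_CMD, 'w') as cmdfile:
            cmdfile.write(line)

    # e.g. a rapid ping loop calls this the moment loss exceeds a threshold
    submit_passive('router1', 'Ping Loss', 2, 'CRITICAL - 40% loss in 10s')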

State loss

As each check executes and then terminates, any stored state is lost. In many cases this is probably a good thing, but where stored state is useful, such as when monitoring the rate of some count (e.g. an error count on a switch port or a failed login counter), Nagios makes things difficult.

Monitoring plugins often end up storing their own state in files which have to be read and written each time the check executes (more overhead), or else forgo true rate monitoring and raise alerts on an absolute count rather than a rate. That means an admin has to go and manually reset counters or similar, which isn't exactly automated monitoring, and it creates a crying-wolf situation that further undermines the effectiveness of monitoring. When things become a hassle, human nature is to avoid doing them, which undermines the benefits of having automated monitoring in the first place.
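A sketch in Python of the file-based state such a rate-checking plugin typically ends up carrying. The state path, thresholds and counter source are all illustrative assumptions:

    #!/usr/bin/env python3
    # Sketch of the file-based state a rate-checking plugin typically
    # ends up with. State path, thresholds and the counter source are
    # all illustrative assumptions.
    import os
    import sys
    import time

    STATE = '/var/tmp/check_errors.state'
    WARN_RATE = 1.0     # errors/second
    CRIT_RATE = 10.0

    def read_counter():
        return 12345    # placeholder: a real plugin would poll SNMP etc.

    count = read_counter()
    now = time.time()
    rate = None
    if os.path.exists(STATE):
        with open(STATE) as f:
            last_time, last_count = (float(v) for v in f.read().split())
        if now > last_time and count >= last_count:
            rate = (count - last_count) / (now - last_time)
    # extra read+write disk I/O on every single execution of the check
    with open(STATE, 'w') as f:
        f.write('%f %d' % (now, count))

    if rate is None:
        print('UNKNOWN - no previous state')
        sys.exit(3)
    if rate >= CRIT_RATE:
        print('CRITICAL - %.2f errors/sec' % rate)
        sys.exit(2)
    if rate >= WARN_RATE:
        print('WARNING - %.2f errors/sec' % rate)
        sys.exit(1)
    print('OK - %.2f errors/sec' % rate)
    sys.exit(0)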

Remote Checks

Nagios has various remote checking options, but typically NRPE is used to remotely execute plugins on the target node. Wisely, this requires the check to be configured on the remote node: the request is made as a named check and the result returned. This minimises security risk as there is no remote input to the check.
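For illustration, a couple of typical named check definitions as they might appear in nrpe.cfg on the monitored node (plugin paths and thresholds are examples and vary by distribution):

    # nrpe.cfg on the monitored node: checks are named here, and with
    # dont_blame_nrpe=0 no arguments are accepted from the network
    command[check_load]=/usr/lib/nagios/plugins/check_load -w 15,10,5 -c 30,25,20
    command[check_root_disk]=/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /

    # the Nagios server then requests checks purely by name, e.g.:
    #   check_nrpe -H node.example.com -c check_load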

The trouble here is that once again all remote checks run as individual executions with high overhead. Since the configuration is on the remote node anyway, it would be far better if that node also scheduled all its own checks, offloading them from the Nagios server, and just passed back a bundle of results plus perhaps a heartbeat to show it was still alive. In many cases that could reduce the execution overhead on the Nagios server by an order of magnitude or better, allowing much more frequent checks and quicker response.

Configuration

This has been one of the main gripes I have encountered with Nagios: the configuration is mostly text based and can get messy. While that's true in one sense, personally I am far from convinced by it. A web interface (e.g. like Cacti and other tools) may have a much easier learning curve, but GUI type interfaces also suffer from problems such as the lack of comments, which are useful both for making other admins aware of important aspects of the configuration that may otherwise be misunderstood, and for commenting out config when testing and experimenting.

Comprehensively monitoring even a moderate sized infrastructure may mean several thousand checked items, and managing that through a GUI is likely to be far more painful than a text based config once the initial learning curve is overcome.

Nagios configuration does often get messy, but this is not necessarily the fault of its design. Nagios allows config to be inherited from templates or explicitly specified, which lets much of the configuration be handled in common within the templates. Since monitoring is something that evolves in response to failures and with the growth of infrastructure, it often lacks the up-front planning needed to keep the configuration tidy, and as a result it suffers badly from technical debt as new config gets tacked on in a hurry to monitor the latest pressing problem.
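To illustrate the template mechanism, a minimal sketch of Nagios object config (all names and values here are examples):

    # A template carrying the common settings; 'register 0' marks it
    # as a template rather than a real host
    define host {
        name                    generic-host
        check_command           check-host-alive
        max_check_attempts      3
        notification_interval   60
        register                0
    }

    # Real hosts then only need to state what differs
    define host {
        use                     generic-host
        host_name               web1
        address                 192.0.2.10
    }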

Management and engineering need to recognise this and ensure time is spent refactoring the config and planning monitoring better. Almost any configuration or code can fall into this trap, and monitoring is by nature particularly sensitive to it. While there are likely improvements in the configuration design that could help, I don't think this one is a fault of Nagios so much as a common and easily made oversight in the management and maintenance of monitoring systems.

Why am I thinking about this?

I've been toying with the idea of building a new monitoring platform with the checking plugins as classes, which allows checks to be threaded. This way there is no execution overhead (unless the plugin depends on executing an external check, such as a Nagios compatibility plugin) and checks can run concurrently. As each check is an object, it also inherently has (volatile) state storage.
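A very rough sketch of the idea in Python; all names here are hypothetical, this being a design sketch rather than any existing API:

    #!/usr/bin/env python3
    # Rough sketch of checks as threaded objects: no process spin-up per
    # check, and volatile state simply lives on the object. All names
    # are hypothetical; this is a design sketch, not an existing API.
    import threading
    import time

    class Check(threading.Thread):
        interval = 60.0     # seconds; subclasses can check far more often

        def __init__(self, name):
            super().__init__(daemon=True)
            self.name = name
            self.state = {}     # volatile per-check state, free in this model

        def run(self):
            while True:
                status, message = self.check()
                print('%s: %d %s' % (self.name, status, message))
                time.sleep(self.interval)

        def check(self):
            raise NotImplementedError

    class LoadCheck(Check):
        interval = 5.0      # rapid checking without extra processes

        def check(self):
            # Linux-specific load average source
            with open('/proc/loadavg') as f:
                load1 = float(f.read().split()[0])
            # previous sample kept on the object, so no state files needed
            delta = load1 - self.state.get('last', load1)
            self.state['last'] = load1
            return 0, 'load %.2f (delta %+.2f)' % (load1, delta)

    LoadCheck('load').start()
    time.sleep(12)      # let a couple of check cycles run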

It also becomes possible to provide a non-volatile storage mechanism (say MySQL) which can be inherited by each plugin class. This can be used for non-volatile storage as well as asynchronous passing of data between different elements of the system, making it easy to support many different checking schedules (even very rapid ones!).
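A sketch of what that inherited non-volatile store might look like. MySQL is the suggestion above; sqlite3 stands in here purely to keep the sketch self-contained, and all table and method names are hypothetical:

    #!/usr/bin/env python3
    # Sketch of a non-volatile store inherited by plugin classes. MySQL
    # is the suggestion above; sqlite3 stands in purely to keep this
    # sketch self-contained. All names are hypothetical.
    import sqlite3

    class PersistentState:
        def __init__(self, db_path='monitoring_state.db'):
            self._db = sqlite3.connect(db_path)
            self._db.execute(
                'CREATE TABLE IF NOT EXISTS state ('
                ' plugin TEXT, key TEXT, value TEXT,'
                ' PRIMARY KEY (plugin, key))')

        def store(self, key, value):
            self._db.execute(
                'REPLACE INTO state (plugin, key, value) VALUES (?, ?, ?)',
                (type(self).__name__, key, str(value)))
            self._db.commit()

        def fetch(self, key, default=None):
            row = self._db.execute(
                'SELECT value FROM state WHERE plugin=? AND key=?',
                (type(self).__name__, key)).fetchone()
            return row[0] if row else default

    # a check class inherits the store alongside its check logic
    class SwitchErrorCheck(PersistentState):
        def check(self, count):
            last = self.fetch('errors')
            self.store('errors', count)
            return None if last is None else count - int(last)

    c = SwitchErrorCheck()
    print(c.check(100))     # None on the first ever run
    print(c.check(130))     # 30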

Inheritance would also make it easy to provide common methods such as various rate checking options, and possibly even to give every object its own smart scheduler that can adapt to risks in real time. Much more is possible with this model.

A tempting project, but right now I have not fully worked out the best approach to many details of the design, and I have plenty of other things on the go.