Lusis Development

Graphing solutions for Nagios

Posted by John Vincent (jv) on Mar 28 2008 at 2:15 PM

I just did my first setup of a nagios 3 install and in the process I came across a new graphing solution that is the successor (of sorts) to perfparse.

About perfparse: I liked it as a whole, but what I did not like was the fact that it stored EVERYTHING in an RDBMS (MySQL). Sure, it makes for easy selecting of the data, but it lacked some important features like multiple datasources per graph, and the graphs were just non-standard. Say what you will about RRD files, but they're pretty much the de facto standard for storing and graphing performance data. The generated graphs look clean, the RRD files are relatively small (depending on how much data you want to keep) and most importantly they NEVER grow beyond the initial size. I can keep 5 years of data in an RRD at varying precision levels and the file will never get any bigger. The sliding aggregation is amazing.

Couple that with the fact that perfparse development appears to have all but died and I'm left looking for a graphing solution that was as easy and transparent as perfparse. I gave nagiosgrapher a shot but it just didn't appeal to me. Then I came across pnp4nagios.

The PNP in pnp4nagios stands for "PNP is not Perfparse". A quote from the homepage:

"During development of PNP we set value on easy installation and little maintenance while running it. An administrator should do other things than configure graphing tools."

I couldn't agree more.

While I love the fact that Ethan has made the data in Nagios available to external processes via perfdata, something has always felt screwy about USING that information.

I understand that Nagios is first and foremost a monitoring solution. However monitoring is only one part of the global picture of system administration. There's also the planning aspect. Sure you can say "Hey boss we've had 30 alerts over the past month about running out of disk space but they're starting to occur more frequently. Can we get more disks?".

Of course the boss is going to want to know not only how MUCH disk space but how long it will last. While you could throw out a number that was "extracted from the posterior region", wouldn't it be nicer if you actually had some sort of intelligence around your answers?

Most often this is where a graphing addon for Nagios comes into play. By storing performance data and graphing against it, you can not only see how much disk space growth you've had but how long it took you to get there. By looking at a graph, you can see that you're now running at 60% utilization on a resource but that it took you a year to get there. Maybe you see a spike in usage and when the spike drops back down, it never returns to the base value it spiked from. From there you can say "Every night at 12AM we spike from X% disk usage to 65% and then it drops down to about 5% above where it was before the spike. At this rate, we'll be out of disk space in X number of days." This is something that anyone can understand. When IT folks just ASK for stuff without providing any justification, it expends a lot of personal capital with upper management. So you see, trending is just as important as, if not more important than, knowing when something has a problem. Through judicious trending and historical information intelligence, you can prevent future pages.
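To make the "out of disk space in X number of days" arithmetic concrete, here is a minimal sketch that fits a straight line through daily usage samples and extrapolates to 100%. The function name and the sample numbers are made up for illustration; real data would come out of your RRD files, and real growth is rarely this tidy.

```python
def days_until_full(samples, capacity_pct=100.0):
    """Estimate the days left until usage reaches capacity_pct.

    samples is a list of (day, percent_used) pairs. A least-squares line is
    fitted through them. Returns None if usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must cover more than one point in time")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx  # percentage points gained per day
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    last_day = samples[-1][0]
    fitted_now = intercept + slope * last_day  # fitted usage at the last sample
    return (capacity_pct - fitted_now) / slope


# Hypothetical history: usage creeps up 5 points per day and sits at 60% today.
history = [(0, 40.0), (1, 45.0), (2, 50.0), (3, 55.0), (4, 60.0)]
print(days_until_full(history))
```

With that made-up history the line climbs 5 points a day from 60%, which leaves 8 days. That is the kind of number you can take to the boss.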

But back to pnp4nagios. Of all the graphing tools I've ever set up, this one was by far the easiest. Many tools required explicit start up orders (because they used a FIFO to pass data from nagios to the perfdata handler for performance) and would result in lost data should the performance processor crash. At CLA, we actually had a nagios check just for perfparse so that we would know when it died because we would lose so much after-hours intelligence in our trending if we didn't catch it right then. You could use a script to handle the performance data but that solution just didn't scale in the long term. This could actually slow down nagios itself. The only other option was to write the data to a file and process it externally. Depending on how long it took for the script to process the log file this solution would fail to scale as well.

PNP (http://www.pnp4nagios.org/pnp/start) has a few options for processing performance data. One is using a command (a perl script) to process each result as it comes in. This allows realtime updates to the graphs but suffers from the fact that it doesn't scale. There's also a bulk mode: nagios writes performance data to a file and then, using the "*_perfdata_file_processing_interval" options, fires off the parser every X seconds. This option can block nagios and hold up new checks until the parser finishes.

The option I like most is the Bulk mode with daemon option. This option still has nagios writing to timestamped files containing performance data. However, it provides a background daemon written in C that checks the directory periodically and processes the files with the perl script. This one provides the best performance if you don't need realtime graphing.
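The spool-directory pattern behind the daemon mode is easy to picture. The sketch below is NOT PNP's daemon (that one is written in C and hands files to the perl script); it is just a rough Python illustration of the idea, with a made-up function name and made-up file names.

```python
import os
import tempfile


def process_spool(spool_dir, handler):
    """Process each file in spool_dir once, in name order, then delete it.

    handler is called with every non-empty line. Returns the names of the
    files that were processed.
    """
    processed = []
    for name in sorted(os.listdir(spool_dir)):
        path = os.path.join(spool_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path) as fh:
            for line in fh:
                line = line.rstrip("\n")
                if line:
                    handler(line)
        os.remove(path)  # a processed file is never read twice
        processed.append(name)
    return processed


# Demo with a throwaway directory and hypothetical timestamped file names.
with tempfile.TemporaryDirectory() as spool:
    for stamp, text in (("1206730446", "first batch"), ("1206730496", "second batch")):
        with open(os.path.join(spool, "perfdata." + stamp), "w") as fh:
            fh.write(text + "\n")
    print(process_spool(spool, print))
```

A real daemon would call something like this in a loop with a sleep in between. The point is that nagios only ever writes files and moves on, so a slow or crashed processor delays the graphs instead of blocking checks or losing data.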

But what about the mission statement that administrators should have better things to do than configure graphing tools? This part is easy as well. Once you have pnp processing the data in your particular mode, you don't really have to do anything else. You can write your own processor for graphing the RRD files or use the built in php grapher to do that.

How is this accomplished? Let's look at the heading of two rrd files created by pnp for two different services:

filename = "Check_home_partition.rrd"
rrd_version = "0003"
step = 60
last_update = 1206730446
ds[1].type = "GAUGE"
ds[1].minimal_heartbeat = 8640
ds[1].min = NaN
ds[1].max = NaN
ds[1].last_ds = "2051"
ds[1].value = 1.2306000000e+04
ds[1].unknown_sec = 0

vs.

filename = "Check_system_Loadavg.rrd"
rrd_version = "0003"
step = 60
last_update = 1206730496
ds[1].type = "GAUGE"
ds[1].minimal_heartbeat = 8640
ds[1].min = NaN
ds[1].max = NaN
ds[1].last_ds = "0.160"
ds[1].value = 8.9600000000e+00
ds[1].unknown_sec = 0
ds[2].type = "GAUGE"
ds[2].minimal_heartbeat = 8640
ds[2].min = NaN
ds[2].max = NaN
ds[2].last_ds = "0.150"
ds[2].value = 8.4000000000e+00
ds[2].unknown_sec = 0
ds[3].type = "GAUGE"
ds[3].minimal_heartbeat = 8640
ds[3].min = NaN
ds[3].max = NaN
ds[3].last_ds = "0.070"
ds[3].value = 3.9200000000e+00
ds[3].unknown_sec = 0

For the most part, they look exactly the same, yet they monitor two totally different items. PNP uses XML files, created at the time the RRD is created, to define which name should be applied to a specific DS. This means that in the first example, DS[1] is actually labeled with the name of the partition that we're checking in the service check. For the second example, DS[1], DS[2] and DS[3] are labeled load1, load5 and load15 respectively. This means you don't have to spend time defining what the RRD should look like before you try to put data into it.
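To show where those labels come from, here is a rough sketch that pulls the labels out of a plugin's performance data string (the standard 'label'=value[UOM];warn;crit;min;max format) and numbers them in order. This is my own illustration, not PNP's parser, and I'm assuming the DS numbers simply follow the order of the entries. The thresholds in the sample strings are invented; the values match the last_ds fields above.

```python
import re
import shlex

# Leading number, then whatever follows it as the unit of measurement.
VALUE_RE = re.compile(r"^(-?\d+(?:\.\d+)?)(.*)$")


def parse_perfdata(perfdata):
    """Return a list of (ds_index, label, value, unit) tuples, index from 1."""
    result = []
    # shlex splits on whitespace and strips the single quotes around labels.
    for token in shlex.split(perfdata):
        label, sep, rest = token.partition("=")
        if not sep:
            continue
        match = VALUE_RE.match(rest.split(";")[0])
        if not match:
            continue
        result.append((len(result) + 1, label, float(match.group(1)), match.group(2)))
    return result


# Hypothetical check output for the two services shown above.
load = "load1=0.160;3.000;5.000;0; load5=0.150;3.000;5.000;0; load15=0.070;3.000;5.000;0;"
disk = "'/home'=2051MB;2500;2800;0;3000"
for row in parse_perfdata(load) + parse_perfdata(disk):
    print(row)
```

The RRD itself only ever sees numbered data sources. The names live next to it, which is why both files above can share the same generic layout.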

Now when it comes to graphing, there are plenty of predefined templates that you can use. In fact there are multiple templates already created for the most common services (check_load,check_tcp,check_disk and others) that you can simply symlink from templates.dist to templates under the name of the service check you've defined. In this case, I have a templates/checkdisk-dummy.php which is a symlink to templates.dist/check_disk.php.
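If you have a lot of checks to map, the symlinking can be scripted. Here is a small sketch that assumes templates and templates.dist sit side by side under a single PNP directory; the helper name is hypothetical and the demo runs against a throwaway directory, so adjust the paths to your own install.

```python
import os
import tempfile


def link_template(pnp_root, dist_name, check_name):
    """Point templates/<check_name>.php at templates.dist/<dist_name>.php."""
    # A relative target keeps the link valid if the PNP directory is moved.
    target = os.path.join("..", "templates.dist", dist_name + ".php")
    link = os.path.join(pnp_root, "templates", check_name + ".php")
    os.makedirs(os.path.dirname(link), exist_ok=True)
    if os.path.lexists(link):
        os.remove(link)  # replace an existing link so reruns are harmless
    os.symlink(target, link)
    return link


# Demo in a temporary directory that stands in for the PNP install.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "templates.dist"))
    with open(os.path.join(root, "templates.dist", "check_disk.php"), "w") as fh:
        fh.write("<?php /* placeholder template */ ?>\n")
    link = link_template(root, "check_disk", "checkdisk-dummy")
    print(os.path.basename(link), "->", os.readlink(link))
```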

The hardest aspect of the install was getting the templates just the way I wanted them. That didn't stop me from taking in the performance data as soon as possible, though, so that when I DID have the graphs looking the way I wanted (beyond the really nice defaults), I already had 30 minutes of data in an RRD that I could use.

The biggest upside to all this was that with the deprecation of extinfo in Nagios version 3, I could now have icons linking to the graphs for all my hosts with a one-line addition:

define host{
    name            generic-appserver-host
    register        0
    max_check_attempts    3
    notification_interval    120
    notification_period    24x7
    notification_options    d,u,r,f
    check_command        check-host-alive-dummy
    hostgroups        appservers
    contact_groups        testgroup
    action_url        /nagios/pnp/index.php?host=$HOSTNAME$
}

I could also do the same in my service template definition and have it applied to all hosts that have that service check:

define service {
    name            loadavg
    service_description    Check system Loadavg
    check_command        checkload-dummy!3,3,3!5,5,5
    max_check_attempts    3
    normal_check_interval    5
    retry_check_interval    120
    check_period        24x7
    notification_interval    120
    notification_period    24x7
    notification_options    w,u,c,r,f
    register        0
    contact_groups        testgroup
    action_url        /nagios/pnp/index.php?host=$HOSTNAME$&srv=$SERVICEDESC$
}

I'm going to be writing more on Nagios now that I'm back to using it more, but I wanted to get some notes out there for people already.
