Nagios

This section describes the Qlustar Nagios setup.

Nagios Plugins

The package qlustar-nagios-plugins contains the tools required to process the data received from Ganglia. The thresholds, the services and nodes to monitor, and the group definitions are set in files located in the directory /etc/nagios3/conf.d/qlustar. The file nodes.cfg lists the nodes and is auto-generated by QluMan. Each node requires a few lines, as shown in this example:

define host {
  host_name      beo-01
  use            generic-node
  register       1
}
define host {
  host_name      beo-10
  use            generic-node
  register       1
}
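To verify which nodes QluMan has written to nodes.cfg, the host names can be extracted from the file. The following is a minimal sketch; the function name list_hosts is hypothetical and it assumes that each host_name entry sits on its own line, as in the example above:

```shell
# Print the host names defined in a nodes.cfg-style file.
# Assumption: every "host_name <name>" pair is on a line of its own.
list_hosts() {
    awk '$1 == "host_name" { print $2 }' "$1"
}
```

It could be called as list_hosts /etc/nagios3/conf.d/qlustar/nodes.cfg.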

The file hostgroup-definitions.cfg defines which nodes belong to which hostgroup:

define hostgroup {
  hostgroup_name        ganglia-nodes
  members               beo-0.
  register              1
}

The regular expression beo-0. specifies that all nodes with a hostname matching the expression are members of this group. If you need additional groups, because you have different types of nodes with a different set of available metrics or with metrics that require different thresholds, you can define them here. Example:

define hostgroup {
  hostgroup_name          opterons
  members                 beo-1[3-6]
  register                1
}
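Before adding a new group, it can help to check which hostnames a members expression would select. The sketch below uses grep for this; the function name match_hosts is hypothetical, and matching the expression against the whole hostname is an assumption, since how Nagios applies the expression depends on its regular expression settings:

```shell
# Print those hostnames that match an extended regular expression.
# Usage: match_hosts PATTERN HOST...
# Assumption: the pattern has to match the complete hostname (grep -x).
match_hosts() {
    pattern=$1
    shift
    printf '%s\n' "$@" | grep -E -x -e "$pattern"
}
```

For example, match_hosts 'beo-1[3-6]' beo-12 beo-13 beo-16 beo-17 prints beo-13 and beo-16.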

The file services.cfg lists all metrics that should be monitored, together with their thresholds and the groups for which each service is defined. For cluster nodes, the metric data is delivered via Ganglia. The following example defines the monitoring of the fan speed for the members of the group opterons:

define service {
  use                      generic-service
  hostgroup_name           opterons
  service_description      Ganglia fan1
  check_command            check_ganglia_fan!3000!0!"fan1"!$HOSTNAME$
  register                 1
}

With this definition, the service enters the warning state once the fan speed drops below 3000, and the critical state if the fan fails completely (speed 0).
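The threshold logic described above can be sketched as a small shell function. This is not the check_ganglia_fan plugin itself; the function name classify_fan_speed and the exact handling of the boundary values are assumptions. The return values follow the usual Nagios plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL):

```shell
# Classify an integer fan speed against lower-bound thresholds.
# Usage: classify_fan_speed SPEED WARN CRIT
# Assumption: WARNING applies below WARN, CRITICAL at or below CRIT.
classify_fan_speed() {
    speed=$1 warn=$2 crit=$3
    if [ "$speed" -le "$crit" ]; then
        echo "CRITICAL"
        return 2
    elif [ "$speed" -lt "$warn" ]; then
        echo "WARNING"
        return 1
    fi
    echo "OK"
    return 0
}
```

With the thresholds from the example, classify_fan_speed 2500 3000 0 reports WARNING, and classify_fan_speed 0 3000 0 reports CRITICAL.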

The following is an example of a service that is monitored for the members of two hostgroups:

define service {
  use                     generic-service
  hostgroup_name          ganglia-nodes,opterons
  service_description     Ganglia temp1
  check_command           check_ganglia_temp!50!60!"temperature1"!$HOSTNAME$
  register                1
}
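In a check_command value, Nagios separates the command name and its arguments with exclamation marks, and passes the arguments on as $ARG1$, $ARG2$ and so on. The sketch below splits such a value to make the individual fields visible; the function name split_check_command is hypothetical. What each argument means is decided by the corresponding command definition, which is not shown here, so reading 50 and 60 as warning and critical thresholds is an assumption made by analogy with the fan example:

```shell
# Split a check_command value into its command name and arguments.
# Usage: split_check_command 'name!arg1!arg2...'
split_check_command() {
    printf '%s\n' "$1" | awk -F'!' '{
        print "command=" $1
        for (i = 2; i <= NF; i++)
            print "ARG" (i - 1) "=" $i
    }'
}
```

Pass the value in single quotes so that the shell leaves the exclamation marks and $HOSTNAME$ untouched, for example split_check_command 'check_ganglia_temp!50!60!"temperature1"!$HOSTNAME$'.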

Monitoring the head-node(s)

The file localhost.cfg lists the services that should be monitored for the head-node(s). The definitions are different because the data is not collected through Ganglia.

The software RAID (md) devices are monitored by mdadm, which sends mail to root if a device fails. By default, the RAID devices are not monitored by the Nagios setup.

Web interface

You can open the Nagios web interface at the address http://<head-node>/nagios3/. Log in as nagiosadmin. The password can be changed by executing the following command as root:

0 root@cl-head ~ #
htpasswd /etc/nagios3/htpasswd.users nagiosadmin

Restart

Nagios uses the information collected by Ganglia. If this information source is not available, Nagios will send warning mails. To avoid being flooded by these mails when you need to restart Ganglia, you should first stop Nagios:

0 root@cl-head ~ #
service nagios3 stop

Then you can restart Ganglia:

0 root@cl-head ~ #
service ganglia-monitor restart
0 root@cl-head ~ #
service gmetad restart

After restarting Ganglia on the head-node, you need to restart Ganglia on the compute nodes as well. The following shows how to do this on the command line. Alternatively, you can use the QluMan RXengine, possibly creating a pre-defined command for it:

0 root@cl-head ~ #
dsh -a service ganglia-monitor restart
0 root@cl-head ~ #
dsh -a service gmetric stop
0 root@cl-head ~ #
dsh -a service gmetric start

Finally, you can start Nagios again:

0 root@cl-head ~ #
service nagios3 start
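The complete sequence can also be collected in a small script. The following is a sketch only, using the commands shown above; the names run, restart_ganglia_quietly and the DRY_RUN variable are hypothetical helpers introduced here for illustration. With DRY_RUN=1 the commands are printed instead of executed, which allows a check of the order before running them as root:

```shell
# Run a command, or only print it when DRY_RUN=1 is set.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Restart Ganglia on the head-node and the compute nodes while
# Nagios is stopped, so that no warning mails are sent.
restart_ganglia_quietly() {
    # Abort if Nagios cannot be stopped.
    run service nagios3 stop || return 1
    # Head-node
    run service ganglia-monitor restart
    run service gmetad restart
    # Compute nodes
    run dsh -a service ganglia-monitor restart
    run dsh -a service gmetric stop
    run dsh -a service gmetric start
    # Start Nagios again, whatever the outcome of the steps above.
    run service nagios3 start
}
```

Calling DRY_RUN=1 restart_ganglia_quietly lists the seven commands in order; calling restart_ganglia_quietly as root executes them.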