Nagios
This section describes the Qlustar Nagios setup.
Nagios Plugins
The package qlustar-nagios-plugins contains the tools required to process the data received
from Ganglia. The thresholds, services, nodes to monitor, and group definitions are set in
files located in the directory /etc/nagios3/conf.d/qlustar. The file nodes.cfg lists the
nodes. This file is auto-generated by QluMan. A few lines are required for each node, as shown
in this example:
    define host {
        host_name beo-01
        use generic-node
        register 1
    }
    define host {
        host_name beo-10
        use generic-node
        register 1
    }
The file hostgroup-definitions.cfg defines which nodes belong to which hostgroup:
    define hostgroup {
        hostgroup_name ganglia-nodes
        members beo-0.
        register 1
    }
The regular expression beo-0. specifies that all nodes with a hostname matching the
expression are members of this group. If you need to create additional groups because you have
different types of nodes with a different set of available metrics, or with metrics that
require different thresholds, you can define them here. Example:
    define hostgroup {
        hostgroup_name opterons
        members beo-1[3-6]
        register 1
    }
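Before committing a pattern to hostgroup-definitions.cfg, it can help to preview which hostnames it would select. The following is a minimal sketch, assuming that the members patterns behave like unanchored POSIX extended regular expressions, which is what grep -E implements. The helper name match_hosts and the sample hostnames are hypothetical and not part of the Qlustar packages.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with Qlustar: print the hostnames
# that match an extended regular expression.
# Usage: match_hosts PATTERN HOST...
match_hosts() {
    pattern=$1
    shift
    # Emit one hostname per line and keep only the lines matching the pattern.
    printf '%s\n' "$@" | grep -E -- "$pattern"
}

# Which of these sample nodes would end up in ganglia-nodes?
match_hosts 'beo-0.' beo-01 beo-09 beo-10 beo-13
# prints beo-01 and beo-09

# Which of these sample nodes would end up in opterons?
match_hosts 'beo-1[3-6]' beo-01 beo-12 beo-13 beo-16 beo-17
# prints beo-13 and beo-16
```

Because the match is unanchored under this assumption, a pattern such as beo-0. would also select a longer name like beo-010. Anchoring the pattern with ^ and $ avoids that if your naming scheme requires it.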
The file services.cfg lists all metrics that should be monitored. It includes the thresholds
and the groups for which each service is defined. For cluster nodes, the metric data is delivered
via Ganglia. The following example defines the monitoring of the fan speed for the members of
the group opterons:
    define service {
        use generic-service
        hostgroup_name opterons
        service_description Ganglia fan1
        check_command check_ganglia_fan!3000!0!"fan1"!$HOSTNAME$
        register 1
    }
With this definition, the service will enter the warning state once the fan speed drops below
3000, and it will enter the error state if the fan fails completely (speed 0).
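The threshold logic described above can be sketched as a small shell function. This is only an illustration of the semantics, not the check_ganglia_fan plugin itself. The function name fan_state is hypothetical, and the sketch assumes integer speed values and that "speed 0" means "at or below the second threshold". The return values follow the standard Nagios plugin convention (0 for OK, 1 for WARNING, 2 for CRITICAL).

```shell
#!/bin/sh
# Hypothetical illustration of the fan thresholds; NOT the real plugin.
# Usage: fan_state WARN CRIT SPEED   (integers only)
fan_state() {
    warn=$1
    crit=$2
    speed=$3
    if [ "$speed" -le "$crit" ]; then
        # Fan has stopped (or is at/below the critical threshold).
        echo CRITICAL
        return 2
    elif [ "$speed" -lt "$warn" ]; then
        # Fan is still turning, but slower than the warning threshold.
        echo WARNING
        return 1
    fi
    echo OK
    return 0
}
```

With the thresholds from the example, fan_state 3000 0 3500 reports OK, fan_state 3000 0 2500 reports WARNING, and fan_state 3000 0 0 reports CRITICAL. Note that for a fan a lower value is worse, whereas for a temperature check the comparison would run in the opposite direction.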
The following is an example of a service that is monitored for the members of two hostgroups:
    define service {
        use generic-service
        hostgroup_name ganglia-nodes,opterons
        service_description Ganglia temp1
        check_command check_ganglia_temp!50!60!"temperature1"!$HOSTNAME$
        register 1
    }
Monitoring the head-node(s)
The file localhost.cfg lists the services that should be monitored for the head-node(s). The
definitions are different because the data is not collected through Ganglia.
The software RAID (md) devices are monitored by mdadm, and mail is sent to root if a device fails. The RAID devices are not monitored by the Nagios setup by default.
Web Interface
You can open the Nagios web interface at the address http://<head-node>/nagios3/. Log in as
nagiosadmin. The password can be changed by executing the following command as root:
0 root@cl-head ~ # htpasswd /etc/nagios3/htpasswd.users nagiosadmin
Restart
Nagios uses the information collected by Ganglia. If this information source is not available, Nagios will send warning mails. To avoid being flooded by these mails when you need to restart Ganglia, you should first stop Nagios:
0 root@cl-head ~ # service nagios3 stop
Then you can restart Ganglia:
0 root@cl-head ~ # service ganglia-monitor restart
0 root@cl-head ~ # service gmetad restart
After restarting Ganglia on the head-node, you need to restart Ganglia on the compute nodes as well. The following shows how to do this on the command line; you can also use the QluMan RXengine for this, possibly creating a pre-defined command:
0 root@cl-head ~ # dsh -a service ganglia-monitor restart
0 root@cl-head ~ # dsh -a service gmetric stop
0 root@cl-head ~ # dsh -a service gmetric start
Finally, you can start Nagios again:
0 root@cl-head ~ # service nagios3 start
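The steps above can be collected in a small shell function so that the order is always the same. This is a sketch under stated assumptions rather than a tool shipped with Qlustar: the names run, restart_ganglia, and DRY_RUN are hypothetical, while the commands themselves are exactly the ones listed in this section. The function stops at the first failing step, which leaves Nagios stopped so that it does not send mails while Ganglia is in an inconsistent state.

```shell
#!/bin/sh
# Hypothetical wrapper around the restart sequence from this section.
# Set DRY_RUN=1 to print the commands instead of executing them.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Must be called as root on the head-node.
restart_ganglia() {
    # 1. Stop Nagios so that it does not send warning mails.
    run service nagios3 stop &&
    # 2. Restart Ganglia on the head-node.
    run service ganglia-monitor restart &&
    run service gmetad restart &&
    # 3. Restart Ganglia on all compute nodes.
    run dsh -a service ganglia-monitor restart &&
    run dsh -a service gmetric stop &&
    run dsh -a service gmetric start &&
    # 4. Start Nagios again only if every previous step succeeded.
    run service nagios3 start
}
```

To preview the sequence without touching any service, set DRY_RUN=1 in the shell and then call restart_ganglia. If the function returns a non-zero status, check the failing Ganglia service first and then start Nagios by hand.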