QluMan Monitoring

This section describes the QluMan monitoring infrastructure. The initial implementation (milestone 1.0) provides a filesystem-based interface for defining two kinds of components: checks, which determine the health state of a node, and triggers, which react to changes in that health state.

The qluman-execd server running on each node periodically executes the monitoring checks and records the health state of the node in the state file /run/qluman/node-health. If the health state of any check changes, qluman-execd determines whether triggers are defined for the affected category and executes those it finds.

The Qlustar OS images automatically provide some monitoring checks and triggers: checks for essential generic services such as time synchronization, and a trigger that coordinates the node health with the node state in the Slurm workload manager. To extend this base functionality, QluMan also allows admins to write their own monitoring checks and trigger scripts.

Monitoring checks

Monitoring checks are files whose names end in .check. They must be located either in /usr/lib/qluman/monitoring (pre-defined checks from Qlustar OS images) or in /etc/qlustar/common/monitoring (user-defined checks). Each file contains a list of key=value pairs that define and configure the check.

A user-defined check must start with type=custom to tell qluman-execd what kind of check it is. Other types are reserved for built-in checks. For a custom check, the following key=value pairs are defined:

  1. categories=…

    The categories key declares a comma-separated list of categories the check applies to. Categories can be user-defined, but each name must be a valid filename. Every category used in a check is listed in the Summary section of the node-health file. Categories are used to trigger category-specific actions when the health state of a node changes.

    A category is healthy only if all checks listing it succeed. If any of these checks fails, the category’s state changes to unhealthy and the error text from the failed check(s) is added to the category’s failure reason.

  2. interval=<seconds>

    The interval key determines how often a check is run, in seconds. It also limits the time a check may take before it is considered stuck: if a check runs longer than the configured interval, it is marked as failed with the failure reason timeout. A new run of a check is not launched before the previous run has completed.

  3. command=<path>

    Custom checks have no pre-defined code to run. Instead, the user declares a command that is executed each time the check runs. Commands are launched using bash -c and can be a simple shell expression or the name of a binary or script. The exit code of the command determines the health status of the check. In case of failure, the output of the command, if any, is used as the failure reason.
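The interplay of interval and command described above can be modelled with a few lines of bash. This is only a sketch of the documented behaviour, not the qluman-execd implementation: run_check is a hypothetical helper, the use of GNU timeout to enforce the interval is an assumption, and so is capturing stderr along with stdout as the failure reason.

```shell
#!/bin/bash
# Illustrative model of a single check run; run_check is a hypothetical
# helper and not part of QluMan.
run_check() {
    local interval="$1" command="$2" output status
    # Launch the command via bash -c, as the text describes. Assumption:
    # the run is aborted once it exceeds <interval> seconds.
    output=$(timeout "$interval" bash -c "$command" 2>&1)
    status=$?
    if [ "$status" -eq 0 ]; then
        echo "healthy"
    elif [ "$status" -eq 124 ]; then
        # GNU timeout exits with status 124 when the time limit is reached.
        echo "failed: timeout"
    else
        # Any output of the failed command serves as the failure reason.
        echo "failed: ${output}"
    fi
}

run_check 10 'true'
run_check 10 'echo "A guaranteed failure"; false'
run_check 1 'sleep 3'
```

The three calls cover the three outcomes: a successful check, a failed check with its output as the reason, and a check that overruns its interval.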

Example: A check affecting slurmd that always succeeds

type=custom
categories=slurmd
interval=10
command=true

Example: A check that fails for no reason

type=custom
categories=useraccess
interval=10
command=echo "A guaranteed failure"; false
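A more realistic command would typically be a small script. The following is a minimal sketch of such a script, assuming a check that should fail when a filesystem is nearly full. The function name check_disk_usage, the mount point and the threshold are illustrative choices and not part of QluMan.

```shell
#!/bin/bash
# Hypothetical custom check: fail when a filesystem is nearly full.
check_disk_usage() {
    local mountpoint="$1" limit="$2" used
    # df -P prints one line per filesystem in POSIX format;
    # column 5 of the second line holds the usage in percent.
    used=$(df -P "$mountpoint" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }')
    if [ "$used" -ge "$limit" ]; then
        # The output becomes the failure reason,
        # the non-zero status marks the check as failed.
        echo "$mountpoint is ${used}% full (limit ${limit}%)"
        return 1
    fi
    return 0
}

# Demonstration call with a limit that can never be reached.
check_disk_usage / 101 && echo "healthy"
```

Saved as an executable script that ends with a call such as check_disk_usage /var 90, this could be referenced from a check file through the command key.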

Health triggers

qluman-execd runs all health checks periodically, at the interval defined by each check, and summarizes the results per category. Every time the status of a category changes, the health triggers in the corresponding directory /etc/qlustar/common/monitoring/<category>.triggers/ are executed.
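How check results roll up into category states can be illustrated with a short bash model. It is a sketch under stated assumptions: summarize_categories and its input format (one line per check with name, categories and exit code) are invented for this illustration and say nothing about how qluman-execd stores results internally. It requires bash 4 or later for associative arrays.

```shell
#!/bin/bash
# Hypothetical model of category aggregation.
# Input on stdin: lines of "<check-name> <categories> <exit-code>".
summarize_categories() {
    local -A state=()
    local -a list
    local check categories code category
    while read -r check categories code; do
        # Split the comma-separated category list of this check.
        IFS=',' read -ra list <<< "$categories"
        for category in "${list[@]}"; do
            # A category starts out healthy and becomes unhealthy
            # as soon as one check listing it fails.
            [ -n "${state[$category]}" ] || state[$category]=healthy
            [ "$code" -eq 0 ] || state[$category]=unhealthy
        done
    done
    for category in "${!state[@]}"; do
        echo "$category=${state[$category]}"
    done | sort
}

summarize_categories <<'EOF'
ntp-check time 0
slurm-check slurmd,useraccess 0
disk-check useraccess 1
EOF
```

In this example the category useraccess ends up unhealthy because one of the two checks listing it failed, while slurmd and time stay healthy.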

To create a health trigger, create the triggers directory for the category and place the trigger in it. Placing symlinks in the directory, rather than the trigger files themselves, is recommended.

A trigger file must be executable.
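Setting up a trigger through a symlink might look like the following sketch. The category slurmd is taken from the example check above. The script name log-slurmd-change, the scripts directory and the MONITORING_DIR variable are hypothetical. The arguments or environment that qluman-execd passes to a trigger are not described in this section, so the example trigger only logs that it was invoked.

```shell
#!/bin/bash
# MONITORING_DIR defaults to a scratch directory so the sketch can be tried
# safely; on a real node it would be /etc/qlustar/common/monitoring.
MONITORING_DIR=${MONITORING_DIR:-$(mktemp -d)}
SCRIPT_DIR="$MONITORING_DIR/scripts"          # hypothetical home of the real scripts
TRIGGER_DIR="$MONITORING_DIR/slurmd.triggers" # <category>.triggers directory

mkdir -p "$SCRIPT_DIR" "$TRIGGER_DIR"

# Hypothetical trigger script that only records its invocation in syslog.
cat > "$SCRIPT_DIR/log-slurmd-change" <<'EOF'
#!/bin/bash
logger -t qluman-trigger "health state of category slurmd changed"
EOF

# Triggers must be executable.
chmod +x "$SCRIPT_DIR/log-slurmd-change"

# Place a symlink into the triggers directory instead of the script itself.
ln -sf "$SCRIPT_DIR/log-slurmd-change" "$TRIGGER_DIR/log-slurmd-change"
```

Keeping the script outside the triggers directory makes it possible to link the same trigger into the directories of several categories.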