Prometheus

This section describes the Qlustar Prometheus setup.

Prometheus Components

Prometheus is an open-source systems monitoring and alerting toolkit. It collects and stores its metrics as time series data, i.e. metrics information is stored with the timestamp at which it was recorded, alongside optional key-value pairs called labels. The following Prometheus components are employed in the Qlustar monitoring stack.
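As a rough illustration of this data model, the following Python sketch groups timestamped samples into series that are identified by the metric name plus the label set. The names used here (Sample, group_into_series, the instance label values) are illustrative assumptions and are not part of Qlustar or of the Prometheus code base.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple

# A series is identified by the metric name together with its label set.
SeriesKey = Tuple[str, FrozenSet[Tuple[str, str]]]


@dataclass
class Sample:
    name: str               # metric name
    labels: Dict[str, str]  # optional key-value pairs
    timestamp: float        # seconds since the epoch
    value: float

    def series_key(self) -> SeriesKey:
        return (self.name, frozenset(self.labels.items()))


def group_into_series(samples: List[Sample]) -> Dict[SeriesKey, List[Tuple[float, float]]]:
    """Collect (timestamp, value) pairs per series, ordered by timestamp."""
    series = defaultdict(list)
    for sample in samples:
        series[sample.series_key()].append((sample.timestamp, sample.value))
    return {key: sorted(points) for key, points in series.items()}


# Three samples that belong to two different series (hypothetical values).
samples = [
    Sample("node_load1", {"instance": "node-01"}, 1700000015.0, 0.7),
    Sample("node_load1", {"instance": "node-01"}, 1700000000.0, 0.5),
    Sample("node_load1", {"instance": "node-02"}, 1700000000.0, 1.2),
]
series = group_into_series(samples)
```

Two samples with the same name but different label values end up in different series, which is why labels can be used to tell nodes apart in queries.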

Prometheus Server

The main Prometheus server scrapes and stores time series data provided by the so-called Prometheus Exporters (see below). On Qlustar it runs on the head-node and, for security reasons, its web interface is only reachable from the head-node itself. To access it from your local machine, use ssh port forwarding (ssh -L 9090:localhost:9090 root@<headnode>, where <headnode> is the hostname of the head-node) and then open http://localhost:9090 in the browser.
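Once the port forwarding is active, the server can also be queried programmatically through the Prometheus HTTP API. The following is a minimal sketch, assuming the tunnel forwards local port 9090 as in the ssh command above; the helper names build_query_url and instant_query are made up for this example.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the ssh tunnel described above forwards local port 9090.
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(expr: str, base_url: str = PROMETHEUS_URL) -> str:
    """Return the instant-query URL of the Prometheus HTTP API for a PromQL expression."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def instant_query(expr: str, base_url: str = PROMETHEUS_URL, timeout: float = 10.0) -> list:
    """Run an instant query and return the list of result entries."""
    with urllib.request.urlopen(build_query_url(expr, base_url), timeout=timeout) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    return payload["data"]["result"]
```

With the tunnel open, instant_query("up") should return one entry per scrape target, where a value of 1 means that the last scrape of that target succeeded.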

The configuration file /etc/prometheus/prometheus.yml is auto-generated by QluMan from the settings made in the corresponding configuration dialogs.

Prometheus Exporters

The main exporter employed by Qlustar to gather node-specific data is the well-known node exporter. Custom metrics added by Qlustar are integrated into the node exporter as so-called text-file collectors.

The names of Qlustar metrics always start with the string ql_.
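To illustrate how a text-file collector feeds custom metrics to the node exporter, the sketch below renders gauge values in the Prometheus text exposition format and writes them atomically to a .prom file. It is not the mechanism Qlustar itself uses to produce its ql_ metrics. The metric name site_scratch_used_bytes, its mountpoint label and the helper names are hypothetical, and the target directory has to be the one the node exporter reads (set via its --collector.textfile.directory option), which should be checked on the actual installation. The ql_ prefix is deliberately avoided here, since it marks metrics supplied by Qlustar.

```python
import os
import tempfile
from typing import Dict


def render_gauge(name: str, help_text: str, samples: Dict[str, float], label: str) -> str:
    """Render one gauge metric, with one sample per label value, in text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for label_value, value in sorted(samples.items()):
        # Escape backslash, double quote and newline inside the label value.
        escaped = label_value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
        lines.append(f'{name}{{{label}="{escaped}"}} {value}')
    return "\n".join(lines) + "\n"


def write_prom_file(directory: str, filename: str, content: str) -> str:
    """Write content via a temporary file and an atomic rename, so that the
    node exporter never reads a half-written file."""
    if not filename.endswith(".prom"):
        raise ValueError("the textfile collector only reads files ending in .prom")
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        tmp.write(content)
    os.chmod(tmp_path, 0o644)  # mkstemp creates the file with mode 0600
    final_path = os.path.join(directory, filename)
    os.replace(tmp_path, final_path)
    return final_path
```

A periodic job (for example a cron job or systemd timer) could call render_gauge and write_prom_file; the node exporter then exposes the file's content on its next scrape.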

Depending on the purpose of a Qlustar cluster, additional exporters may be employed. For HPC clusters with Slurm, the Slurm Exporter is auto-configured and accompanied by a corresponding Grafana dashboard. For Qlustar Kubernetes clusters, the kube-state-metrics (KSM) exporter may be deployed as a K8s workload. It is then scraped by the Prometheus Server, and the data can be visualized with the Qlustar-supplied Grafana dashboard.

Prometheus Alertmanager

The Prometheus Alertmanager handles alerts sent by client applications such as the Prometheus server. It takes care of deduplicating, grouping, and routing them to the correct receiver integration such as email, on-call notification systems, and chat platforms. It also allows for silencing and inhibition of alerts. On a Qlustar cluster, the Alertmanager config /etc/prometheus/alertmanager.yml as well as the definitions of alerts and their rules are all managed by the corresponding QluMan component.

Furthermore, custom Alertmanager e-mail templates are supplied and activated. By default, all e-mail alerts are sent to root@localhost and then routed to the root mail aliases configured in /etc/aliases on the head-node(s), just like any other mail to the root user.
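One way to check the mail path end to end is to inject a short-lived test alert through the Alertmanager HTTP API (POST /api/v2/alerts). The sketch below assumes that the Alertmanager listens on its default port 9093 and is reachable on localhost, for example directly on the head-node or through an additional ssh port forwarding; neither is confirmed by this section. The alert name, labels and helper names are hypothetical, and whether the alert actually results in an e-mail depends on the routes generated by QluMan.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

# Assumption: Alertmanager default port, reachable on localhost.
ALERTMANAGER_URL = "http://localhost:9093"


def build_test_alert(now: datetime, minutes: int = 5) -> list:
    """Build the JSON payload for one test alert that expires after the given minutes."""
    return [{
        "labels": {"alertname": "ManualTestAlert", "severity": "info"},  # hypothetical labels
        "annotations": {"summary": "Manually injected test alert"},
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def post_alerts(alerts: list, base_url: str = ALERTMANAGER_URL, timeout: float = 10.0) -> int:
    """Send the alerts to the Alertmanager and return the HTTP status code."""
    request = urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status
```

A call such as post_alerts(build_test_alert(datetime.now(timezone.utc))) submits the alert; if a matching route exists, the resulting mail should arrive at the addresses listed for root in /etc/aliases.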