Prometheus
This section describes the Qlustar Prometheus setup.
Prometheus Components
Prometheus is an open-source systems monitoring and alerting toolkit. It collects and stores its metrics as time series data, i.e. metrics information is stored with the timestamp at which it was recorded, alongside optional key-value pairs called labels. The following Prometheus components are employed in the Qlustar monitoring stack.
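The data model described above can be illustrated with a minimal Python sketch. It is not Prometheus code. The helper names (`Sample`, `make_sample`, `group_into_series`) and the `instance` label values are assumptions made for illustration. The sketch only shows that a time series is identified by a metric name plus its label set, and that each sample carries a timestamp and a value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One recorded value of a metric at a given point in time."""
    metric: str
    value: float
    timestamp_ms: int
    labels: tuple = ()  # sorted (key, value) pairs, so the label set is hashable


def make_sample(metric, value, timestamp_ms, **labels):
    """Build a sample; labels are the optional key-value pairs."""
    return Sample(metric, value, timestamp_ms, tuple(sorted(labels.items())))


def group_into_series(samples):
    """Group samples into time series keyed by (metric name, label set)."""
    series = {}
    for sample in samples:
        key = (sample.metric, sample.labels)
        series.setdefault(key, []).append((sample.timestamp_ms, sample.value))
    for points in series.values():
        points.sort()  # order each series by timestamp
    return series
```

Two samples with the same metric name but different label values (for example, different nodes) end up in different series, which is what makes per-node queries possible.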
Prometheus Server
The main Prometheus server scrapes and stores time series data provided by the so-called Prometheus Exporters (see below). On Qlustar, it runs on the head-node and, for security reasons, its web interface is only accessible directly on that node. To access it from your local machine, use ssh port forwarding (ssh -L 9090:localhost:9090 root@<headnode>, where <headnode> is the hostname of the head-node) and then open http://localhost:9090 in the browser.
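Once the port forwarding is active, the same local address can also be queried from a script through the standard Prometheus HTTP API endpoint /api/v1/query. The following is a minimal sketch that assumes the tunnel from the ssh command above is running. The function names are illustrative and not part of Qlustar.

```python
import json
import urllib.parse
import urllib.request

# Local end of the ssh tunnel (assumption: forwarding as shown above is active).
BASE_URL = "http://localhost:9090"


def build_query_url(expr, base_url=BASE_URL):
    """Return the URL of an instant query for the PromQL expression `expr`."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def parse_instant_vector(payload):
    """Turn a decoded API response into a list of (labels, value) tuples."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"unexpected result type: {data['resultType']}")
    # Each result carries its labels and a [timestamp, "value"] pair.
    return [(item["metric"], float(item["value"][1])) for item in data["result"]]


def instant_query(expr, base_url=BASE_URL, timeout=10):
    """Run an instant query against the server behind the tunnel."""
    with urllib.request.urlopen(build_query_url(expr, base_url), timeout=timeout) as resp:
        return parse_instant_vector(json.load(resp))
```

For example, `instant_query("up")` would return one entry per scrape target, with a value of 1.0 for targets that were reachable at the last scrape.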
The configuration file /etc/prometheus/prometheus.yml is auto-generated by QluMan from the configs created with the corresponding configuration dialogs.
Prometheus Exporters
The main exporter employed by Qlustar to gather node-specific data is the well-known node exporter. Custom metrics added by Qlustar are integrated into the node exporter as so-called text-file collectors.
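To illustrate how a text-file collector works in general: the node exporter reads files ending in .prom from the directory given by its --collector.textfile.directory option and exposes their contents alongside its own metrics. The sketch below writes such a file in the Prometheus text exposition format. It is not the Qlustar implementation. The metric name `example_jobs_pending`, its label, and the target directory are hypothetical.

```python
import os
import tempfile


def _escape(value):
    """Escape a label value for the text exposition format."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_metric(name, value, labels=None, help_text=None, metric_type="gauge"):
    """Render one metric sample with its HELP and TYPE header lines."""
    lines = []
    if help_text:
        lines.append(f"# HELP {name} {help_text}")
    lines.append(f"# TYPE {name} {metric_type}")
    label_str = ""
    if labels:
        pairs = ",".join(f'{k}="{_escape(v)}"' for k, v in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    lines.append(f"{name}{label_str} {value}")
    return "\n".join(lines) + "\n"


def write_textfile(directory, filename, content):
    """Write `content` atomically so the exporter never reads a partial file."""
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(content)
        os.chmod(tmp_path, 0o644)  # readable by the exporter's user
        os.replace(tmp_path, os.path.join(directory, filename))
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Writing to a temporary file and renaming it into place matters because the exporter may read the directory at any moment; a rename within the same directory is atomic, so a scrape sees either the old or the new file.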
The names of Qlustar metrics always start with the string |
Depending on the purpose of a Qlustar cluster, additional exporters may be employed. For HPC clusters with Slurm, the Slurm Exporter is auto-configured and accompanied by a corresponding Grafana dashboard. For Qlustar Kubernetes clusters, the kube-state-metrics (KSM) exporter may be deployed as a K8s workload. It is then scraped by the Prometheus Server, and the data can be visualized with the Qlustar-supplied Grafana dashboard.
Prometheus Alertmanager
The Prometheus Alertmanager handles alerts sent by client applications such as the Prometheus server. It takes care of deduplicating, grouping, and routing them to the correct receiver integration such as email, on-call notification systems, and chat platforms. It also allows for silencing and inhibition of alerts. On a Qlustar cluster, the Alertmanager config /etc/prometheus/alertmanager.yml as well as the definitions of alerts and their rules are all managed by the corresponding QluMan component.
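The deduplication and grouping steps mentioned above can be pictured with a toy Python sketch. It is a simplified illustration, not Alertmanager's actual implementation: alerts with an identical label set are treated as the same alert, and the remaining alerts are bundled by the values of the labels listed in `group_by` (the name mirrors the route option of the same name in the Alertmanager configuration). The alert names and labels used with it are hypothetical.

```python
from collections import defaultdict


def fingerprint(alert):
    """Identify an alert by its complete label set."""
    return tuple(sorted(alert["labels"].items()))


def group_alerts(alerts, group_by):
    """Deduplicate alerts and bundle them by the values of the `group_by` labels."""
    groups = defaultdict(dict)
    for alert in alerts:
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        # Same fingerprint means same alert: a repeat replaces the earlier copy.
        groups[key][fingerprint(alert)] = alert
    return {key: list(members.values()) for key, members in groups.items()}
```

Grouping by `["alertname"]`, for instance, would turn the same alert firing on many nodes into a single notification that lists all affected nodes, rather than one notification per node.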
Furthermore, custom Alertmanager e-mail templates are supplied and activated. By default, all e-mail alerts are sent to root@localhost and then routed to the root mail aliases configured in /etc/aliases on the head-node(s), just like any other mail to the root user.