1. The Qlustar HPC Core Stack
The HPC core stack is a Qlustar component that results from the choice of Spack as the package manager for HPC applications on Qlustar clusters.
1.1. Motivation
Rather than aiming to provide as much HPC functionality as possible via OS packages (debs/rpms), the HPC core stack draws a clear line between Qlustar-related system software (provided as OS packages) and HPC application software (provided by Spack). In practice, the line is drawn at the MPI level: all MPI implementations are provided by Spack, whereas their system-related dependencies are provided by Qlustar packages integrated into Qlustar image modules.
This separation is necessary to ensure that the system-side components of a Qlustar cluster, such as the workload manager and the InfiniBand stack, work correctly with the application software. Slurm or RDMA packages built by Spack can differ in version and configuration from the ones running on the Qlustar head-node and images. Leaving these packages under Spack control would therefore often result in installed software that does not function properly or runs inefficiently.
1.2. Components
Currently, the following packages are part of the Qlustar HPC core stack:
- hwloc
- rdma-core
- slurm
- pmix
- ucx
- Spack
These packages are provided in the same version on all supported Qlustar edge-platforms. This ensures that, at the HPC application level, cluster users notice no difference when working on different edge-platforms. Admins can therefore switch between edge-platforms without disrupting user workflows if the need arises.
The packages of the core stack are declared as so-called external packages in the Qlustar Spack deb/rpm package. Spack therefore uses the system-provided installations instead of building its own copies, which ensures that packages built by Qlustar Spack work correctly on Qlustar clusters.
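To make the notion of external packages more concrete, the following Python sketch renders the kind of `packages.yaml` entries that Spack uses to declare externals. It is an illustration only: the helper name `render_externals`, the version numbers, and the `/usr` prefix are assumptions, not the values shipped in the Qlustar Spack package.

```python
# Placeholder versions for illustration only; the real values are whatever
# the Qlustar Spack deb/rpm package declares.
CORE_STACK = {
    "hwloc": "2.9.3",
    "rdma-core": "47.0",
    "slurm": "23.02.6",
    "pmix": "4.2.6",
    "ucx": "1.15.0",
}


def render_externals(packages, prefix="/usr"):
    """Return packages.yaml text declaring each package as a Spack external.

    `prefix` is an assumed installation prefix for the OS packages.
    """
    lines = ["packages:"]
    for name, version in packages.items():
        lines += [
            f"  {name}:",
            "    externals:",
            f"    - spec: {name}@{version}",
            f"      prefix: {prefix}",
            # buildable: false tells Spack never to build its own copy.
            "    buildable: false",
        ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_externals(CORE_STACK), end="")
```

With `buildable: false`, Spack resolves any dependency on these packages to the installation under the given prefix, which is the behavior the external-package declaration is meant to achieve.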
1.3. Release cycles
The HPC core stack has a shorter release cycle than Qlustar itself. It is coupled to the release cycle of Slurm and is therefore currently 9 months. This ensures that Qlustar always provides a Slurm version that is still supported upstream. The aim is to update to the most recent major Slurm version once it reaches the minor release level of .5 or .6, e.g. to switch to the 23.02.x series as soon as 23.02.5/6 has been released. At that point, all other packages of the core stack are also updated to current versions.
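The update rule can be expressed as a small check. The sketch below assumes Slurm version strings of the form `YY.MM.patch` and interprets "minor release level of .5 or .6" as a patch number of at least 5; both the threshold parameter and the function names are hypothetical and not part of any Qlustar tooling.

```python
import re


def parse_slurm_version(version):
    """Split 'YY.MM.patch' into (series, patch), e.g. '23.02.5' -> ('23.02', 5)."""
    match = re.fullmatch(r"(\d+\.\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unexpected Slurm version string: {version!r}")
    return match.group(1), int(match.group(2))


def ready_for_core_stack(version, min_patch=5):
    """Return True if the release has reached the assumed patch threshold."""
    _, patch = parse_slurm_version(version)
    return patch >= min_patch


if __name__ == "__main__":
    for candidate in ("23.02.1", "23.02.5", "23.02.6"):
        print(candidate, ready_for_core_stack(candidate))
```

Under these assumptions, 23.02.1 would not yet trigger a core stack update, whereas 23.02.5 and 23.02.6 would.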