
Qlustar Cluster OS 11.0

Release Notes

Abstract

The 11.0 release updates Qlustar's core platform to the current Ubuntu 18.04 LTS. The CentOS edge platform is now based on 7.6. As a result of our continuous platform optimization and simplification process, we moved to dnsmasq as a replacement for the previously used ISC DHCP server and atftp TFTP server. dnsmasq also provides cluster-internal name services (DNS), replacing the NIS hosts map, and acts as a DNS proxy.
In addition to the dnsmasq management interface, the second major new feature of QluMan is the ability to handle network filesystem resources. Initially, this supports NFS mounts, including RDMA connections, and a mechanism to automatically choose the optimal network path to the NFS server. Mount resources are implemented as systemd automount units. This new interface replaces the previously used automount daemon, which is now deactivated by default.
The highlights among the numerous component updates and bug fixes are: Kernel 4.19.x, Slurm 18.08.x, CUDA 10.1, OpenMPI 4.0.1, BeeGFS 7.1.3.
1. Basic Info
2. New features
2.1. dnsmasq
2.2. QluMan 11.0
3. Major component updates
4. Other notable package version updates
5. General changes/improvements
6. Update instructions
7. Changelogs

1. Basic Info

The Qlustar 11.0 release is based on Ubuntu 18.04.2. It includes all security fixes and other package updates published before June 13th, 2019. Security updates relevant to Qlustar 11.0 that appear after this date will be announced on the Qlustar website and in the Qlustar security newsletter. Supported edge platforms are Ubuntu 18.04 (Bionic) and CentOS 7.6 with integration of OpenHPC 1.3.8.

2. New features

2.1. dnsmasq

dnsmasq is now employed as a central Qlustar component. It consolidates three network services previously provided by three independent daemons, which significantly reduces complexity on the head-node. In addition, it provides DNS proxy services by default, which had to be configured manually in earlier Qlustar releases.
DHCP
dnsmasq now provides cluster-internal DHCP services replacing the previously used ISC DHCP server software.
TFTP
dnsmasq also acts as a TFTP/PXE boot server making the previously used atftpd obsolete.
Cluster-internal DNS
dnsmasq provides DNS name resolution both for cluster-internal nodes and, via its proxy functionality, for external machines. As a consequence, the NIS hosts map is no longer needed and has been removed for new installations.
Legacy support
When updating from an earlier release, you have the option to keep the previous DHCP/TFTP setup through a legacy option for now (see step 7 of the update instructions).
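
The consolidation described above can be pictured with a small configuration sketch. The fragment below is not the file QluMan generates; it only illustrates how a single dnsmasq instance can cover DHCP, TFTP/PXE and DNS at once. The interface name, domain, addresses, MAC address, host name and TFTP root are all hypothetical placeholders.

```shell
# Write an illustrative dnsmasq config to a scratch file (never to /etc).
# All values are hypothetical; on a Qlustar head-node QluMan writes the real config.
conf=$(mktemp)
cat > "$conf" <<'EOF'
# Serve only the cluster-internal interface (assumed name)
interface=eth0

# DHCP: dynamic range plus one fixed MAC -> name/IP assignment
dhcp-range=192.0.2.100,192.0.2.200,12h
dhcp-host=52:54:00:12:34:56,node01,192.0.2.101

# TFTP/PXE: built-in TFTP server and the boot file announced via DHCP
enable-tftp
tftp-root=/var/lib/tftpboot
dhcp-boot=pxelinux.0

# DNS: answer cluster-internal names locally, forward the rest upstream
domain=cluster.example
local=/cluster.example/
expand-hosts
server=198.51.100.53
EOF
echo "Sample config written to $conf"
```

If dnsmasq is installed, `dnsmasq --test --conf-file="$conf"` should run a syntax check on the file without starting the daemon.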

2.2. QluMan 11.0

Network Mount Management
QluMan has a new Config Class that allows you to configure network mounts and assign them to nodes. Initially, NFS mounts, including RDMA variants, are supported. They are set up on the nodes as systemd automount units.
QluMan automatically activates NFSoRDMA on clients that support it, with an option to switch back to TCP if needed. It also lets you define priorities for the available networks, so that the best network is chosen for a mount if a node has multiple paths to the corresponding NFS server.
dnsmasq Management
The newly introduced dnsmasq service is fully manageable via QluMan. A new dialog has been introduced to add external DNS servers and search domains, as well as names/IPs for cluster-external machines and other global network-related settings.
Preview node-specific configs
The context menu of a node in the Enclosure View now includes an entry that allows a preview of all host-specific files/configs that are assigned and sent to a node when booting.
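
To give an idea of what a network mount implemented as a systemd automount unit looks like, the following sketch writes a hand-made mount/automount unit pair to a scratch directory. It is not the output of QluMan; the server name, export path, mount point and idle timeout are hypothetical, and the RDMA options shown are the conventional NFS client options rather than values confirmed for Qlustar.

```shell
# Create an illustrative mount/automount unit pair in a scratch directory.
# systemd requires the unit file name to match the mount path: /data -> data.mount
unitdir=$(mktemp -d)

cat > "$unitdir/data.mount" <<'EOF'
[Unit]
Description=NFS mount of /data (illustrative)

[Mount]
What=nfs-server.cluster.example:/export/data
Where=/data
Type=nfs
# NFS over RDMA on its conventional port; proto=tcp would select plain TCP
Options=proto=rdma,port=20049
EOF

cat > "$unitdir/data.automount" <<'EOF'
[Unit]
Description=Automount point for /data (illustrative)

[Automount]
Where=/data
# Unmount again after ten minutes without access
TimeoutIdleSec=600

[Install]
WantedBy=multi-user.target
EOF

ls "$unitdir"
```

With such a pair, only the automount unit is enabled; the file system itself is mounted on first access to the mount point.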

3. Major component updates

Kernel 4.19
Qlustar 11.0 is based on the 4.19 LTS kernel series (Ubuntu only). The new kernel option mitigations=off has been added to QluMan for an easy way to run compute nodes without performance penalties from CPU security bug mitigations (Spectre etc.).
Slurm
Qlustar 11.0 introduces the Slurm 18.08 series with the current version being 18.08.7.
OpenMPI
Qlustar 11.0 upgrades to OpenMPI 4.0.1, which now uses the UCX transport by default. Furthermore, we added support for multiple gcc versions: there is now a gcc7 flavor based on the Ubuntu default compiler (gcc 7.4.0) and a gcc8 flavor based on gcc 8.3.
Nvidia CUDA
Qlustar 11.0 provides optimal support for Nvidia GPU hardware by supplying pre-compiled and up-to-date kernel drivers as well as CUDA 10.1.
BeeGFS
Qlustar 11.0 has integrated the most recent BeeGFS release 7.1.3 for clients and servers with ready-to-use image modules.
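
Whether a node was actually booted with the mitigations=off kernel option mentioned above can be checked from the running system. The sketch below is one possible way to do this, using the kernel command line in /proc/cmdline and the per-vulnerability status files under /sys; the helper name has_mitigations_off is made up for this example.

```shell
# Return success if the given kernel command line contains mitigations=off.
# (has_mitigations_off is a hypothetical helper, not part of Qlustar.)
has_mitigations_off() {
    case " $1 " in
        *" mitigations=off "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Inspect the command line of the running kernel, if available.
if [ -r /proc/cmdline ]; then
    if has_mitigations_off "$(cat /proc/cmdline)"; then
        echo "mitigations=off is set on this node"
    else
        echo "mitigations=off is not set on this node"
    fi
fi

# Print the kernel's own per-vulnerability mitigation status.
for f in /sys/devices/system/cpu/vulnerabilities/*; do
    if [ -r "$f" ]; then
        printf '%s: %s\n' "${f##*/}" "$(cat "$f")"
    fi
done
```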

4. Other notable package version updates

  • rdma-core: 21.0 (Ubuntu only, on CentOS, the original RHEL OFED stack is used).
  • Intel/PGI Compiler support: The Qlustar wrapper packages have been updated to support the new version of the Intel parallel studio (2019) and PGI community edition 2018.4/10 (package qlustar-pgi-dev-tools). Corresponding OpenMPI package variants for these compilers are also provided (both Ubuntu only).
  • Lustre: 2.12.2
  • ZFS: 0.7.13
  • singularity: 3.2.0
  • openblas: 0.3.5
  • hwloc: 2.0.3
  • ucx: 1.5.1
  • libpsm2: 11.2.68
  • OmniPath (OPA) stack: 10.8.0.0.201/2

5. General changes/improvements

  • The automount daemon and the corresponding NIS maps have been replaced by the new network mount Config Class provided by QluMan (see section 2.2). Legacy setups based on automount/NIS will continue to function and will be supported.

6. Update instructions

  1. Preparations

    Upgrading to Qlustar 11.0 is only supported from a 10.1.x release.

    Note

    Make sure that you have no unwritten changes in the QluMan database. If you do, write them to disk as described in the QluMan Guide before proceeding with the update.
  2. Optionally clone chroots

    Clone existing Ubuntu 16.04 and CentOS7 chroots based on 10.1, and then upgrade the clones to 11.0. This allows for easy rollback.
  3. Update to 11.0 package sources list

    The Qlustar apt sources list needs to be changed as follows both on the head-node(s) and in all existing Ubuntu based chroot(s) that should be updated.
    0 root@cl-head ~ #
    apt update
    0 root@cl-head ~ #
    apt install qlustar-sources-list-11.0
    
    To prepare a CentOS7 based chroot for the upgrade, change into it and execute the following:
    (centos7-11.0) 0 root@cl-head ~ #
    sed -i -e 's/10\.1/11.0/g' /etc/yum.repos.d/qlustar-10.1-centos7.repo
    
  4. Update packages

    On the head-node execute
    0 root@cl-head ~ #
    apt update
    0 root@cl-head ~ #
    apt dist-upgrade
    

    Note

    When asked about whether you want to update the configuration file for some package, you should answer 'N' (keep the old version) unless you have a specific reason to change it.
    Change into each Ubuntu based chroot you want to update (e.g.)
    0 root@cl-head ~ #
    chroot-bionic
    
    and also execute:
    (bionic) 0 root@cl-head ~ #
    apt update
    (bionic) 0 root@cl-head ~ #
    apt dist-upgrade
    
    Change into each CentOS based chroot you want to update (e.g.)
    0 root@cl-head ~ #
    chroot-centos7
    
    and execute the same command twice:
    (centos7-11.0) 0 root@cl-head ~ #
    yum update
    (centos7-11.0) 0 root@cl-head ~ #
    yum update
    
    Confirm the import of the new Qlustar GPG key.
  5. Reboot head-node(s)

    Initially only reboot the head-node(s).
  6. Regenerating Qlustar images

    Regenerate your Qlustar images with the 11.0 image modules. To accomplish this, you'll have to select Version 11.0 in the QluMan Qlustar Images dialog. If you have newly cloned chroots, select those as well.

    Note

    If your images include image modules that have a version in their name (e.g. beegfs-6-server), make sure that you change to the corresponding module with the most recent version (e.g. beegfs-7-server).
  7. Migration to dnsmasq

    Migration to dnsmasq is not strictly required during this upgrade, but it is highly recommended. If you want to delay the migration for now, you can do so by checking the legacy checkbox in QluMan. In this case, you still have to disable systemd-timesyncd in order for ntp to work:
    0 root@cl-head ~ #
    service systemd-timesyncd stop
    systemctl disable systemd-timesyncd
    
    
    After this, you can reboot the head-node once more and proceed with step 8.

    Warning

    Support for the legacy setup with separate DHCP and TFTP servers will be discontinued starting with the 11.1 release, so you will have to migrate before updating to 11.1. To do so, follow the remainder of this step at any time.
    To start the migration, first install dnsmasq and disable the old DHCP and TFTP servers as well as some systemd services:
    0 root@cl-head ~ #
    apt install dnsmasq
    0 root@cl-head ~ #
    for s in isc-dhcp-server atftpd systemd-resolved systemd-timesyncd; do
      service "$s" stop
      systemctl disable "$s"
    done
    
    Now add at least one of your external DNS servers in the corresponding QluMan dialog and afterwards write the dnsmasq config.

    Warning

    This write step is essential. If you forget it, qlumand won't be able to start after a reboot and you'll be left with a system in an inconsistent state that needs manual repair.
    Also remove nis from the hosts entry in /etc/nsswitch.conf:
    0 root@cl-head ~ #
    sed -i -e 's/^hosts:.*/hosts:\t\tfiles dns/g' /etc/nsswitch.conf
    
    and change the dns-nameservers entry in /etc/network/interfaces to localhost, so that the head-node itself uses dnsmasq for DNS lookups:
    0 root@cl-head ~ #
    sed -i -e 's/  dns-nameservers.*/  dns-nameservers localhost/g' /etc/network/interfaces
    
    Finally reboot the head-node once more. Once it is up again, check that dnsmasq is running:
    0 root@cl-head ~ #
    service dnsmasq status
    
    If everything is working as expected, you can remove the now obsolete packages:
    0 root@cl-head ~ #
    apt remove atftpd isc-dhcp-server
    
  8. Write config file changes

    To activate all remaining changes in the QluMan database that were introduced by the update, they need to be written to disk now. Check the QluMan Guide about how to write such changes.
  9. Reboot all netboot nodes

    After the regeneration of the images is complete, and all the above steps have been done, you can reboot all other nodes in the cluster, including virtual FE nodes. This completes the update procedure.
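
Before editing /etc/nsswitch.conf and /etc/network/interfaces on the head-node as described in step 7, it can be reassuring to see what the two substitutions do. The following sketch applies them to scratch copies with made-up content (the sample file contents and addresses are assumptions, and GNU sed is assumed for the in-place option and the \t escape), so nothing on the real system is touched.

```shell
# Dry run of the step 7 substitutions on scratch files with sample content.
workdir=$(mktemp -d)

# Hypothetical "before" state of the two files
printf 'passwd:\t\tcompat nis\nhosts:\t\tfiles nis dns\n' > "$workdir/nsswitch.conf"
printf 'auto eth0\niface eth0 inet static\n  address 192.0.2.10\n  dns-nameservers 198.51.100.53 198.51.100.54\n' \
    > "$workdir/interfaces"

# Drop nis from the hosts lookup order
sed -i -e 's/^hosts:.*/hosts:\t\tfiles dns/g' "$workdir/nsswitch.conf"

# Point the resolver of the head-node at the local dnsmasq
sed -i -e 's/  dns-nameservers.*/  dns-nameservers localhost/g' "$workdir/interfaces"

# Show the resulting lines for inspection
grep '^hosts:' "$workdir/nsswitch.conf"
grep 'dns-nameservers' "$workdir/interfaces"
```

Only the hosts line and the dns-nameservers line should differ from the sample input; all other lines are left as they were.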

7. Changelogs

A detailed log of changes in the image modules can be found in the directories /usr/share/doc/qlustar-module-<module-name>-*-amd64-11.0.0. As an example, the directory /usr/share/doc/qlustar-module-core-xenial-amd64-11.0.0 contains a summary changelog in core.changelog, a complete list of packages with version numbers entering the current core module in core.packages.version.gz, a complete changelog of the core module's package versions in core.packages.version.gz, and finally a complete log of changed files in core.contents.changelog.gz.
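
To get an overview of which changelog files are present, the directories named above can be listed with a short loop. This is a minimal sketch: the function name list_module_changelogs is made up for this example, and the optional argument exists only so that a different documentation root can be passed in.

```shell
# List the files in every 11.0.0 image module doc directory.
# (list_module_changelogs is a hypothetical helper, not a Qlustar command.)
list_module_changelogs() {
    doc_root=${1:-/usr/share/doc}
    for dir in "$doc_root"/qlustar-module-*-amd64-11.0.0; do
        # Skip the unexpanded pattern if no module directory exists
        [ -d "$dir" ] || continue
        echo "== ${dir##*/}"
        for f in "$dir"/*; do
            if [ -f "$f" ]; then
                echo "   ${f##*/}"
            fi
        done
    done
}

list_module_changelogs
```

The compressed files can then be read without unpacking them, for example with zcat or zless.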