Qlustar Cluster OS 10.0

Release Notes

Abstract

The 10.0 release starts a new era for Qlustar, marking the beginning of its 100% open-source future. It ships version 10.0 of the management software QluMan, which is now entirely free software, licensed under the GNU Affero General Public License. The QluMan messaging framework QluNet has been put under the GNU Lesser General Public License.
On the feature side, we introduce workstation management as a whole new use case for Qlustar. Based on the XFCE desktop environment, you can now manage workstations as easily as cluster nodes.
The highlight of QluMan 10.0 is a powerful and flexible new network management interface. Under the hood, QluMan's messaging framework QluNet was revamped to use Google’s Protocol Buffers as the data exchange format for QluMan messages. Protocol Buffers are a way of encoding structured data in an efficient yet extensible format and provide a strong foundation for QluMan's future.
As a result of Q-Leap development on hardware sponsored by Intel, Qlustar 10 introduces out-of-the-box support for the OmniPath network technology. All components necessary to set up and operate an OmniPath network, including the fabric manager, are available as Qlustar packages.
The highlights among the numerous component updates and bug fixes are: Kernel 4.9.x, Slurm 17.02.7, CUDA 9.0, BeeGFS 6.18, Lustre 2.10.3 and ZFS 0.7.5.
1. Basic Info
2. New features
2.1. QluMan 10.0
2.2. Workstation Management
2.3. Support for OmniPath networks
2.4. LDAP integration with sssd
3. Major changes
4. Major component updates
5. Other notable package version updates
6. General changes/improvements
7. Update instructions
8. Changelogs

1. Basic Info

The Qlustar 10.0 release is based on Ubuntu 16.04.4. It includes all security fixes and other package updates published before April 12th 2018. Security updates relevant to Qlustar 10.0 that appear after this date will be announced on the Qlustar website and in the Qlustar security newsletter. For now, Ubuntu 16.04 (Xenial) is the only edge platform, but at least one more will be added with Qlustar 10.1.

2. New features

2.1. QluMan 10.0

Network management interface
The new QluMan network management interface introduces a flexible way to manage the networking aspects of cluster nodes and also of nodes outside the cluster (e.g. workstations). It provides a clean abstraction between network definitions and their configuration on specific nodes. See the corresponding section of the QluMan Guide for details.
QluNet Messaging Framework
QluNet has been largely redesigned to use Google’s Protocol Buffers as the underlying messaging format. This introduces a high degree of type-safety for the components of a message and as a consequence leads to improved efficiency and stability. With this change, QluMan's messaging stands on firm footing for the future. Along with this redesign, a number of additional improvements have been made to enhance QluMan's messaging reliability.
Miscellaneous
  • Global property and config sets have been introduced as a convenient way of adding properties or configurations that should apply to all nodes.
  • Since Qlustar is free and open-source now, there is no license management/checking anymore. Instead, upon the first start of QluMan, you're required to accept the new licensing terms.
  • Much improved consistency and validity checking of the information entered in the Enclosure View.

2.2. Workstation Management

Qlustar now supports running workstations like cluster nodes. There are four new Qlustar image modules: ws-base, ws-nvidia, ws-xfce and ws-lxde. By defining an image that contains the modules ws-base and ws-xfce, adding it to a Boot Config and assigning the latter to a group of workstations, you can boot them from the Qlustar head-node and provide a unified, clean XFCE desktop experience to your users. While XFCE is the default and suggested desktop environment, LXDE can be used as an alternative.

2.3. Support for OmniPath networks

Qlustar now fully supports the OmniPath network technology. The Qlustar kernel provides the hardware drivers for the OmniPath adapters and up-to-date firmware comes from our linux-firmware package. Our new packages opa-ff and opa-fm provide fabric tools and the fabric manager daemon which is required for message routing in an OmniPath network. Finally, the psm2 user-space library was packaged to allow high-level access to the hardware. Qlustar OpenMPI is built against libpsm2 and hence provides out-of-the-box support for OmniPath networks.
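
As a quick sanity check on a node with OmniPath hardware, one can verify that the host driver is loaded. The following is a minimal sketch, not part of Qlustar: the helper name module_loaded is hypothetical, and it assumes that the driver in question is the upstream Linux hfi1 module and that the module list has the format of /proc/modules.

```shell
# module_loaded NAME [FILE]
# Succeeds if kernel module NAME is listed in FILE (default: /proc/modules).
# Every line of /proc/modules starts with the module name, followed by a space.
# The helper name is hypothetical and only used for illustration.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}
```

A possible use would be: module_loaded hfi1 && echo "OmniPath driver loaded". The check says nothing about the state of the fabric manager, which has to be verified separately.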

2.4. LDAP integration with sssd

The sssd daemon is now used if LDAP integration is required for the cluster. The sssd config file /etc/sssd/sssd.conf of the head-node is automatically imported into generated Qlustar images, if one is found. So all you have to do is configure sssd correctly for the head-node, then all other nodes will have a functioning configuration as well, provided they can reach the LDAP server specified in the sssd configuration file.
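
Since the head-node's sssd.conf is what ends up in the images, it can be worth checking the file before regenerating them. The sketch below is an illustration under assumptions: the helper name sssd_conf_ok is hypothetical, GNU stat is assumed, and the checks (mode 600 and a [sssd] section) reflect what sssd commonly expects rather than anything Qlustar-specific.

```shell
# sssd_conf_ok [FILE]
# Succeeds if FILE (default: /etc/sssd/sssd.conf) exists, has mode 600
# and contains a [sssd] section. Ownership by root is not checked here.
# Hypothetical helper; uses "stat -c" from GNU coreutils.
sssd_conf_ok() {
    f="${1:-/etc/sssd/sssd.conf}"
    [ -f "$f" ] || { echo "missing: $f" >&2; return 1; }
    mode=$(stat -c '%a' "$f") || return 1
    [ "$mode" = "600" ] || { echo "mode is $mode, expected 600: $f" >&2; return 1; }
    grep -q '^\[sssd\]' "$f" || { echo "no [sssd] section: $f" >&2; return 1; }
}
```

Called without arguments on the head-node, the function inspects /etc/sssd/sssd.conf. It does not test whether the LDAP server named in the file is reachable from the nodes.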

3. Major changes

Systemd
Along with Ubuntu 16.04, Qlustar 10 has adopted systemd as the boot system. Most important services have been ported to systemd. One notable change is the use of predictable network interface names.
Netboot node boot process
The change to systemd also required a rethinking of the boot process for netboot nodes. With systemd, it is basically impossible to achieve a clean dynamic customization of the node configuration during the boot process after systemd has started. Hence, we added a pre-systemd boot phase, where most of the dynamic node customization (including disk setup) takes place. During this phase, the QluMan execd of a node is started in a one-shot fashion to connect to the head-node and to receive the dynamic node configuration options. After this phase, systemd is started and the normal boot process commences.
SSH keys
For security reasons, Ubuntu 16.04 and hence Qlustar 10 have disabled ssh dsa keys. If you still have dsa keys (e.g. for root on the head-node), you need to generate a new key (we recommend ed25519 type keys). Don't forget to add the new root public key to the ssh authorized-keys config in QluMan, otherwise password-less login to your cluster nodes will not work.
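
To find out whether dsa keys are still in use, the key type field of public key and authorized_keys files can be inspected, since DSA entries carry the type ssh-dss. The following is a minimal sketch with a hypothetical helper name (count_dsa_entries); it only looks at files you pass to it and knows nothing about the QluMan ssh config.

```shell
# count_dsa_entries FILE
# Prints the number of lines in FILE that contain an ssh-dss (DSA) key.
# The key type may be preceded by authorized_keys options, so it is matched
# at the start of the line or after whitespace.
# Hypothetical helper for illustration only.
count_dsa_entries() {
    grep -cE '(^|[[:space:]])ssh-dss[[:space:]]' "$1"
}
```

For example, count_dsa_entries ~/.ssh/authorized_keys prints 0 if no DSA entries are left in that file.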

4. Major component updates

Kernel 4.9
Qlustar 10.0 is based on the 4.9 LTS kernel series. Qlustar kernels include full meltdown/spectre mitigation. For admins who prefer to run without meltdown/spectre mitigation on internal cluster nodes for performance reasons, the necessary kernel cmdline options (nopti / spectre_v2=off) have been added to the QluMan BootConfig dialog for convenient and straightforward inclusion.
Slurm
Qlustar 10.0 introduces the Slurm 17.02 series with the current version being 17.02.7.
OFED Infiniband stack
Qlustar 10.0 ships the recently released OFED 4.8-2 Infiniband stack including all necessary components.
OpenMPI
Qlustar 10.0 upgrades to OpenMPI 2.0.5 including native support for OmniPath.
Nvidia CUDA
Qlustar 10.0 provides optimal support for Nvidia GPU hardware by supplying pre-compiled and up-to-date kernel drivers as well as CUDA 9.0.
BeeGFS
Qlustar 10.0 has integrated the most recent BeeGFS release 6.18 for clients and servers with ready-to-use image modules. Furthermore, a new startup mechanism has been implemented for clean systemd integration even when using multi-target configurations. As a result, ZFS backed BeeGFS deployments on Qlustar are extremely simple and robust.
Lustre
Qlustar 10.0 ships the most recent Lustre 2.10.3 release for clients and servers with ready-to-use image modules. This allows really easy deployment of this ultra-fast parallel filesystem backed by ZFS on your storage cluster.
ZFS Filesystem
Qlustar 10.0 updates ZFS to version 0.7.5 incorporating a huge number of improvements and bug-fixes.
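
For the kernel cmdline options mentioned in the Kernel entry above (nopti / spectre_v2=off), it can be useful to confirm on a running node that they were actually passed to the kernel. The following is a minimal sketch; the helper name cmdline_has is hypothetical, and it simply matches whole words in /proc/cmdline.

```shell
# cmdline_has OPTION [FILE]
# Succeeds if OPTION appears as a whole word in FILE (default: /proc/cmdline).
# Hypothetical helper for illustration only.
cmdline_has() {
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}
```

A possible use would be: cmdline_has nopti && echo "page table isolation disabled by cmdline". Where the running kernel provides the directory /sys/devices/system/cpu/vulnerabilities, the files in it report the resulting mitigation state.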

5. Other notable package version updates

  • Intel/PGI Compiler support: The Qlustar wrapper packages have been updated to support Intel Parallel Studio 2018 and to fully support the PGI community edition 17.10 (package qlustar-pgi-dev-tools). Corresponding OpenMPI package variants for these compilers are also provided.
  • ganglia: 3.7.2
  • singularity: 2.4.5
    Singularity is now installed by default in generated UnionFS chroots and hence available on all Compute-/FrontEnd nodes.
  • hwloc: 1.11.7
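
After an update, the versions listed in these notes can be compared against what is installed. The sketch below shows one way to do such a comparison; the helper name version_ge is hypothetical and it relies on "sort -V" from GNU coreutils. It compares plain upstream version strings only, so for full Debian package versions (epochs, revisions) dpkg --compare-versions is the more precise tool.

```shell
# version_ge INSTALLED REQUIRED
# Succeeds if INSTALLED sorts at or after REQUIRED in version order.
# Hypothetical helper; relies on "sort -V" from GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}
```

A possible use, assuming the package is simply named hwloc, would be: version_ge "$(dpkg-query -W -f='${Version}' hwloc)" 1.11.7 || echo "hwloc older than expected".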

6. General changes/improvements

  • Node OS image size: The size of generated Qlustar node OS images has been substantially reduced, by up to 250MB for some combinations of image modules.
  • Disk-less mode: For disk-less nodes, additional tmpfs filesystems have been added to make sure that the root RAMdisk doesn't fill up.
  • Installer: Removed torque option. Torque can still be installed manually later on.
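
Regarding the disk-less mode entry above, the fill level of the tmpfs filesystems on a node can be inspected with df. The following is a minimal sketch with a hypothetical helper name (tmpfs_above); it parses "df -P" output from standard input and assumes mount points without spaces.

```shell
# tmpfs_above PERCENT
# Reads "df -P" output on stdin and prints the mount point of every
# filesystem whose capacity (column 5) exceeds PERCENT.
# Hypothetical helper; mount points containing spaces are not handled.
tmpfs_above() {
    awk -v limit="$1" 'NR > 1 { use = $5; sub(/%/, "", use); if (use + 0 > limit + 0) print $6 }'
}
```

With GNU df, a possible use would be: df -P -t tmpfs | tmpfs_above 80, which lists the tmpfs mounts that are more than 80% full.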

7. Update instructions

  1. Preparations

    Upgrading to Qlustar 10.0 is only supported from a 9.2.x release.

    Note

    Make sure that you have no unwritten changes in the QluMan database. If you do, write them to disk as described in the QluMan Guide before proceeding with the update.
  2. Update qluman-common-3

    0 root@cl-head ~ # apt-get update
    0 root@cl-head ~ # apt-get install qluman-common-3
    
  3. Update root ssh keys

    As mentioned above, ssh dsa keys are not usable anymore with Qlustar 10. Hence, if you still rely on a dsa key you need to generate a new one (we suggest ed25519 type):
    0 root@cl-head ~ # ssh-keygen -t ed25519 -N ''
    
    Add the content of ~/.ssh/id_ed25519.pub to the ssh authorized-keys config in QluMan and remove the old ssh-dss (DSA) entries. Then write the ssh config.
  4. Stop execd on nodes

    Since the QluNet messaging protocol has changed in Qlustar 10, old execds won't be able to connect to the new qluman-router daemon, but will keep trying and hence will overload qluman-router with bogus requests. To prevent this, execd should be stopped on all netboot nodes before upgrading.
  5. Clone chroots

    It is advisable to clone existing Ubuntu 14.04 chroots (e.g. trusty -> xenial) and then upgrade the clones to 16.04.
  6. Update to 10.0 package sources list

    The Qlustar apt sources list needs to be changed as follows both on the head-node(s) and in all chroot(s) that should be updated.
    0 root@cl-head ~ # apt-get update
    0 root@cl-head ~ # apt-get install qlustar-sources-list-10.0
    
  7. Update packages

    Now proceed as explained in the Qlustar Administration Manual. However, if the OFED package is installed, you should manually execute the following commands before executing apt-get dist-upgrade:
    0 root@cl-head ~ # apt install ofed
    0 root@cl-head ~ # dpkg --purge libcxgb4
    0 root@cl-head ~ # apt-get -f install
    
    Now purge two other packages that will cause trouble if installed.
    0 root@cl-head ~ # for p in libcxgb3-1 texlive-latex-recommended-doc; do
      dpkg -l | grep -q "$p" && dpkg --purge "$p"
    done
    
    After this the normal update procedure should work fine.
  8. Enable qluman systemd services

    0 root@cl-head ~ # for srv in server router slurmd dhcpscanner execd; do \
      systemctl enable qluman-${srv}.service; done
    
  9. Enable other necessary systemd services

    0 root@cl-head ~ # systemctl enable rpc-statd
    
    If you have a VM FrontEnd node:
    0 root@cl-head ~ # systemctl enable qlustar-fe-node-vm
    
  10. Fix ganglia/jobmond

    0 root@cl-head ~ # apt install libapache2-mod-php7.0 qlustar-ganglia-theme
    0 root@cl-head ~ # sed -i -e 's|lib/ganglia|lib/x86_64-linux-gnu/ganglia|' \
      /etc/ganglia/gmond.conf
    0 root@cl-head ~ # rm -f /etc/jobmond.conf
    
  11. Reboot head-node(s)

    Initially only reboot the head-node(s).
  12. Check that MariaDB is running

    On older installations, it might be necessary to clean up the MariaDB boot scripts. If MariaDB is not running and /etc/init.d/mysql is a symbolic link, proceed as follows:
    0 root@cl-head ~ # rm -f /etc/init.d/mysql
    0 root@cl-head ~ # mv /etc/init.d/mysql.dpkg-new /etc/init.d/mysql
    
    Then reboot and check that MariaDB has started automatically.
  13. Regenerating Qlustar images

    Regenerate your Qlustar images with the 10.0 image modules. To accomplish this, you'll have to select Version 10.0 in the QluMan Qlustar Images dialog. If you have new cloned chroots, select those as well.
  14. Write config file changes

    To activate all changes in the QluMan database that were introduced by the update, they need to be written to disk now. See the QluMan Guide for how to write such changes. A number of changes (diffs) in the following configs are expected due to reformatting: DHCP and dsh indentation, NIS hosts, ssh_known_hosts and shosts.equiv, Nagios.
    Due to the new QluMan networking configuration setup, head-node hosts will be removed from NIS hosts and ssh headers. Also note that the path for CgroupMountpoint in the Slurm cgroup.conf has changed to /sys/fs/cgroup and the CgroupReleaseAgentDir parameter has become obsolete.
    If you have netboot nodes with a special network setup, e.g. a FrontEnd node that has an external interface, you will have to add the additional network information manually, as explained in the network management section of the QluMan Guide. Only the basic networks, i.e. the boot, IB and IPMI networks, are automatically migrated to the new framework.
  15. Reboot all netboot nodes

    After the regeneration of the images is complete, and all the above steps have been done, you can reboot all other nodes in the cluster, including virtual FE nodes. This completes the update procedure.
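
Two of the conditions described in the steps above lend themselves to a scripted check: the stale MariaDB init script from step 12 and the changed Slurm cgroup.conf parameters from step 14. The following is a minimal sketch, not part of Qlustar. Both helper names are hypothetical. The first one additionally assumes that the mysql.dpkg-new file used in step 12 must exist. The second one matches the parameter names exactly as spelled in these notes and ignores inline comments.

```shell
# mysql_initscript_stale [DIR]
# Succeeds if DIR/mysql (default DIR: /etc/init.d) is a symbolic link and
# DIR/mysql.dpkg-new exists, i.e. the file layout that step 12 repairs.
# It does not check whether MariaDB is running.
mysql_initscript_stale() {
    d="${1:-/etc/init.d}"
    [ -L "$d/mysql" ] && [ -e "$d/mysql.dpkg-new" ]
}

# cgroup_conf_warnings FILE
# Prints one line for every setting in FILE that step 14 lists as changed:
# a CgroupMountpoint other than /sys/fs/cgroup, or any CgroupReleaseAgentDir.
cgroup_conf_warnings() {
    awk -F= '
        /^[[:space:]]*#/ { next }
        { key = $1; val = $2; gsub(/[[:space:]]/, "", key); gsub(/[[:space:]]/, "", val) }
        key == "CgroupMountpoint" && val != "/sys/fs/cgroup" { print "CgroupMountpoint is " val }
        key == "CgroupReleaseAgentDir" { print "CgroupReleaseAgentDir is obsolete" }
    ' "$1"
}
```

On the head-node, mysql_initscript_stale can be called without arguments once you have established that MariaDB is not running. cgroup_conf_warnings takes the path of the written cgroup.conf, which depends on your Slurm setup, and prints nothing if the file already matches the new settings.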

8. Changelogs

A detailed log of changes in the image modules can be found in the directories /usr/share/doc/qlustar-module-<module-name>-*-amd64-10.0.0. As an example, in the directory /usr/share/doc/qlustar-module-core-xenial-amd64-10.0.0 you will find a summary changelog in core.changelog, a complete list of the packages (with version numbers) that make up the current core module in core.packages.version.gz, a complete changelog of the core module's package versions in core.packages.version.gz and finally a complete log of changed files in core.contents.changelog.gz.
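
To get an overview of which changelog files are present for the installed image modules, the directories named above can be listed in a loop. The following is a minimal sketch; the helper name list_module_docs is hypothetical and the directory pattern is taken from the paragraph above.

```shell
# list_module_docs [BASE]
# Prints every qlustar-module-*-amd64-10.0.0 directory below BASE
# (default: /usr/share/doc), followed by an indented list of its files.
# Hypothetical helper for illustration only.
list_module_docs() {
    base="${1:-/usr/share/doc}"
    for d in "$base"/qlustar-module-*-amd64-10.0.0; do
        [ -d "$d" ] || continue
        echo "$d:"
        ls "$d" | sed 's/^/  /'
    done
}
```

The compressed files can then be read with zcat, for example: zcat /usr/share/doc/qlustar-module-core-xenial-amd64-10.0.0/core.contents.changelog.gz | less.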