SLURM for VE

We have developed an initial implementation of SLURM for VE using the SLURM gres feature and NEC MPI for SLURM. Currently we have the following restrictions:

* Only pure VE jobs are supported.
* Jobs with VE offloading or VH calls are not supported.
* The --overcommit option shall always be specified.
* An identical number of VE nodes is always allocated on every VH allocated for a job.
* Non-uniform process distribution is not supported; that is, VE processes are always distributed uniformly onto VHs and then onto the VE nodes on each VH.

Getting started

(A) System Requirements

(a) The following operating systems have been tested:
    CentOS Linux 7.6
    CentOS Linux 8.3
    Red Hat Enterprise Linux 8.4

(b) One server host and one or more execution hosts, each with VE nodes.
    The server host manages all the execution hosts on which jobs are executed. The server host may also serve as an execution host.

(B) Installation

(a) Server Host

Perform all the steps as the root user unless otherwise stated.

(1) Install SLURM version 20.11.5, referring to https://slurm.schedmd.com/quickstart_admin.html

(2) Stop SLURM, if it is running, as follows:
$ systemctl stop slurmctld
$ systemctl stop slurmdbd

(3) Uninstall the SLURM package as follows, which also uninstalls the packages depending on it:
$ yum remove slurm
Note that the package slurm-example-configs shall not be removed.

(4) Create the SLURM for VE rpm files from the file slurm-20.11.5.tar.bz2 as a normal user:
$ rpmbuild -ta slurm-20.11.5.tar.bz2

(5) Install the rpm files created in the previous step:
$ cd ~/rpmbuild/RPMS/x86_64/
$ yum localinstall slurm-20.11.5-1.el7.x86_64.rpm
$ yum localinstall slurm-slurmctld-20.11.5-1.el7.x86_64.rpm
$ yum localinstall slurm-slurmdbd-20.11.5-1.el7.x86_64.rpm
$ yum localinstall slurm-perlapi-20.11.5-1.el7.x86_64.rpm

(6) Start SLURM:
$ systemctl start slurmctld
$ systemctl start slurmdbd

(b) Execution Hosts

Perform all the steps as the root user unless otherwise stated.

(1) Install SLURM version 20.11.5, referring to https://slurm.schedmd.com/quickstart_admin.html

(2) Stop SLURM, if it is running, as follows:
$ systemctl stop slurmd

(3) Uninstall the SLURM package as follows, which also uninstalls the packages depending on it:
$ yum remove slurm
Note that the package slurm-example-configs shall not be removed.

(4) Create the SLURM for VE rpm files from the file slurm-20.11.5.tar.bz2 as a normal user:
$ rpmbuild -ta slurm-20.11.5.tar.bz2

(5) Install the rpm files created in the previous step:
$ cd ~/rpmbuild/RPMS/x86_64/
$ yum localinstall slurm-20.11.5-1.el7.x86_64.rpm
$ yum localinstall slurm-slurmd-20.11.5-1.el7.x86_64.rpm
$ yum localinstall slurm-perlapi-20.11.5-1.el7.x86_64.rpm

(6) Start SLURM:
$ systemctl start slurmd
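As an optional sanity check, the daemons and the installed version can be confirmed with standard systemd and SLURM commands; the unit names are the ones used in the start commands above.

$ systemctl status slurmctld slurmdbd   # on the server host
$ systemctl status slurmd               # on each execution host
$ sinfo --version                       # should report slurm 20.11.5
$ scontrol ping                         # confirms that slurmctld responds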
(C) Configuration

Perform all the steps as the root user on every host. In the following example, it is assumed that the SLURM complex consists of two execution hosts, vhost1 and vhost2, one of which also serves as the server host.

(1) Obtain the number of VE nodes and the corresponding device file names.
In the following example, eight VE nodes exist, which correspond to the device files /dev/veslot[0-7].

$ ls -ld /dev/ve*
crw-rw-rw-. 1 root root 238, 0 Jul 28 13:45 /dev/ve0
crw-rw-rw-. 1 root root 238, 1 Jul 28 13:45 /dev/ve1
crw-rw-rw-. 1 root root 238, 2 Jul 28 13:45 /dev/ve2
crw-rw-rw-. 1 root root 238, 3 Jul 28 13:45 /dev/ve3
crw-rw-rw-. 1 root root 238, 4 Jul 28 13:45 /dev/ve4
crw-rw-rw-. 1 root root 238, 5 Jul 28 13:45 /dev/ve5
crw-rw-rw-. 1 root root 238, 6 Jul 28 13:45 /dev/ve6
crw-rw-rw-. 1 root root 238, 7 Jul 28 13:45 /dev/ve7
crw-------. 1 root root 234, 0 Jul 28 13:45 /dev/ve_peermem
lrwxrwxrwx. 1 root root 3 Jul 28 13:45 /dev/veslot0 -> ve0
lrwxrwxrwx. 1 root root 3 Jul 28 13:45 /dev/veslot1 -> ve1
lrwxrwxrwx. 1 root root 3 Jul 28 13:45 /dev/veslot2 -> ve3
lrwxrwxrwx. 1 root root 3 Jul 28 13:45 /dev/veslot3 -> ve2
lrwxrwxrwx. 1 root root 3 Jul 28 13:45 /dev/veslot4 -> ve4
lrwxrwxrwx. 1 root root 3 Jul 28 13:45 /dev/veslot5 -> ve6
lrwxrwxrwx. 1 root root 3 Jul 28 13:45 /dev/veslot6 -> ve7
lrwxrwxrwx. 1 root root 3 Jul 28 13:45 /dev/veslot7 -> ve5

(2) Obtain the number of Host Channel Adapters (HCAs) and the corresponding device name definition file names.
In the following example, two HCAs exist, which correspond to the device files /dev/infiniband/uverbs[0-1]. The device files correspond to the device name definition files /sys/class/infiniband_verbs/uverbs0/ibdev and /sys/class/infiniband_verbs/uverbs1/ibdev.

$ ls -ld /dev/infiniband/*
crw-------. 1 root root 231, 64 Jul 28 13:45 /dev/infiniband/issm0
crw-------. 1 root root 231, 65 Jul 28 13:45 /dev/infiniband/issm1
crw-rw-rw-. 1 root root 10, 56 Jul 28 13:45 /dev/infiniband/rdma_cm
crw-rw-rw-. 1 root root 231, 224 Jul 28 13:45 /dev/infiniband/ucm0
crw-rw-rw-. 1 root root 231, 225 Jul 28 13:45 /dev/infiniband/ucm1
crw-------. 1 root root 231, 0 Jul 28 13:45 /dev/infiniband/umad0
crw-------. 1 root root 231, 1 Jul 28 13:45 /dev/infiniband/umad1
crw-rw-rw-. 1 root root 231, 192 Jul 28 13:45 /dev/infiniband/uverbs0
crw-rw-rw-. 1 root root 231, 193 Jul 28 13:45 /dev/infiniband/uverbs1

(3) Edit the file /etc/slurm/slurm.conf.

(3-1) Add the GresTypes line with the value "ve,hca".

GresTypes=ve,hca

(3-2) Add NodeName lines defining the value of the parameter Gres as the numbers of VE nodes and HCAs for all the execution hosts.

NodeName=vhost1 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=128828 Gres=ve:10b:8,hca:2 State=UNKNOWN
NodeName=vhost2 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=128828 Gres=ve:10b:8,hca:2 State=UNKNOWN

Set the values of the parameters other than Gres according to the output of the slurmd -C command. Refer to https://slurm.schedmd.com/slurm.conf.html for the details of the parameters.

(4) Edit the file /etc/slurm/gres.conf.

(4-1) Add NodeName lines defining the VE node device file names for all the execution hosts.

NodeName=vhost1 Name=ve Type=10b File=/dev/veslot[0-7]
NodeName=vhost2 Name=ve Type=10b File=/dev/veslot[0-7]

(4-2) Add NodeName lines defining the HCA device name definition file names for all the execution hosts.

NodeName=vhost1 Name=hca File=/sys/class/infiniband_verbs/uverbs0/ibdev
NodeName=vhost1 Name=hca File=/sys/class/infiniband_verbs/uverbs1/ibdev
NodeName=vhost2 Name=hca File=/sys/class/infiniband_verbs/uverbs0/ibdev
NodeName=vhost2 Name=hca File=/sys/class/infiniband_verbs/uverbs1/ibdev

Refer to https://slurm.schedmd.com/gres.conf.html for the details of the parameters.

(5) Restart SLURM:
$ systemctl restart slurmd

(6) Confirm that Gres is properly configured with the command slurmd -G.

[root@vhost1 ~]# /usr/sbin/slurmd -G
slurmd: Gres Name=ve Type=10b Count=8 Index=0 ID=25974 File=/dev/veslot[0-7] (null)
slurmd: Gres Name=hca Type=(null) Count=1 Index=0 ID=6382440 File=/sys/class/infiniband_verbs/uverbs0/ibdev (null)
slurmd: Gres Name=hca Type=(null) Count=1 Index=0 ID=6382440 File=/sys/class/infiniband_verbs/uverbs1/ibdev (null)

[root@vhost2 ~]# /usr/sbin/slurmd -G
slurmd: Gres Name=ve Type=10b Count=8 Index=0 ID=25974 File=/dev/veslot[0-7] (null)
slurmd: Gres Name=hca Type=(null) Count=1 Index=0 ID=6382440 File=/sys/class/infiniband_verbs/uverbs0/ibdev (null)
slurmd: Gres Name=hca Type=(null) Count=1 Index=0 ID=6382440 File=/sys/class/infiniband_verbs/uverbs1/ibdev (null)

Refer to https://slurm.schedmd.com/slurmd.html for the details of the command slurmd.
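Optionally, the Gres configuration registered with the controller can also be checked from the server host with the standard SLURM commands sinfo and scontrol (vhost1 is the example host name used above):

$ sinfo -N -o "%N %G"
$ scontrol show node vhost1 | grep -i Gres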
(D) Usage

The following #SBATCH options are relevant to VE in SLURM job script files, where <nnodes>, <ntasks>, and <nves> denote user-specified values:

--overcommit (Mandatory)
--nodes=<nnodes> (specifies the number of VHs, which shall be greater than one; the --nodes option shall not be specified for one-VH execution, as in the sketch after this list)
--ntasks=<ntasks> (specifies the total number of VE processes in a job)
--gres=ve:[10b:]<nves> (specifies the number of VE nodes on each VH)
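For one-VH execution, a job script might look like the following sketch, in which --nodes is omitted. The executable name ve.out is only a placeholder, and the NEC MPI path shall match the installed version.

------------------------------------------------------------------------
#!/bin/bash
#SBATCH --overcommit
#SBATCH --ntasks=8        # eight VE processes in total
#SBATCH --gres=ve:2       # two VE nodes on the single VH

# Specifies a version of NEC MPI supporting SLURM for VE
source /opt/nec/ve/mpi/3.0.0/bin/necmpivars.sh

cd /usr/uhome/aurora/work

# ve.out is a placeholder for the user's VE executable
mpirun -v -np ${SLURM_NTASKS} ./ve.out
------------------------------------------------------------------------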

The number of VE processes assigned to each allocated VH is determined as follows. Let <r> be the remainder of <ntasks> divided by <nnodes>, and let floor(<ntasks>/<nnodes>) be the largest integer less than or equal to the quotient of <ntasks> divided by <nnodes>. Then the first <r> VHs get floor(<ntasks>/<nnodes>) + 1 processes each, and the remaining VHs get floor(<ntasks>/<nnodes>) processes each.

The number of VE processes assigned to each VE node allocated for a job on a VH is determined as follows. Let <p> be the number of VE processes assigned to the VH, let ceil(<p>/<nves>) be the smallest integer greater than or equal to the quotient of <p> divided by <nves>, let <q> be the largest integer less than or equal to the quotient of <p> divided by ceil(<p>/<nves>), and let <s> be the remainder of <p> divided by ceil(<p>/<nves>). Then the first <q> VE nodes get ceil(<p>/<nves>) processes each, the next VE node gets <s> processes, and the rest get zero.
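The following shell sketch illustrates these two rules with the same values as the example below (<ntasks>=72, <nnodes>=3, <nves>=8). It is only an illustration; it does not query SLURM.

------------------------------------------------------------------------
#!/bin/bash
# Illustrative values; in a real job they come from --ntasks, --nodes and --gres.
ntasks=72    # total number of VE processes
nnodes=3     # number of VHs
nves=8       # VE nodes per VH

r=$(( ntasks % nnodes ))    # remainder of <ntasks> divided by <nnodes>
f=$(( ntasks / nnodes ))    # floor(<ntasks>/<nnodes>)

for (( vh = 0; vh < nnodes; vh++ )); do
    # The first <r> VHs get one extra process.
    if (( vh < r )); then p=$(( f + 1 )); else p=$f; fi
    if (( p == 0 )); then echo "VH $vh: 0 processes"; continue; fi

    c=$(( (p + nves - 1) / nves ))   # ceil(<p>/<nves>)
    q=$(( p / c ))                   # VE nodes that get c processes each
    s=$(( p % c ))                   # processes on the next VE node (may be 0)
    echo "VH $vh: $p processes -> $q VE node(s) x $c, next VE node x $s, rest 0"
done
------------------------------------------------------------------------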

For example, the following job script allocates three VHs with eight VE nodes each and launches 24 processes onto the eight VE nodes allocated on each VH, resulting in three processes executing on each VE node.

------------------------------------------------------------------------
#!/bin/bash
#SBATCH --overcommit
#SBATCH --nodes=3
#SBATCH --ntasks=72
#SBATCH --gres=ve:8

# Specifies a version of NEC MPI supporting SLURM for VE
source /opt/nec/ve/mpi/3.0.0/bin/necmpivars.sh

cd /usr/uhome/aurora/work

# The value of the environment variable SLURM_NTASKS is the same as that
# specified in the --ntasks option
mpirun -v -np ${SLURM_NTASKS} ./execvehime.sh
------------------------------------------------------------------------
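If the script above is saved as, for example, job.sh, it can be submitted and monitored with the standard SLURM commands:

$ sbatch job.sh
$ squeue -u $USER
$ scancel <jobid>    # if the job needs to be cancelled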