SLURM for VE

We have developed an initial implementation of SLURM for VE using the gres feature of SLURM and NEC MPI for SLURM. Currently we have the following restrictions:

 * Only pure VE jobs are supported.
 * Jobs with VE offloading or VH calls are not supported.
 * The --overcommit option shall always be specified.
 * An identical number of VE nodes is always allocated on every VH allocated for a job.
 * Non-uniform process distribution is not supported; that is, VE processes are always
   distributed uniformly onto VHs and then onto the VE nodes on each VH.

Getting started

(A) System Requirements

(a) The following operating systems have been tested:
    CentOS Linux 7.6
    CentOS Linux 8.3
    Red Hat Enterprise Linux 8.4

(b) One server host and one or more execution hosts, each equipped with VE nodes.
    The server host manages all the execution hosts on which jobs are executed.
    The server host can also serve as an execution host.

(B) Installation

(a) Server Host
Perform all the steps as the root user unless otherwise stated.

(1) Install SLURM version 22.05.2, referring to
    https://slurm.schedmd.com/quickstart_admin.html

(2) Stop SLURM, if it is running, as follows:
    # systemctl stop slurmctld
    # systemctl stop slurmdbd

(3) Uninstall the SLURM package as follows, which causes packages depending on it to be
    uninstalled as well:
    # yum remove slurm
    Note that the package slurm-example-configs shall not be removed.

(4) Create the SLURM for VE rpm files from the file slurm-22.05.2.tar.bz2 as a normal user:
    $ rpmbuild -ta slurm-22.05.2.tar.bz2 --with ve

(5) Install the rpm files created in the previous step:
    # cd ~/rpmbuild/RPMS/x86_64/
    # yum localinstall slurm-22.05.2-1.el7.x86_64.rpm
    # yum localinstall slurm-slurmctld-22.05.2-1.el7.x86_64.rpm
    # yum localinstall slurm-slurmdbd-22.05.2-1.el7.x86_64.rpm
    # yum localinstall slurm-perlapi-22.05.2-1.el7.x86_64.rpm
    # yum localinstall slurm-for_ve-22.05.2-1.el7.x86_64.rpm

(6) Start SLURM:
    # systemctl start slurmctld
    # systemctl start slurmdbd

(b) Execution Hosts
Perform all the steps as the root user unless otherwise stated.

(1) Install SLURM version 22.05.2, referring to
    https://slurm.schedmd.com/quickstart_admin.html

(2) Stop SLURM, if it is running, as follows:
    # systemctl stop slurmd

(3) Uninstall the SLURM package as follows, which causes packages depending on it to be
    uninstalled as well:
    # yum remove slurm
    Note that the package slurm-example-configs shall not be removed.

(4) Create the SLURM for VE rpm files from the file slurm-22.05.2.tar.bz2 as a normal user:
    $ rpmbuild -ta slurm-22.05.2.tar.bz2 --with ve

(5) Install the rpm files created in the previous step:
    # cd ~/rpmbuild/RPMS/x86_64/
    # yum localinstall slurm-22.05.2-1.el7.x86_64.rpm
    # yum localinstall slurm-slurmd-22.05.2-1.el7.x86_64.rpm
    # yum localinstall slurm-perlapi-22.05.2-1.el7.x86_64.rpm
    # yum localinstall slurm-for_ve-22.05.2-1.el7.x86_64.rpm

(6) Start SLURM:
    # systemctl start slurmd
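Before moving on to the configuration below, it can be useful to confirm that the daemons
came up. The following quick check is not part of the packaged procedure; it only uses
standard systemd and SLURM commands:

    # systemctl is-active slurmctld slurmdbd    (on the server host)
    # systemctl is-active slurmd                (on each execution host)
    # scontrol ping                             (confirms that the controller responds)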
(C) Configuration

Perform all the steps as the root user on every host. In the following example, it is
assumed that the SLURM cluster consists of two execution hosts, vhost1 and vhost2, one of
which also serves as the server host.

(1) Obtain the number of VE nodes and the corresponding device file names.
    In the following example, eight VE nodes exist, which correspond to the device files
    /dev/veslot[0-7].

    $ ls -ld /dev/ve*
    crw-rw-rw-. 1 root root 238,   0 Jul 28 13:45 /dev/ve0
    crw-rw-rw-. 1 root root 238,   1 Jul 28 13:45 /dev/ve1
    crw-rw-rw-. 1 root root 238,   2 Jul 28 13:45 /dev/ve2
    crw-rw-rw-. 1 root root 238,   3 Jul 28 13:45 /dev/ve3
    crw-rw-rw-. 1 root root 238,   4 Jul 28 13:45 /dev/ve4
    crw-rw-rw-. 1 root root 238,   5 Jul 28 13:45 /dev/ve5
    crw-rw-rw-. 1 root root 238,   6 Jul 28 13:45 /dev/ve6
    crw-rw-rw-. 1 root root 238,   7 Jul 28 13:45 /dev/ve7
    crw-------. 1 root root 234,   0 Jul 28 13:45 /dev/ve_peermem
    lrwxrwxrwx. 1 root root        3 Jul 28 13:45 /dev/veslot0 -> ve0
    lrwxrwxrwx. 1 root root        3 Jul 28 13:45 /dev/veslot1 -> ve1
    lrwxrwxrwx. 1 root root        3 Jul 28 13:45 /dev/veslot2 -> ve3
    lrwxrwxrwx. 1 root root        3 Jul 28 13:45 /dev/veslot3 -> ve2
    lrwxrwxrwx. 1 root root        3 Jul 28 13:45 /dev/veslot4 -> ve4
    lrwxrwxrwx. 1 root root        3 Jul 28 13:45 /dev/veslot5 -> ve6
    lrwxrwxrwx. 1 root root        3 Jul 28 13:45 /dev/veslot6 -> ve7
    lrwxrwxrwx. 1 root root        3 Jul 28 13:45 /dev/veslot7 -> ve5

(2) Obtain the number of Host Channel Adapters (HCAs) and the corresponding device name
    definition file names.
    In the following example, two HCAs exist, which correspond to the device files
    /dev/infiniband/uverbs[0-1]. These device files correspond to the device name
    definition files /sys/class/infiniband_verbs/uverbs0/ibdev and
    /sys/class/infiniband_verbs/uverbs1/ibdev.

    $ ls -ld /dev/infiniband/*
    crw-------. 1 root root 231,  64 Jul 28 13:45 /dev/infiniband/issm0
    crw-------. 1 root root 231,  65 Jul 28 13:45 /dev/infiniband/issm1
    crw-rw-rw-. 1 root root  10,  56 Jul 28 13:45 /dev/infiniband/rdma_cm
    crw-rw-rw-. 1 root root 231, 224 Jul 28 13:45 /dev/infiniband/ucm0
    crw-rw-rw-. 1 root root 231, 225 Jul 28 13:45 /dev/infiniband/ucm1
    crw-------. 1 root root 231,   0 Jul 28 13:45 /dev/infiniband/umad0
    crw-------. 1 root root 231,   1 Jul 28 13:45 /dev/infiniband/umad1
    crw-rw-rw-. 1 root root 231, 192 Jul 28 13:45 /dev/infiniband/uverbs0
    crw-rw-rw-. 1 root root 231, 193 Jul 28 13:45 /dev/infiniband/uverbs1

(3) Edit the file /etc/slurm/slurm.conf.

(3-1) Add the GresTypes line with the value "ve,hca".
    GresTypes=ve,hca

(3-2) Add NodeName lines that define the Gres parameter with the numbers of VE nodes and
    HCAs for all the execution hosts.
    NodeName=vhost1 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=128828 Gres=ve:10b:8,hca:2 State=UNKNOWN
    NodeName=vhost2 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=128828 Gres=ve:10b:8,hca:2 State=UNKNOWN
    Set the values of the parameters other than Gres according to the output of the
    slurmd -C command. Refer to https://slurm.schedmd.com/slurm.conf.html for the details
    of the parameters.

(4) Edit the file /etc/slurm/gres.conf.

(4-1) Add NodeName lines that define the VE node device file names for all the execution
    hosts.
    NodeName=vhost1 Name=ve Type=10b File=/dev/veslot[0-7]
    NodeName=vhost2 Name=ve Type=10b File=/dev/veslot[0-7]

(4-2) Add NodeName lines that define the HCA device file names for all the execution
    hosts.
    NodeName=vhost1 Name=hca File=/dev/infiniband/uverbs0
    NodeName=vhost1 Name=hca File=/dev/infiniband/uverbs1
    NodeName=vhost2 Name=hca File=/dev/infiniband/uverbs0
    NodeName=vhost2 Name=hca File=/dev/infiniband/uverbs1
    Refer to https://slurm.schedmd.com/gres.conf.html for the details of the parameters.
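The Gres counts and file lists used in steps (3) and (4) can be derived directly from the
device files found in steps (1) and (2). The following small helper is illustrative only
(it is not shipped with the package) and assumes the /dev/veslot* naming and the 10b type
label shown in the examples above:

    ------------------------------------------------------------------------
    #!/bin/bash
    # Illustrative only: print the Gres string and the HCA device names for
    # this host, based on the device files enumerated in steps (1) and (2).
    nve=$(ls -d /dev/veslot* 2>/dev/null | wc -l)
    nhca=$(ls -d /sys/class/infiniband_verbs/uverbs* 2>/dev/null | wc -l)
    echo "Gres=ve:10b:${nve},hca:${nhca}"

    # The ibdev file under each uverbs directory holds the HCA device name.
    for d in /sys/class/infiniband_verbs/uverbs*; do
        echo "${d##*/}: $(cat "$d/ibdev")"
    done
    ------------------------------------------------------------------------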
(5) Configuration for the VE environment.
    One core per VE must be reserved for VEOS to run VE jobs. To reserve these cores, use
    SLURM's Core Specialization feature. The following describes the setting procedure
    using an example environment in which 2 sockets, 40 CPU cores, and 8 VEs are installed
    on a VH. For more information on the Core Specialization feature, see
    https://slurm.schedmd.com/core_spec.html.

(5-1) Make the following settings in /etc/slurm/slurm.conf.

    - Core specialization plugin
      CoreSpecPlugin=core_spec/none

    - Resource selection plugin
      SelectType=select/cons_res
      or
      SelectType=select/cons_tres

    - Task launch plugin
      TaskPlugin=task/cgroup
      TaskPluginParam=SlurmdOffSpec

    - Option that controls whether individual jobs may override the node's configured
      CoreSpecCount value
      AllowSpecResourcesUsage=NO
      or do not write this option (because "NO" is applied by default).

    - Cores reserved for VEOS
      Specify CpuSpecList on the line containing the resource information of the compute
      node in /etc/slurm/slurm.conf. The following is an example of reserving 8 cores
      starting from the lowest core number. Note that there is no particular restriction
      on which core numbers are reserved; any numbers may be used.

      Example setting for each VH node:
      NodeName=vhost1 CPUs=40 CpuSpecList=0,1,2,3,4,5,6,7 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=128828 Gres=ve:10b:8,hca:2 State=UNKNOWN

      Example setting for multiple VH nodes at once:
      NodeName=vhost[0-255] CPUs=40 CpuSpecList=0,1,2,3,4,5,6,7 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=128828 Gres=ve:10b:8,hca:2 State=UNKNOWN

      Note:
      1) Core numbers can also be given as ranges using hyphens, such as 0-7.
      2) CoreSpecCount is applied per socket, so if a certain total number of cores is to
         be reserved for VEOS, a different CoreSpecCount value must be set depending on
         the number of sockets on the machine. Using CoreSpecCount is not recommended
         because it is easy to get the number of reserved cores wrong.

(5-2) Make the following settings in /etc/slurm/cgroup.conf.

    - Option to constrain allowed cores to the subset of allocated resources
      ConstrainCores=yes

    - Option to constrain allowed devices
      ConstrainDevices=no
      or do not write this option (because "no" is applied by default).

(6) Configuration for cleaning up NEC MPI shared memory.
    When SLURM executes an NEC MPI job, the shared memory allocated by NEC MPI may not be
    released. In that case, SLURM must release the shared memory at the end of the job.
    Configure the following setting to clean up the shared memory.

    Setting:
    Set the fully qualified pathname of a script that cleans up NEC MPI shared memory on
    every node when a user's job completes. Edit the file /etc/slurm/slurm.conf as
    follows:

    $ cat /etc/slurm/slurm.conf | grep Epilog
    Epilog=/etc/slurm/relmpimem.sh

    Notes:
    The file /etc/slurm/relmpimem.sh is a script that cleans up NEC MPI shared memory.
    This script is included in the SLURM for VE rpm file. If you want to run this script
    together with other scripts when a job completes, set this script to run first.

    (e.g.) User's other epilog script: /usr/local/slurm/myepilog.sh

    $ cat /etc/slurm/slurm.conf | grep Epilog
    Epilog=/usr/local/slurm/myepilog.sh

    $ cat /usr/local/slurm/myepilog.sh
    #!/bin/bash
    # run the script to clean up NEC MPI shared memory first.
    /etc/slurm/relmpimem.sh
    #
    # then run the user's various epilog programs.
    #
    exit 0

(7) Restart SLURM:
    # systemctl restart slurmctld
    # systemctl restart slurmd
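With more than a couple of execution hosts, the restart in step (7) can be scripted from
the server host. The following is only a sketch and assumes that root can ssh without a
password to the execution hosts vhost1 and vhost2 used in this example:

    ------------------------------------------------------------------------
    #!/bin/bash
    # Illustrative only: restart the controller locally, then restart slurmd
    # on each execution host listed below.
    systemctl restart slurmctld
    for h in vhost1 vhost2; do
        ssh "$h" systemctl restart slurmd
    done
    ------------------------------------------------------------------------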
(8) Confirm the settings you made above.

(8-1) Confirm that Gres is properly configured with the command slurmd -G.

    [root@vhost1 ~]# /usr/sbin/slurmd -G
    slurmd: Gres Name=ve Type=10b Count=8 Index=0 ID=25974 File=/dev/veslot[0-7] (null)
    slurmd: Gres Name=hca Type=(null) Count=1 Index=0 ID=6382440 File=/dev/infiniband/uverbs0 (null)
    slurmd: Gres Name=hca Type=(null) Count=1 Index=0 ID=6382440 File=/dev/infiniband/uverbs1 (null)

    [root@vhost2 ~]# /usr/sbin/slurmd -G
    slurmd: Gres Name=ve Type=10b Count=8 Index=0 ID=25974 File=/dev/veslot[0-7] (null)
    slurmd: Gres Name=hca Type=(null) Count=1 Index=0 ID=6382440 File=/dev/infiniband/uverbs0 (null)
    slurmd: Gres Name=hca Type=(null) Count=1 Index=0 ID=6382440 File=/dev/infiniband/uverbs1 (null)

    Please refer to https://slurm.schedmd.com/slurmd.html for the details of the command
    slurmd.

(8-2) Confirm the slurm.conf and cgroup.conf settings on the server host with scontrol
    show config.

    $ scontrol show config
    AllowSpecResourcesUsage = No
    CoreSpecPlugin          = core_spec/none
    SelectType              = select/cons_tres
    TaskPlugin              = task/cgroup
    TaskPluginParam         = (null type)   # SlurmdOffSpec does not appear in the results

    Cgroup Support Configuration:
    ConstrainCores          = yes
    ConstrainDevices        = no

(8-3) Confirm the cores reserved for VEOS with scontrol show node.

    $ scontrol show node
    NodeName=vhost1 CoresPerSocket=10 ...
       CoreSpecCount=4 CPUSpecList=0-7 ...

    Note: CoreSpecCount is reported per socket on the VH; since the above example has
    2 sockets and 8 reserved cores, the value per socket is 4.

(D) Usage

The following #SBATCH options are relevant to VE in SLURM job script files:

    --overcommit
        (Mandatory)
    --nodes=<nnodes>
        (<nnodes> specifies the number of VHs, which shall be greater than one.
         The --nodes option shall not be specified for single-VH execution.)
    --ntasks=<ntasks>
        (<ntasks> specifies the total number of VE processes in a job.)
    --gres=ve:[10b:]<nve>
        (<nve> specifies the number of VE nodes on each VH.)

The number of VE processes assigned to each allocated VH is determined as follows:
Let <r> be the remainder of <ntasks> divided by <nnodes>, and floor(<ntasks>,<nnodes>) be
the maximum integer less than or equal to the quotient of <ntasks> divided by <nnodes>.
Then the first <r> VHs get floor(<ntasks>,<nnodes>) + 1 processes and the rest get
floor(<ntasks>,<nnodes>).

The number of VE processes assigned to each VE node allocated for a job on a VH is
determined as follows:
Let <p> be the number of VE processes assigned to the VH, ceil(<p>,<nve>) be the minimum
integer greater than or equal to the quotient of <p> divided by <nve>, <m> be the maximum
integer less than or equal to the quotient of <p> divided by ceil(<p>,<nve>), and <s> be
the remainder of <p> divided by ceil(<p>,<nve>). Then the first <m> VE nodes get
ceil(<p>,<nve>) processes each, the next VE node gets <s> processes, and the rest get
zero.
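As an illustration of these two rules, the following bash sketch (not part of SLURM for
VE) prints the resulting layout for given <ntasks>, <nnodes>, and <nve>; running it with
the arguments 72 3 8 reproduces the job script example below, namely 24 processes per VH
and 3 processes per VE node.

    ------------------------------------------------------------------------
    #!/bin/bash
    # Illustrative sketch of the process distribution rules described above.
    # Usage: ./ve_dist.sh <ntasks> <nnodes> <nve>
    if [ $# -ne 3 ]; then echo "usage: $0 <ntasks> <nnodes> <nve>"; exit 1; fi
    ntasks=$1; nnodes=$2; nve=$3

    q=$(( ntasks / nnodes ))          # floor(<ntasks>,<nnodes>)
    r=$(( ntasks % nnodes ))          # remainder of <ntasks> divided by <nnodes>
    for (( vh = 0; vh < nnodes; vh++ )); do
        if (( vh < r )); then p=$(( q + 1 )); else p=$q; fi
        if (( p == 0 )); then
            echo "VH $vh: 0 VE processes"
            continue
        fi
        c=$(( (p + nve - 1) / nve ))  # ceil(<p>,<nve>)
        m=$(( p / c ))                # VE nodes that get c processes each
        s=$(( p % c ))                # processes on the next VE node
        echo "VH $vh: $p processes -> $m VE node(s) x $c, then one VE node x $s, rest idle"
    done
    ------------------------------------------------------------------------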

For example, the following job script allocates three VHs with eight VE nodes each and
launches 24 processes onto the eight VE nodes allocated on each VH, resulting in three
processes executing on each VE node.

------------------------------------------------------------------------
#!/bin/bash
#SBATCH --overcommit
#SBATCH --nodes=3
#SBATCH --ntasks=72
#SBATCH --gres=ve:8

# Specifies a version of NEC MPI supporting SLURM for VE
source /opt/nec/ve/mpi/3.0.0/bin/necmpivars.sh

cd /usr/uhome/aurora/work

# The value of the environment variable SLURM_NTASKS is the same as that
# specified in the --ntasks option
mpirun -v -np ${SLURM_NTASKS} ./execvehime.sh
------------------------------------------------------------------------

(E) Advanced Features

To use the advanced features of SLURM for VE, please refer to the file
SLURM_for_VE_Advanced_Features.txt in the same directory.