SLURM for VE

We have developed an initial implementation of SLURM for VE using the SLURM gres feature and NEC MPI for SLURM. Currently we have the following restrictions:

* Only pure VE jobs are supported.
* Jobs with VE offloading or VH calls are not supported.
* The --overcommit option shall always be specified.
* An identical number of VE nodes is always allocated on every VH allocated for a job.
* Non-uniform process distribution is not supported; that is, VE processes are always distributed uniformly onto VHs and then onto the VE nodes on each VH.

Getting started

(A) System Requirements

(a) The following operating systems have been tested:
    CentOS Linux 7.6
    CentOS Linux 8.3
    Red Hat Enterprise Linux 8.4

(b) One server host and one or more execution hosts, each with VE nodes.
    The server host manages all the execution hosts on which jobs are executed. The server host may also serve as an execution host.

(B) Installation

(a) Server Host

Perform all the steps as the root user unless otherwise stated.

(1) Install SLURM version 20.11.5, referring to https://slurm.schedmd.com/quickstart_admin.html

(2) Stop SLURM, if it is running, as follows:
$ systemctl stop slurmctld
$ systemctl stop slurmdbd

(3) Uninstall the SLURM package as follows, which also uninstalls the packages depending on it:
$ yum remove slurm
Note that the package slurm-example-configs shall not be removed.

(4) Create the SLURM for VE rpm files from the file slurm-20.11.5.tar.bz2 as a normal user:
$ rpmbuild -ta slurm-20.11.5.tar.bz2

(5) Install the rpm files created in the previous step:
$ cd ~/rpmbuild/RPMS/x86_64/
$ yum localinstall slurm-20.11.5-1.el7.x86_64.rpm
$ yum localinstall slurm-slurmctld-20.11.5-1.el7.x86_64.rpm
$ yum localinstall slurm-slurmdbd-20.11.5-1.el7.x86_64.rpm
$ yum localinstall slurm-perlapi-20.11.5-1.el7.x86_64.rpm

(6) Start SLURM:
$ systemctl start slurmctld
$ systemctl start slurmdbd

(b) Execution Hosts

Perform all the steps as the root user unless otherwise stated.

(1) Install SLURM version 20.11.5, referring to https://slurm.schedmd.com/quickstart_admin.html

(2) Stop SLURM, if it is running, as follows:
$ systemctl stop slurmd

(3) Uninstall the SLURM package as follows, which also uninstalls the packages depending on it:
$ yum remove slurm
Note that the package slurm-example-configs shall not be removed.

(4) Create the SLURM for VE rpm files from the file slurm-20.11.5.tar.bz2 as a normal user:
$ rpmbuild -ta slurm-20.11.5.tar.bz2

(5) Install the rpm files created in the previous step:
$ cd ~/rpmbuild/RPMS/x86_64/
$ yum localinstall slurm-20.11.5-1.el7.x86_64.rpm
$ yum localinstall slurm-slurmd-20.11.5-1.el7.x86_64.rpm
$ yum localinstall slurm-perlapi-20.11.5-1.el7.x86_64.rpm

(6) Start SLURM:
$ systemctl start slurmd
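As an optional sanity check, the daemons and the installed version can be confirmed with standard systemd and SLURM commands; the unit names are the ones used in the start commands above.

$ systemctl status slurmctld slurmdbd   # on the server host
$ systemctl status slurmd               # on each execution host
$ sinfo --version                       # should report slurm 20.11.5
$ scontrol ping                         # confirms that slurmctld responds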
(C) Configuration

Perform all the steps as the root user on every host. In the following example, it is assumed that the SLURM complex consists of two execution hosts, vhost1 and vhost2, one of which also serves as the server host.

(1) Obtain the number of VE nodes and the corresponding device file names.
In the following example, eight VE nodes exist, which correspond to the device files /dev/veslot[0-7].

$ ls -ld /dev/ve*
crw-rw-rw-. 1 root root 238, 0 Jul 28 13:45 /dev/ve0
crw-rw-rw-. 1 root root 238, 1 Jul 28 13:45 /dev/ve1
crw-rw-rw-. 1 root root 238, 2 Jul 28 13:45 /dev/ve2
crw-rw-rw-. 1 root root 238, 3 Jul 28 13:45 /dev/ve3
crw-rw-rw-. 1 root root 238, 4 Jul 28 13:45 /dev/ve4
crw-rw-rw-. 1 root root 238, 5 Jul 28 13:45 /dev/ve5
crw-rw-rw-. 1 root root 238, 6 Jul 28 13:45 /dev/ve6
crw-rw-rw-. 1 root root 238, 7 Jul 28 13:45 /dev/ve7
crw-------. 1 root root 234, 0 Jul 28 13:45 /dev/ve_peermem
lrwxrwxrwx. 1 root root 3 Jul 28 13:45 /dev/veslot0 -> ve0
lrwxrwxrwx. 1 root root 3 Jul 28 13:45 /dev/veslot1 -> ve1
lrwxrwxrwx. 1 root root 3 Jul 28 13:45 /dev/veslot2 -> ve3
lrwxrwxrwx. 1 root root 3 Jul 28 13:45 /dev/veslot3 -> ve2
lrwxrwxrwx. 1 root root 3 Jul 28 13:45 /dev/veslot4 -> ve4
lrwxrwxrwx. 1 root root 3 Jul 28 13:45 /dev/veslot5 -> ve6
lrwxrwxrwx. 1 root root 3 Jul 28 13:45 /dev/veslot6 -> ve7
lrwxrwxrwx. 1 root root 3 Jul 28 13:45 /dev/veslot7 -> ve5

(2) Obtain the number of Host Channel Adapters (HCAs) and the corresponding device name definition file names.
In the following example, two HCAs exist, which correspond to the device files /dev/infiniband/uverbs[0-1]. The device files correspond to the device name definition files /sys/class/infiniband_verbs/uverbs0/ibdev and /sys/class/infiniband_verbs/uverbs1/ibdev.

$ ls -ld /dev/infiniband/*
crw-------. 1 root root 231, 64 Jul 28 13:45 /dev/infiniband/issm0
crw-------. 1 root root 231, 65 Jul 28 13:45 /dev/infiniband/issm1
crw-rw-rw-. 1 root root 10, 56 Jul 28 13:45 /dev/infiniband/rdma_cm
crw-rw-rw-. 1 root root 231, 224 Jul 28 13:45 /dev/infiniband/ucm0
crw-rw-rw-. 1 root root 231, 225 Jul 28 13:45 /dev/infiniband/ucm1
crw-------. 1 root root 231, 0 Jul 28 13:45 /dev/infiniband/umad0
crw-------. 1 root root 231, 1 Jul 28 13:45 /dev/infiniband/umad1
crw-rw-rw-. 1 root root 231, 192 Jul 28 13:45 /dev/infiniband/uverbs0
crw-rw-rw-. 1 root root 231, 193 Jul 28 13:45 /dev/infiniband/uverbs1

(3) Edit the file /etc/slurm/slurm.conf.

(3-1) Add the GresTypes line with the value "ve,hca".

GresTypes=ve,hca

(3-2) Add NodeName lines defining the value of the parameter Gres as the numbers of VE nodes and HCAs for all the execution hosts.

NodeName=vhost1 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=128828 Gres=ve:10b:8,hca:2 State=UNKNOWN
NodeName=vhost2 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=128828 Gres=ve:10b:8,hca:2 State=UNKNOWN

Set the values of the parameters other than Gres according to the output of the slurmd -C command. Refer to https://slurm.schedmd.com/slurm.conf.html for the details of the parameters.

(4) Edit the file /etc/slurm/gres.conf.

(4-1) Add NodeName lines defining the VE node device file names for all the execution hosts.

NodeName=vhost1 Name=ve Type=10b File=/dev/veslot[0-7]
NodeName=vhost2 Name=ve Type=10b File=/dev/veslot[0-7]

(4-2) Add NodeName lines defining the HCA device name definition file names for all the execution hosts.

NodeName=vhost1 Name=hca File=/sys/class/infiniband_verbs/uverbs0/ibdev
NodeName=vhost1 Name=hca File=/sys/class/infiniband_verbs/uverbs1/ibdev
NodeName=vhost2 Name=hca File=/sys/class/infiniband_verbs/uverbs0/ibdev
NodeName=vhost2 Name=hca File=/sys/class/infiniband_verbs/uverbs1/ibdev

Refer to https://slurm.schedmd.com/gres.conf.html for the details of the parameters.

(5) Restart SLURM:
$ systemctl restart slurmd

(6) Confirm that Gres is properly configured with the command slurmd -G.

[root@vhost1 ~]# /usr/sbin/slurmd -G
slurmd: Gres Name=ve Type=10b Count=8 Index=0 ID=25974 File=/dev/veslot[0-7] (null)
slurmd: Gres Name=hca Type=(null) Count=1 Index=0 ID=6382440 File=/sys/class/infiniband_verbs/uverbs0/ibdev (null)
slurmd: Gres Name=hca Type=(null) Count=1 Index=0 ID=6382440 File=/sys/class/infiniband_verbs/uverbs1/ibdev (null)

[root@vhost2 ~]# /usr/sbin/slurmd -G
slurmd: Gres Name=ve Type=10b Count=8 Index=0 ID=25974 File=/dev/veslot[0-7] (null)
slurmd: Gres Name=hca Type=(null) Count=1 Index=0 ID=6382440 File=/sys/class/infiniband_verbs/uverbs0/ibdev (null)
slurmd: Gres Name=hca Type=(null) Count=1 Index=0 ID=6382440 File=/sys/class/infiniband_verbs/uverbs1/ibdev (null)

Refer to https://slurm.schedmd.com/slurmd.html for the details of the command slurmd.
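Optionally, the Gres configuration registered with the controller can also be checked from the server host with the standard SLURM commands sinfo and scontrol (vhost1 is the example host name used above):

$ sinfo -N -o "%N %G"
$ scontrol show node vhost1 | grep -i Gres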
(D) Usage

The following #SBATCH options are relevant to VE in SLURM job script files, where <nnodes>, <ntasks>, and <nves> denote user-specified values:

--overcommit (Mandatory)
--nodes=<nnodes> (specifies the number of VHs, which shall be greater than one; the --nodes option shall not be specified for one-VH execution, as in the sketch after this list)
--ntasks=<ntasks> (specifies the total number of VE processes in a job)
--gres=ve:[10b:]<nves> (specifies the number of VE nodes on each VH)
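For one-VH execution, a job script might look like the following sketch, in which --nodes is omitted. The executable name ve.out is only a placeholder, and the NEC MPI path shall match the installed version.

------------------------------------------------------------------------
#!/bin/bash
#SBATCH --overcommit
#SBATCH --ntasks=8        # eight VE processes in total
#SBATCH --gres=ve:2       # two VE nodes on the single VH

# Specifies a version of NEC MPI supporting SLURM for VE
source /opt/nec/ve/mpi/3.0.0/bin/necmpivars.sh

cd /usr/uhome/aurora/work

# ve.out is a placeholder for the user's VE executable
mpirun -v -np ${SLURM_NTASKS} ./ve.out
------------------------------------------------------------------------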

The number of VE processes assigned to each allocated VH is determined as follows. Let <r> be the remainder of <ntasks> divided by <nnodes>, and let floor(<ntasks>/<nnodes>) be the largest integer less than or equal to the quotient of <ntasks> divided by <nnodes>. Then the first <r> VHs get floor(<ntasks>/<nnodes>) + 1 processes each, and the remaining VHs get floor(<ntasks>/<nnodes>) processes each.

The number of VE processes assigned to each VE node allocated for a job on a VH is determined as follows. Let <p> be the number of VE processes assigned to the VH, let ceil(<p>/<nves>) be the smallest integer greater than or equal to the quotient of <p> divided by <nves>, let <q> be the largest integer less than or equal to the quotient of <p> divided by ceil(<p>/<nves>), and let <s> be the remainder of <p> divided by ceil(<p>/<nves>). Then the first <q> VE nodes get ceil(<p>/<nves>) processes each, the next VE node gets <s> processes, and the rest get zero.
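The following shell sketch illustrates these two rules with the same values as the example below (<ntasks>=72, <nnodes>=3, <nves>=8). It is only an illustration; it does not query SLURM.

------------------------------------------------------------------------
#!/bin/bash
# Illustrative values; in a real job they come from --ntasks, --nodes and --gres.
ntasks=72    # total number of VE processes
nnodes=3     # number of VHs
nves=8       # VE nodes per VH

r=$(( ntasks % nnodes ))    # remainder of <ntasks> divided by <nnodes>
f=$(( ntasks / nnodes ))    # floor(<ntasks>/<nnodes>)

for (( vh = 0; vh < nnodes; vh++ )); do
    # The first <r> VHs get one extra process.
    if (( vh < r )); then p=$(( f + 1 )); else p=$f; fi
    if (( p == 0 )); then echo "VH $vh: 0 processes"; continue; fi

    c=$(( (p + nves - 1) / nves ))   # ceil(<p>/<nves>)
    q=$(( p / c ))                   # VE nodes that get c processes each
    s=$(( p % c ))                   # processes on the next VE node (may be 0)
    echo "VH $vh: $p processes -> $q VE node(s) x $c, next VE node x $s, rest 0"
done
------------------------------------------------------------------------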

For example, the following job script allocates three VHs with eight VE nodes each and launches 24 processes onto the eight VE nodes allocated on each VH, resulting in three processes executing on each VE node.

------------------------------------------------------------------------
#!/bin/bash
#SBATCH --overcommit
#SBATCH --nodes=3
#SBATCH --ntasks=72
#SBATCH --gres=ve:8

# Specifies a version of NEC MPI supporting SLURM for VE
source /opt/nec/ve/mpi/3.0.0/bin/necmpivars.sh

cd /usr/uhome/aurora/work

# The value of the environment variable SLURM_NTASKS is the same as that
# specified in the --ntasks option
mpirun -v -np ${SLURM_NTASKS} ./execvehime.sh
------------------------------------------------------------------------
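If the script above is saved as, for example, job.sh, it can be submitted and monitored with the standard SLURM commands:

$ sbatch job.sh
$ squeue -u $USER
$ scancel <jobid>    # if the job needs to be cancelled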