SLURM for VE

We have developed an initial implementation of SLURM for VE using the gres feature of SLURM and NEC MPI for SLURM. Currently we have the following restrictions:

 * Only pure VE jobs are supported.
 * Jobs with VE offloading or VH calls are not supported.
 * The --overcommit option shall always be specified.
 * An identical number of VE nodes is always allocated on every VH allocated for a job.
 * Non-uniform process distribution is not supported; that is, VE processes are always
   distributed uniformly onto VHs and then onto the VE nodes on each VH.

Getting started

(A) System Requirements

(a) The following operating systems have been tested:
    CentOS Linux 7.6
    CentOS Linux 8.3
    Red Hat Enterprise Linux 8.4

(b) One server host and one or more execution hosts, each equipped with VE nodes.
    The server host manages all the execution hosts on which jobs are executed.
    The server host can also serve as an execution host.

(B) Installation

(a) Server Host
Perform all the steps as the root user unless otherwise stated.

(1) Install SLURM version 22.05.2, referring to
    https://slurm.schedmd.com/quickstart_admin.html

(2) Stop SLURM, if it is running, as follows:
    # systemctl stop slurmctld
    # systemctl stop slurmdbd

(3) Uninstall the SLURM package as follows, which causes packages depending on it to be
    uninstalled as well:
    # yum remove slurm
    Note that the package slurm-example-configs shall not be removed.

(4) Create the SLURM for VE rpm files from the file slurm-22.05.2.tar.bz2 as a normal user:
    $ rpmbuild -ta slurm-22.05.2.tar.bz2 --with ve

(5) Install the rpm files created in the previous step:
    # cd ~/rpmbuild/RPMS/x86_64/
    # yum localinstall slurm-22.05.2-1.el7.x86_64.rpm
    # yum localinstall slurm-slurmctld-22.05.2-1.el7.x86_64.rpm
    # yum localinstall slurm-slurmdbd-22.05.2-1.el7.x86_64.rpm
    # yum localinstall slurm-perlapi-22.05.2-1.el7.x86_64.rpm
    # yum localinstall slurm-for_ve-22.05.2-1.el7.x86_64.rpm

(6) Start SLURM:
    # systemctl start slurmctld
    # systemctl start slurmdbd

(b) Execution Hosts
Perform all the steps as the root user unless otherwise stated.

(1) Install SLURM version 22.05.2, referring to
    https://slurm.schedmd.com/quickstart_admin.html

(2) Stop SLURM, if it is running, as follows:
    # systemctl stop slurmd

(3) Uninstall the SLURM package as follows, which causes packages depending on it to be
    uninstalled as well:
    # yum remove slurm
    Note that the package slurm-example-configs shall not be removed.

(4) Create the SLURM for VE rpm files from the file slurm-22.05.2.tar.bz2 as a normal user:
    $ rpmbuild -ta slurm-22.05.2.tar.bz2 --with ve

(5) Install the rpm files created in the previous step:
    # cd ~/rpmbuild/RPMS/x86_64/
    # yum localinstall slurm-22.05.2-1.el7.x86_64.rpm
    # yum localinstall slurm-slurmd-22.05.2-1.el7.x86_64.rpm
    # yum localinstall slurm-perlapi-22.05.2-1.el7.x86_64.rpm
    # yum localinstall slurm-for_ve-22.05.2-1.el7.x86_64.rpm

(6) Start SLURM:
    # systemctl start slurmd
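Before moving on to the configuration below, it can be useful to confirm that the daemons
came up. The following quick check is not part of the packaged procedure; it only uses
standard systemd and SLURM commands:

    # systemctl is-active slurmctld slurmdbd    (on the server host)
    # systemctl is-active slurmd                (on each execution host)
    # scontrol ping                             (confirms that the controller responds)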
(C) Configuration

Perform all the steps as the root user on every host. In the following example, it is
assumed that the SLURM cluster consists of two execution hosts, vhost1 and vhost2, one of
which also serves as the server host.

(1) Obtain the number of VE nodes and the corresponding device file names.
    In the following example, eight VE nodes exist, which correspond to the device files
    /dev/veslot[0-7].

    $ ls -ld /dev/ve*
    crw-rw-rw-. 1 root root 238,   0 Jul 28 13:45 /dev/ve0
    crw-rw-rw-. 1 root root 238,   1 Jul 28 13:45 /dev/ve1
    crw-rw-rw-. 1 root root 238,   2 Jul 28 13:45 /dev/ve2
    crw-rw-rw-. 1 root root 238,   3 Jul 28 13:45 /dev/ve3
    crw-rw-rw-. 1 root root 238,   4 Jul 28 13:45 /dev/ve4
    crw-rw-rw-. 1 root root 238,   5 Jul 28 13:45 /dev/ve5
    crw-rw-rw-. 1 root root 238,   6 Jul 28 13:45 /dev/ve6
    crw-rw-rw-. 1 root root 238,   7 Jul 28 13:45 /dev/ve7
    crw-------. 1 root root 234,   0 Jul 28 13:45 /dev/ve_peermem
    lrwxrwxrwx. 1 root root        3 Jul 28 13:45 /dev/veslot0 -> ve0
    lrwxrwxrwx. 1 root root        3 Jul 28 13:45 /dev/veslot1 -> ve1
    lrwxrwxrwx. 1 root root        3 Jul 28 13:45 /dev/veslot2 -> ve3
    lrwxrwxrwx. 1 root root        3 Jul 28 13:45 /dev/veslot3 -> ve2
    lrwxrwxrwx. 1 root root        3 Jul 28 13:45 /dev/veslot4 -> ve4
    lrwxrwxrwx. 1 root root        3 Jul 28 13:45 /dev/veslot5 -> ve6
    lrwxrwxrwx. 1 root root        3 Jul 28 13:45 /dev/veslot6 -> ve7
    lrwxrwxrwx. 1 root root        3 Jul 28 13:45 /dev/veslot7 -> ve5

(2) Obtain the number of Host Channel Adapters (HCAs) and the corresponding device name
    definition file names.
    In the following example, two HCAs exist, which correspond to the device files
    /dev/infiniband/uverbs[0-1]. These device files correspond to the device name
    definition files /sys/class/infiniband_verbs/uverbs0/ibdev and
    /sys/class/infiniband_verbs/uverbs1/ibdev.

    $ ls -ld /dev/infiniband/*
    crw-------. 1 root root 231,  64 Jul 28 13:45 /dev/infiniband/issm0
    crw-------. 1 root root 231,  65 Jul 28 13:45 /dev/infiniband/issm1
    crw-rw-rw-. 1 root root  10,  56 Jul 28 13:45 /dev/infiniband/rdma_cm
    crw-rw-rw-. 1 root root 231, 224 Jul 28 13:45 /dev/infiniband/ucm0
    crw-rw-rw-. 1 root root 231, 225 Jul 28 13:45 /dev/infiniband/ucm1
    crw-------. 1 root root 231,   0 Jul 28 13:45 /dev/infiniband/umad0
    crw-------. 1 root root 231,   1 Jul 28 13:45 /dev/infiniband/umad1
    crw-rw-rw-. 1 root root 231, 192 Jul 28 13:45 /dev/infiniband/uverbs0
    crw-rw-rw-. 1 root root 231, 193 Jul 28 13:45 /dev/infiniband/uverbs1

(3) Edit the file /etc/slurm/slurm.conf.

(3-1) Add the GresTypes line with the value "ve,hca".
    GresTypes=ve,hca

(3-2) Add NodeName lines that define the Gres parameter with the numbers of VE nodes and
    HCAs for all the execution hosts.
    NodeName=vhost1 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=128828 Gres=ve:10b:8,hca:2 State=UNKNOWN
    NodeName=vhost2 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=128828 Gres=ve:10b:8,hca:2 State=UNKNOWN
    Set the values of the parameters other than Gres according to the output of the
    slurmd -C command. Refer to https://slurm.schedmd.com/slurm.conf.html for the details
    of the parameters.

(4) Edit the file /etc/slurm/gres.conf.

(4-1) Add NodeName lines that define the VE node device file names for all the execution
    hosts.
    NodeName=vhost1 Name=ve Type=10b File=/dev/veslot[0-7]
    NodeName=vhost2 Name=ve Type=10b File=/dev/veslot[0-7]

(4-2) Add NodeName lines that define the HCA device file names for all the execution
    hosts.
    NodeName=vhost1 Name=hca File=/dev/infiniband/uverbs0
    NodeName=vhost1 Name=hca File=/dev/infiniband/uverbs1
    NodeName=vhost2 Name=hca File=/dev/infiniband/uverbs0
    NodeName=vhost2 Name=hca File=/dev/infiniband/uverbs1
    Refer to https://slurm.schedmd.com/gres.conf.html for the details of the parameters.
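The Gres counts and file lists used in steps (3) and (4) can be derived directly from the
device files found in steps (1) and (2). The following small helper is illustrative only
(it is not shipped with the package) and assumes the /dev/veslot* naming and the 10b type
label shown in the examples above:

    ------------------------------------------------------------------------
    #!/bin/bash
    # Illustrative only: print the Gres string and the HCA device names for
    # this host, based on the device files enumerated in steps (1) and (2).
    nve=$(ls -d /dev/veslot* 2>/dev/null | wc -l)
    nhca=$(ls -d /sys/class/infiniband_verbs/uverbs* 2>/dev/null | wc -l)
    echo "Gres=ve:10b:${nve},hca:${nhca}"

    # The ibdev file under each uverbs directory holds the HCA device name.
    for d in /sys/class/infiniband_verbs/uverbs*; do
        echo "${d##*/}: $(cat "$d/ibdev")"
    done
    ------------------------------------------------------------------------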
(5) Configuration for the VE environment.
    One core per VE must be reserved for VEOS to run VE jobs. To reserve these cores, use
    SLURM's Core Specialization feature. The following describes the setting procedure
    using an example environment in which 2 sockets, 40 CPU cores, and 8 VEs are installed
    on a VH. For more information on the Core Specialization feature, see
    https://slurm.schedmd.com/core_spec.html.

(5-1) Make the following settings in /etc/slurm/slurm.conf.

    - Core specialization plugin
      CoreSpecPlugin=core_spec/none

    - Resource selection plugin
      SelectType=select/cons_res
      or
      SelectType=select/cons_tres

    - Task launch plugin
      TaskPlugin=task/cgroup
      TaskPluginParam=SlurmdOffSpec

    - Option that controls whether individual jobs may override the node's configured
      CoreSpecCount value
      AllowSpecResourcesUsage=NO
      or do not write this option (because "NO" is applied by default).

    - Cores reserved for VEOS
      Specify CpuSpecList on the line containing the resource information of the compute
      node in /etc/slurm/slurm.conf. The following is an example of reserving 8 cores
      starting from the lowest core number. Note that there is no particular restriction
      on which core numbers are reserved; any numbers may be used.

      Example setting for each VH node:
      NodeName=vhost1 CPUs=40 CpuSpecList=0,1,2,3,4,5,6,7 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=128828 Gres=ve:10b:8,hca:2 State=UNKNOWN

      Example setting for multiple VH nodes at once:
      NodeName=vhost[0-255] CPUs=40 CpuSpecList=0,1,2,3,4,5,6,7 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=128828 Gres=ve:10b:8,hca:2 State=UNKNOWN

      Note:
      1) Core numbers can also be given as ranges using hyphens, such as 0-7.
      2) CoreSpecCount is applied per socket, so if a certain total number of cores is to
         be reserved for VEOS, a different CoreSpecCount value must be set depending on
         the number of sockets on the machine. Using CoreSpecCount is not recommended
         because it is easy to get the number of reserved cores wrong.

(5-2) Make the following settings in /etc/slurm/cgroup.conf.

    - Option to constrain allowed cores to the subset of allocated resources
      ConstrainCores=yes

    - Option to constrain allowed devices
      ConstrainDevices=no
      or do not write this option (because "no" is applied by default).

(6) Configuration for cleaning up NEC MPI shared memory.
    When SLURM executes an NEC MPI job, the shared memory allocated by NEC MPI may not be
    released. In that case, SLURM must release the shared memory at the end of the job.
    Configure the following setting to clean up the shared memory.

    Setting:
    Set the fully qualified pathname of a script that cleans up NEC MPI shared memory on
    every node when a user's job completes. Edit the file /etc/slurm/slurm.conf as
    follows:

    $ cat /etc/slurm/slurm.conf | grep Epilog
    Epilog=/etc/slurm/relmpimem.sh

    Notes:
    The file /etc/slurm/relmpimem.sh is a script that cleans up NEC MPI shared memory.
    This script is included in the SLURM for VE rpm file. If you want to run this script
    together with other scripts when a job completes, set this script to run first.

    (e.g.) User's other epilog script: /usr/local/slurm/myepilog.sh

    $ cat /etc/slurm/slurm.conf | grep Epilog
    Epilog=/usr/local/slurm/myepilog.sh

    $ cat /usr/local/slurm/myepilog.sh
    #!/bin/bash
    # run the script to clean up NEC MPI shared memory first.
    /etc/slurm/relmpimem.sh
    #
    # then run the user's various epilog programs.
    #
    exit 0

(7) Restart SLURM:
    # systemctl restart slurmctld
    # systemctl restart slurmd
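With more than a couple of execution hosts, the restart in step (7) can be scripted from
the server host. The following is only a sketch and assumes that root can ssh without a
password to the execution hosts vhost1 and vhost2 used in this example:

    ------------------------------------------------------------------------
    #!/bin/bash
    # Illustrative only: restart the controller locally, then restart slurmd
    # on each execution host listed below.
    systemctl restart slurmctld
    for h in vhost1 vhost2; do
        ssh "$h" systemctl restart slurmd
    done
    ------------------------------------------------------------------------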
(8) Confirm the settings you made above.

(8-1) Confirm that Gres is properly configured with the command slurmd -G.

    [root@vhost1 ~]# /usr/sbin/slurmd -G
    slurmd: Gres Name=ve Type=10b Count=8 Index=0 ID=25974 File=/dev/veslot[0-7] (null)
    slurmd: Gres Name=hca Type=(null) Count=1 Index=0 ID=6382440 File=/dev/infiniband/uverbs0 (null)
    slurmd: Gres Name=hca Type=(null) Count=1 Index=0 ID=6382440 File=/dev/infiniband/uverbs1 (null)

    [root@vhost2 ~]# /usr/sbin/slurmd -G
    slurmd: Gres Name=ve Type=10b Count=8 Index=0 ID=25974 File=/dev/veslot[0-7] (null)
    slurmd: Gres Name=hca Type=(null) Count=1 Index=0 ID=6382440 File=/dev/infiniband/uverbs0 (null)
    slurmd: Gres Name=hca Type=(null) Count=1 Index=0 ID=6382440 File=/dev/infiniband/uverbs1 (null)

    Please refer to https://slurm.schedmd.com/slurmd.html for the details of the command
    slurmd.

(8-2) Confirm the slurm.conf and cgroup.conf settings on the server host with scontrol
    show config.

    $ scontrol show config
    AllowSpecResourcesUsage = No
    CoreSpecPlugin          = core_spec/none
    SelectType              = select/cons_tres
    TaskPlugin              = task/cgroup
    TaskPluginParam         = (null type)   # SlurmdOffSpec does not appear in the results

    Cgroup Support Configuration:
    ConstrainCores          = yes
    ConstrainDevices        = no

(8-3) Confirm the cores reserved for VEOS with scontrol show node.

    $ scontrol show node
    NodeName=vhost1 CoresPerSocket=10 ...
       CoreSpecCount=4 CPUSpecList=0-7 ...

    Note: CoreSpecCount is reported per socket on the VH; since the above example has
    2 sockets and 8 reserved cores, the value per socket is 4.

(D) Usage

The following #SBATCH options are relevant to VE in SLURM job script files:

    --overcommit
        (Mandatory)
    --nodes=<nnodes>
        (<nnodes> specifies the number of VHs, which shall be greater than one.
         The --nodes option shall not be specified for single-VH execution.)
    --ntasks=<ntasks>
        (<ntasks> specifies the total number of VE processes in a job.)
    --gres=ve:[10b:]<nve>
        (<nve> specifies the number of VE nodes on each VH.)

The number of VE processes assigned to each allocated VH is determined as follows:
Let <r> be the remainder of <ntasks> divided by <nnodes>, and floor(<ntasks>,<nnodes>) be
the maximum integer less than or equal to the quotient of <ntasks> divided by <nnodes>.
Then the first <r> VHs get floor(<ntasks>,<nnodes>) + 1 processes and the rest get
floor(<ntasks>,<nnodes>).

The number of VE processes assigned to each VE node allocated for a job on a VH is
determined as follows:
Let <p> be the number of VE processes assigned to the VH, ceil(<p>,<nve>) be the minimum
integer greater than or equal to the quotient of <p> divided by <nve>, <m> be the maximum
integer less than or equal to the quotient of <p> divided by ceil(<p>,<nve>), and <s> be
the remainder of <p> divided by ceil(<p>,<nve>). Then the first <m> VE nodes get
ceil(<p>,<nve>) processes each, the next VE node gets <s> processes, and the rest get
zero.
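As an illustration of these two rules, the following bash sketch (not part of SLURM for
VE) prints the resulting layout for given <ntasks>, <nnodes>, and <nve>; running it with
the arguments 72 3 8 reproduces the job script example below, namely 24 processes per VH
and 3 processes per VE node.

    ------------------------------------------------------------------------
    #!/bin/bash
    # Illustrative sketch of the process distribution rules described above.
    # Usage: ./ve_dist.sh <ntasks> <nnodes> <nve>
    if [ $# -ne 3 ]; then echo "usage: $0 <ntasks> <nnodes> <nve>"; exit 1; fi
    ntasks=$1; nnodes=$2; nve=$3

    q=$(( ntasks / nnodes ))          # floor(<ntasks>,<nnodes>)
    r=$(( ntasks % nnodes ))          # remainder of <ntasks> divided by <nnodes>
    for (( vh = 0; vh < nnodes; vh++ )); do
        if (( vh < r )); then p=$(( q + 1 )); else p=$q; fi
        if (( p == 0 )); then
            echo "VH $vh: 0 VE processes"
            continue
        fi
        c=$(( (p + nve - 1) / nve ))  # ceil(<p>,<nve>)
        m=$(( p / c ))                # VE nodes that get c processes each
        s=$(( p % c ))                # processes on the next VE node
        echo "VH $vh: $p processes -> $m VE node(s) x $c, then one VE node x $s, rest idle"
    done
    ------------------------------------------------------------------------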

For example, the following job script allocates three VHs with eight VE nodes each and
launches 24 processes onto the eight VE nodes allocated on each VH, resulting in three
processes executing on each VE node.

------------------------------------------------------------------------
#!/bin/bash
#SBATCH --overcommit
#SBATCH --nodes=3
#SBATCH --ntasks=72
#SBATCH --gres=ve:8

# Specifies a version of NEC MPI supporting SLURM for VE
source /opt/nec/ve/mpi/3.0.0/bin/necmpivars.sh

cd /usr/uhome/aurora/work

# The value of the environment variable SLURM_NTASKS is the same as that
# specified in the --ntasks option
mpirun -v -np ${SLURM_NTASKS} ./execvehime.sh
------------------------------------------------------------------------

(E) Advanced Features

To use the advanced features of SLURM for VE, please refer to the file
SLURM_for_VE_Advanced_Features.txt in the same directory.