3.1   Compiling and Linking MPI Programs
Firstly, please execute the following command to read a setup script each time you log in to a VH, in
order to set up the MPI compilation environment. {version} is the directory name corresponding to the
version of NEC MPI you use.
The setting is available until you log out.
(For bash)
$ source /opt/nec/ve/mpi/{version}/bin/necmpivars.sh
(For VE30: $ source /opt/nec/ve3/mpi/{version}/bin/necmpivars.sh)
(For csh)
% source /opt/nec/ve/mpi/{version}/bin/necmpivars.csh
(For VE30: % source /opt/nec/ve3/mpi/{version}/bin/necmpivars.csh)
It is possible to compile and link MPI programs with the MPI compilation command corresponding to each programming language, as follows. In the command lines below, {sourcefiles} means MPI program source files, and [options] means optional compiler options.
To compile and link MPI programs written in Fortran, please execute the mpinfort/mpifort command as follows:
$ mpinfort [options] {sourcefiles}
To compile and link MPI programs written in C, please execute the mpincc/mpicc command as follows:
$ mpincc [options] {sourcefiles}
To compile and link MPI programs written in C++, please execute the mpinc++/mpic++ command as follows:
$ mpinc++ [options] {sourcefiles}
The NEC MPI compilation commands mpincc/mpicc, mpinc++/mpic++ and mpinfort/mpifort use the default versions of the compilers ncc, nc++ and nfort, respectively. If another compiler version must be used, the NEC MPI compilation command option -compiler or an environment variable can be used to select it. In this case, the compiler version and the NEC MPI version must be selected carefully to match each other.
Example: compiling and linking a C program with compiler version 2.x.x:
$ mpincc -compiler /opt/nec/ve/bin/ncc-2.x.x program.c
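The compiler can also be selected with the environment variables listed in Table 3-2. The following is a minimal sketch of selecting a specific VE compiler version via NMPI_CC; the compiler path is only an example and should be adjusted to your installation.
$ export NMPI_CC=/opt/nec/ve/bin/ncc-2.x.x
$ mpincc program.c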
Table 3-1 The List of NEC MPI Compiler Commands Options

-mpimsgq | -msgq
  Use the MPI message queue facility for the debugger.

-mpiprof
  Use the MPI communication information facility and the MPI profiling interface (MPI procedures with names beginning with PMPI_). Please refer to this section for the MPI communication information facility.

-mpitrace
  Use the MPI procedures tracing facility. The MPI communication information facility and MPI profiling interface are also available. Please refer to this section for the MPI procedures tracing facility.

-mpiverify
  Use the debug assist feature for MPI collective procedures. The MPI communication information facility and MPI profiling interface are also available. Please refer to this section for the debug assist feature for MPI collective procedures.

-ftrace
  Use the FTRACE facility for MPI programs. The MPI communication information facility and MPI profiling interface are also available. Please refer to this section for the FTRACE facility.

-show
  Display the sequence of compiler executions invoked by the MPI compilation command without actually executing them.

-ve
  Compile and link MPI programs to run on VE (default).

-vh | -sh
  Compile and link MPI programs to run on VH or SH.

-static-mpi
  Link against MPI libraries statically; the MPI memory management library is still linked dynamically (default).

-shared-mpi
  Link against all MPI libraries dynamically.

-compiler <compiler>
  Specify the compiler invoked by the MPI compilation command, following a space. If this option is not specified, each compilation command invokes the compiler shown in the table below. The compilers listed below are supported to compile and link MPI programs to run on VH or Scalar Host. Note that the support period for these compilers by NEC MPI is equivalent to that of the compiler itself. For compatibility reasons, NEC MPI will still be released for compilers whose support has ended, but updates such as functional improvements may not be released. See also 2.10 about using the mpi_f08 Fortran module.
- GNU Compiler Collection
- 4.8.5
- 8.3.0 and 8.3.1
- 8.4.0 and 8.4.1
- 8.5.0
- 9.1.0 and compatible version
- Intel C++ Compiler and Intel Fortran Compiler
- 19.0.4.243 (Intel Parallel Studio XE 2019 Update 4) and compatible version
- 19.1.2.254 (Intel Parallel Studio XE 2020 Update 2)
- NVIDIA Cuda compiler
- 11.1
- 11.8
- NVIDIA HPC SDK compiler
- 22.7
Compilation Command | Invoked Compiler
---|---
mpincc/mpicc | ncc
mpinc++/mpic++ | nc++
mpinfort/mpifort | nfort

Compilation Command with -vh/-sh | Invoked Compiler
---|---
mpincc/mpicc | gcc
mpinc++/mpic++ | g++
mpinfort/mpifort | gfortran

-compiler_host <compiler>
  For a VH or scalar node, if the compiler specified with the -compiler option is neither a GNU nor an Intel compiler (nvcc for CUDA, for example), this option must specify a GNU or Intel compiler compatible with the compiler specified by the -compiler option. Note that if that GNU or Intel compiler is identical to the default one for NEC MPI (see the -compiler option above), this option can be omitted.

-mpifp16 <binary16|bfloat16>
  Assume that the MPI primitive data types NEC_MPI_FLOAT16 and MPI_REAL2 have the format specified with this option, regardless of the floating-point binary format option -mfp16-format. The default is binary16 if the -mfp16-format option is omitted.
Table 3-2 The List of Environment Variables of NEC MPI Compiler Commands

Environment Variable | Meaning
---|---
NMPI_CC | Change the compiler used to compile and link an MPI program for VE with the mpincc command.
NMPI_CXX | Change the compiler used to compile and link an MPI program for VE with the mpinc++ command.
NMPI_FC | Change the compiler used to compile and link an MPI program for VE with the mpinfort command.
NMPI_CC_H | Change the compiler used to compile and link an MPI program for VH or Scalar Host with the mpincc command.
NMPI_CXX_H | Change the compiler used to compile and link an MPI program for VH or Scalar Host with the mpinc++ command.
NMPI_FC_H | Change the compiler used to compile and link an MPI program for VH or Scalar Host with the mpinfort command.

The environment variables in Table 3-2 are overridden by the -compiler option.
An example for each compiler is shown below.

example1: NEC Compiler
$ source /opt/nec/ve/mpi/3.x.x/bin/necmpivars.sh
  (For VE30: $ source /opt/nec/ve3/mpi/3.x.x/bin/necmpivars.sh)
$ mpincc a.c
$ mpinc++ a.cpp
$ mpinfort a.f90

example2: GNU compiler
(set up the GNU compiler, e.g. PATH, LD_LIBRARY_PATH)
$ source /opt/nec/ve/mpi/3.x.x/bin/necmpivars.sh
  (For VE30: $ source /opt/nec/ve3/mpi/3.x.x/bin/necmpivars.sh)
$ mpincc -vh a.c
$ mpinc++ -vh a.cpp
$ mpinfort -vh a.f90

example3: Intel compiler
(set up the Intel compiler, e.g. PATH, LD_LIBRARY_PATH)
$ source /opt/nec/ve/mpi/3.x.x/bin/necmpivars.sh
  (For VE30: $ source /opt/nec/ve3/mpi/3.x.x/bin/necmpivars.sh)
$ export NMPI_CC_H=icc
$ export NMPI_CXX_H=icpc
$ export NMPI_FC_H=ifort
$ mpincc -vh a.c
$ mpinc++ -vh a.cpp
$ mpinfort -vh a.f90

example4: NVIDIA HPC SDK compiler
(set up the NVIDIA HPC SDK compiler, e.g. PATH, LD_LIBRARY_PATH)
$ source /opt/nec/ve/mpi/3.x.x/bin/necmpivars.sh
  (For VE30: $ source /opt/nec/ve3/mpi/3.x.x/bin/necmpivars.sh)
$ export NMPI_CC_H=nvc
$ export NMPI_CXX_H=nvc++
$ export NMPI_FC_H=nvfortran
$ mpincc -vh a.c
$ mpinc++ -vh a.cpp
$ mpinfort -vh a.f90
If an MPI process running on a VH or a scalar node uses VEO or CUDA features, programs can be compiled and linked as follows.
For VEO
Please specify explicitly options for VEO include files and libraries.
$ mpincc -vh mpi-veo.c -o mpi-veo -I/opt/nec/ve/veos/include -L/opt/nec/ve/veos/lib64 -Wl,-rpath=/opt/nec/ve/veos/lib64 -lveo
Please see also "The Tutorial and API Reference of Alternative VE Offloading" about the usage of VEO.
For CUDA
Please specify a compiler for CUDA, nvcc for example.
$ mpincc -sh -compiler nvcc --cudart shared mpi-cuda.c -o mpi-cuda
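When the compiler given with -compiler is not a GNU or Intel compiler and the compatible host compiler is not the NEC MPI default, the -compiler_host option described in Table 3-1 can be added. The following is a hypothetical sketch that assumes icc is the Intel compiler compatible with nvcc on your system:
$ mpincc -sh -compiler nvcc -compiler_host icc --cudart shared mpi-cuda.c -o mpi-cuda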
3.2   Starting MPI Programs
Before use, please set up your compiler referring to 3.1, and execute the following command to read a setup script each time you log in to a VH, in order to set up the MPI execution environment.
{version} is the directory name corresponding to the
version of NEC MPI you use.
This setting is available until you log out.
(For bash)
$ source /opt/nec/ve/mpi/{version}/bin/necmpivars.sh
(For VE30: $ source /opt/nec/ve3/mpi/{version}/bin/necmpivars.sh)
(For csh)
% source /opt/nec/ve/mpi/{version}/bin/necmpivars.csh
(For VE30: % source /opt/nec/ve3/mpi/{version}/bin/necmpivars.csh)
By default, the MPI libraries of the same version as the one used for compiling and linking are searched, and the MPI program is dynamically linked against them as needed. By loading the setup script, the MPI libraries corresponding to the above {version} are searched instead. Thus, when the MPI program has been dynamically linked against all MPI libraries with -shared-mpi, you can switch at runtime to the MPI libraries corresponding to the above {version}.
When -shared-mpi is not specified at compile and link time, the MPI program is dynamically linked against the MPI memory management library and statically linked against the other MPI libraries. The statically linked MPI libraries cannot be changed at runtime.
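As a quick way to check which MPI libraries a program will pick up dynamically, the ELF dynamic section can be inspected. This is a sketch only; a.c is a placeholder source file and the build uses the VH target so the result matches the readelf example shown below.
$ mpincc -vh -shared-mpi a.c -o vh.out
$ /usr/bin/readelf -W -d vh.out | grep -E 'NEEDED|RUNPATH'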
If you use hybrid execution, which consists of vector processes and scalar processes, execute the command below instead of the one above. By loading the setup script with the command below, the MPI program executed on a VH or a scalar host, in addition to the one executed on VE, is also dynamically linked against the MPI libraries corresponding to the {version} below.
{version} is the directory name corresponding to the version of NEC MPI that contains the MPI libraries the MPI program is dynamically linked against. Specify [gnu|intel] as the first argument and [compiler-version] as the second argument, where [compiler-version] is the compiler version used at compile and link time. You can obtain the value of each argument from the RUNPATH of the MPI program. In the example below, the first argument is the compiler family component of the RUNPATH (gnu) and the second argument is the compiler version component (9.1.0).
(For bash)
$ source /opt/nec/ve/mpi/{version}/bin/necmpivars.sh [gnu|intel] [compiler-version]
(For VE30: $ source /opt/nec/ve3/mpi/{version}/bin/necmpivars.sh [gnu|intel] [compiler-version])
(For csh)
% source /opt/nec/ve/mpi/{version}/bin/necmpivars.csh [gnu|intel] [compiler-version]
(For VE30: % source /opt/nec/ve3/mpi/{version}/bin/necmpivars.csh [gnu|intel] [compiler-version])
$ /usr/bin/readelf -W -d vh.out | grep RUNPATH
 0x000000000000001d (RUNPATH)            Library runpath: [/opt/nec/ve/mpi/2.3.0/lib64/vh/gnu/9.1.0]
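For the RUNPATH shown above, the setup script would be loaded as follows (the NEC MPI version 2.3.0, the compiler family gnu, and the compiler version 9.1.0 are taken from the example output):
$ source /opt/nec/ve/mpi/2.3.0/bin/necmpivars.sh gnu 9.1.0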
NEC MPI provides the MPI execution commands mpirun and mpiexec to launch MPI programs. Any of the following command lines is available:
$ mpirun [global-options] [local-options] {MPIexec} [args] [ : [local-options] {MPIexec} [args] ]...
$ mpiexec [global-options] [local-options] {MPIexec} [args] [ : [local-options] {MPIexec} [args] ]...
The MPI execution commands support executing MPI programs linked with MPI libraries that are the same or older than the command version.
If you use the MPI execution commands located in the system standard path /opt/nec/ve/bin, load necmpivars.sh or necmpivars.csh before executing the MPI program.
If you use a specific version of the MPI execution commands that is not located in the system standard path, load necmpivars-runtime.sh or necmpivars-runtime.csh located in the /opt/nec/ve/mpi/{version}/bin directory instead of necmpivars.sh or necmpivars.csh. {version} is the directory name corresponding to the version of NEC MPI that contains the MPI execution commands to use. necmpivars-runtime.sh and necmpivars-runtime.csh can be used in the same way as necmpivars.sh and necmpivars.csh, and, in addition to the settings configured by necmpivars.sh and necmpivars.csh, they configure the environment so that the specified version of the MPI execution commands is used.
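A minimal sketch of using a specific version of the execution commands; {version} is left as a placeholder and ve.out is an example executable:
$ source /opt/nec/ve/mpi/{version}/bin/necmpivars-runtime.sh
$ mpirun -np 4 ./ve.out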
Note that a specific version of the MPI execution commands that is not located in the system standard path cannot be used in an NQSV request submitted to a batch queue for which MPD is selected as the NEC MPI Process Manager. If you load necmpivars-runtime.sh or necmpivars-runtime.csh in such a request, the following warning message is shown and the setting for the MPI execution is not configured.
necmpivars-runtime.sh: Warning: This script cannot be used in NQSV Request submitted to a batch queue that MPD is selected as NEC MPI Process Manager.
3.2.1   Specification of Program Execution
The following can be specified as the MPI-execution specification {MPIexec} in the MPI execution commands:
Specify an MPI executable file {execfile} as follows:
$ mpirun -np 2 {execfile}
Specify a shell script that executes an MPI executable file {execfile} as follows:
$ cat shell.sh
#!/bin/sh
{execfile}
$ mpirun -np 2 ./shell.sh
The explanation above is based on the assumption that the Linux binfmt_misc capability has been configured, which is the default software development environment in the SX-Aurora TSUBASA. The configuration of the binfmt_misc capability requires the system administrator privileges. Please refer to "SX-Aurora TSUBASA Installation Guide", or contact the system administrator for details.
It is possible to execute MPI programs by specifying MPI-execution specification {MPIexec} as follows, even in the case that the binfmt_misc capability has not been configured.
- The ve_exec command "/opt/nec/ve/bin/ve_exec" and an MPI executable file {execfile}
Specify the ve_exec command "/opt/nec/ve/bin/ve_exec" and an MPI executable file {execfile} as follows:
$ mpirun -np 2 /opt/nec/ve/bin/ve_exec {execfile}
- Shell script that specifies the ve_exec command "/opt/nec/ve/bin/ve_exec" and an MPI executable file {execfile}
Specify a shell script that specifies the ve_exec command "/opt/nec/ve/bin/ve_exec" and an MPI executable file {execfile} as follows:
$ cat shell.sh
#!/bin/sh
/opt/nec/ve/bin/ve_exec {execfile}
$ mpirun -np 2 ./shell.sh
3.2.2   Runtime Options
The term host in runtime options indicates a VH or a VE. Please refer to
the clause for how to specify hosts.
The following table shows available global options.
Table 3-3 The List of Global Options

-machinefile | -machine <filename>
A file that describes hosts and the number of processes to be launched. The format is "hostname[:value]" per line. The default value of the number of processes (":value") is 1 if it is omitted. (See the example after Table 3-3.)

-configfile <filename>
A file containing runtime options. In the file <filename>, specify one or more option lines. Runtime options and MPI-execution specifications {MPIexec}, such as an MPI executable file, are specified on each line. If a line begins with "#", that line is treated as a comment. (See the example after Table 3-3.)

-hosts <host-list>
Comma-separated list of hosts on which MPI processes are launched. When the options -hosts and -hostfile are specified more than once, the hosts specified in each successive option are treated as a continuation of the list of the specified hosts. This option must not be specified together with the option -host, -nn, or -node.

-hostfile | -f <filename>
Name of a file that specifies hosts on which MPI processes are launched. When the options -hosts, -f and -hostfile are specified more than once, the hosts specified in each successive option are treated as a continuation of the list of the specified hosts. This option must not be specified together with the option -host, -nn, or -node.

-gvenode
Hosts specified in the options indicate VEs.

-perhost | -ppn | -N | -npernode | -nnp <value>
MPI processes in groups of the specified number <value> are assigned to respective hosts. The assignment of MPI processes to hosts is performed circularly until every process is assigned to a host. When this option is omitted, the default value is (P+H-1)/H, where P is the total number of MPI processes and H is the number of hosts.

-launcher-exec <fullpath>
Full path name of the remote shell that launches MPI daemons. The default value is /usr/bin/ssh. This option is available only in the interactive execution.

-max_np | -universe_size <max_np>
Specify the maximum number of MPI processes, including MPI processes dynamically generated at runtime. The default value is the number specified with the -np option. If several -np options are specified, the default value is the sum of the numbers specified with the options.

-multi
Specify that the MPI program is executed on multiple hosts. Use this option if all MPI processes are generated on a single host at the start of program execution and MPI processes are then generated on the other hosts by the MPI dynamic process generation function, resulting in multiple-host execution.

-genv <varname> <value>
Pass the environment variable <varname> with the value <value> to all MPI processes.

-genvall
(Default) Pass all environment variables to all MPI processes, except for the default environment variables set by NQSV in the NQSV request execution or set by PBS in the PBS request execution.

-genvlist <varname-list>
Comma-separated list of environment variables to be passed to all MPI processes.

-genvnone
Do not pass any environment variables.

-gpath <dirname>
Set the PATH environment variable passed to all MPI processes to <dirname>.

-gumask <mode>
Execute "umask <mode>" for all MPI processes.

-gwdir <dirname>
Set the working directory in which all MPI processes run to <dirname>.

-gdb | -debug
Open one debug screen per MPI process, and run MPI programs under the gdb debugger.

-display | -disp <X-server>
X display server for debug screens in the format "host:display" or "host:display:screen".

-gvh | -gsh
Specify that executables should run by default on Vector Hosts or Scalar Hosts. Note: When running some executables on VE, it is necessary to use an option such as -ve to indicate that those executables should run on VE.

-vpin | -vpinning | -vnuma
Print information on the assigned cpu ids of MPI processes on VHs, scalar hosts, or NUMA nodes on VEs. This option is valid for the -pin_mode, -cpu_list, -numa, and -nnuma options.

-v | -V | -version
Display the version of NEC MPI and runtime information such as environment variables.

-h | -help
Display help for the MPI execution commands.
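The following sketch illustrates the -machinefile and -configfile formats described above; host1, host2, the file names, and ve.out are placeholders.
$ cat machines
host1:2
host2:2
$ mpirun -machinefile machines -np 4 ./ve.out

$ cat options.conf
# one option line per MPI-execution specification; "#" starts a comment
-host host1 -np 2 ./ve.out
-host host2 -np 2 ./ve.out
$ mpirun -configfile options.conf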
Only one of the local options in the following table can be specified to each MPI executable file. When all of them are omitted, the host specified in runtime options indicates a VH.
Table 3-4 The List of Local Options

-ve <first>[-<last>]
The range of VEs on which MPI processes are executed. If this option is specified, the term host in runtime options indicates a VH.
In the interactive execution, specify the range of VE numbers.
In the NQSV request execution, specify the range of logical VE numbers.
<first> indicates the first VE number, and <last> the last VE number. <last> must not be smaller than <first>. When -<last> is omitted, -<first> is assumed to be specified.
The specified VEs are the ones attached to VHs specified immediately before this option in local options or specified in global options.
If this option is omitted and no VEs are specified, VE#0 is assumed to be specified. If this option is omitted and neither hosts nor the number of hosts are specified in the NQSV request execution, all VEs assigned by NQSV are assumed to be specified.
-nve <value>
The number of VEs on which MPI processes are executed. Corresponds to: -ve 0-<value-1>
The specified number of VEs are the ones attached to VHs specified immediately before this option in local options or specified in global options.
-venode
The term host in the options indicates a VE.
-vh | -sh
Create MPI processes on Vector Hosts or Scalar Hosts.
-host <host>
One host on which MPI processes are launched.
-node <hostrange>
The range of hosts on which MPI processes are launched. Please refer to this section for the format of <hostrange>.
In the interactive execution, the -venode option also needs to be specified.
This option must not be specified together with the option -host, -nn, or -node.
-nn <value>
The number of hosts on which MPI processes are launched.
This option can be specified only once corresponding to each MPI executable file.
If this option is omitted and host or the number of hosts are not specified, the total number of hosts assigned is assumed to be specified.
If the option -hosts, -hostfile, -f or -host is specified, this option is ignored.
-numa <first>[-<last>][,<...>]
The range of NUMA nodes on VE on which MPI processes are executed.
<first> indicates the first NUMA node number, and <last> the last NUMA node number. <last> must not be smaller than <first>. When -<last> is omitted, -<first> is assumed to be specified.
-nnuma <value>
The number of NUMA nodes on VE on which MPI processes are executed. Corresponds to: -numa 0-<value-1>
-c | -n | -np <value>
The total number of processes launched on the corresponding hosts.
The specified processes correspond to the hosts specified immediately before this option in local options or specified in global options.
When this option is omitted, the default value is 1.
-ve_nnp | -nnp_ve | -vennp <value>
The number of processes launched per VE.
This option is ignored where other options that specify the number of MPI processes to be launched, such as the -np option, -nnp option and so on, are specified. This option cannot be used where the -gvenode option or -venode option is specified.
When this option is omitted, the default value is 1.
-env <varname> <value>
Pass the environment variable <varname> with the value <value> to MPI processes.
-envall
(Default) Pass all environment variables to MPI processes except the default environment variables set by NQSV in the NQSV request execution or set by PBS in the PBS request execution.
-envlist <varname-list>
Comma-separated list of environment variables to be passed.
-envnone
Do not pass any environment variables.
-path <dirname>
Set the PATH environment variable passed to MPI processes to <dirname>.
-umask <mode>
Execute "umask <mode>" for MPI processes.
-wdir <dirname>
Set the working directory in which MPI processes run to <dirname>.
-ib_vh_memcpy_send <auto | on | off>
Use VH memory copy on the sender side of a VE process for InfiniBand communication. This option has higher priority than the environment variable NMPI_IB_VH_MEMCPY_SEND.
auto:
Use sender side VH memory copy for InfiniBand communication through Root Complex.
(default for Intel machines)
on:
Use sender side VH memory copy for InfiniBand communication (independent on Root Complex).
(default for non-Intel machines)
off:
Don't use sender side VH memory copy for InfiniBand communication.
-ib_vh_memcpy_recv <auto | on | off>
Use VH memory copy on the receiver side of a VE process for InfiniBand communication. This option has higher priority than the environment variable NMPI_IB_VH_MEMCPY_RECV.
auto:
Use receiver side VH memory copy for InfiniBand communication through Root Complex.
on:
Use receiver side VH memory copy for InfiniBand communication (independent on Root Complex).
(default for non-Intel machines)
off:
Don't use receiver side VH memory copy for InfiniBand communication.
(default for Intel machines)
-dma_vh_memcpy <auto | on | off> Use VH memory copy for a communication between VEs in VH. This option has higher priority than the environment variable NMPI_DMA_VH_MEMCPY.
auto:
Use VH memory copy for a communication between VEs in VH through Root Complex.
(default)
on:
Use VH memory copy for a communication between VEs in VH.
(independent on Root Complex).
off:
Don't use VH memory copy for a communication between VEs in VH.
-vh_memcpy <auto | on | off> Use VH memory copy for the InfiniBand communication and the communication between VEs in VH. This option has higher priority than the environment variable NMPI_VH_MEMCPY.
auto:
In the case of InfiniBand communication, sender side VH memcpy is used if the communication goes through Root Complex. In the case of a communication between VEs in VH, VH memory copy is used if the communication goes through Root Complex.
on:
VH memory copy is used.
off:
VH memory copy is not used.
Note:
The option -ib_vh_memcpy_send, -ib_vh_memcpy_recv and -dma_vh_memcpy are higher priority than this option.
-vh_thread_yield <0 | 1 | 2> Control the waiting method for a VH process.
0:
Do the busy wait.
(default)
1:
Do the sched_yield().
2:
Do the sleep. It is implemented by pselect().
-vh_spin_count <spin count value> Control the spin count value for a VH process. The value must be greater than 0.
-vh_thread_sleep <sleep timeout value> Control the sleep microseconds timeout for a VH process.
-pin_mode <consec | spread | consec_rev | spread_rev | scatter | no | none | off>
Specify the method with which the affinity of MPI processes on a VH or scalar host is controlled. (See the sketch after this table.)
consec | spread :
Assign the next free cpu ids to MPI processes. Assignment of cpu ids starts with cpu id 0.
consec_rev | spread_rev:
Assign the next free cpu ids (in reverse order) to MPI processes. Assignment of cpu ids starts with the highest cpu id.
scatter:
Look for a maximal distance to already assigned cpu ids and assign the next free cpu ids to MPI processes.
none | off | no :
No pinning of MPI processes to cpu id's. The default pinning mode is 'none'.
Note:
(*) Specifying flag "-pin_mode" disables preceding "-cpu_list".
(*) If the number of free cpu ids is not sufficient to assign cpu ids, NO cpu id is assigned to the MPI process.
-pin_reserve <num-reserved-ids>[H|h]
Specify the number of cpu ids to be reserved per MPI process on a VH or scalar host for the pinning method specified with the flag "-pin_mode". If the optional 'h' or 'H' is added to the number, the cpu ids of associated Hyperthreads are also utilized if available.
The number of reserved ids must be greater than 0.
The default number is 1.
-cpu_list | -pin_cpu <first-id>[-<last-id>[-<increment>[-<num-reserved-ids>[H|h]]]][,...]
Specify a comma-separated list of cpu ids for the processes to be created. <first-id> specifies the cpu id which is assigned to the first MPI process on the node. Cpu id <first-id + increment> is assigned to the next MPI process and so on. <last-id> specifies the last cpu id which is assigned. <num-reserved-ids> specifies the number of reserved cpu ids per MPI process for a multithreaded application. If the optional 'h' or 'H' is added to <num-reserved-ids>, the cpu ids of Hyperthreads are also utilized if available.
Default values if not specified:
<last-id> = <first-id>
<increment> = 1
<num-reserved-ids> = 1
Note:
(*) Specifying flag "-cpu_list" disables preceding "-pin_mode".
(*) If the number of free cpu ids is not sufficient to assign <num-reserved-ids> cpu ids, NO cpu id is assigned to the MPI process.
-veo
Specify that MPI processes use VEO features.
-cuda
Specify that MPI processes use CUDA features.
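As a hypothetical illustration of the VH pinning options above (host1 and vh.out are placeholder names), the following launches two VH processes, assigning consecutive cpu ids and reserving four cpu ids per process, with Hyperthread cpu ids used if available:
$ mpirun -vh -host host1 -pin_mode consec -pin_reserve 4h -np 2 ./vh.out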
- When all of the options -hosts, -hostfile, -f, -host, -node, and -nn are omitted in the NQSV request execution, all the hosts allocated by NQSV are used.
- In the PBS request execution, -machinefile, -machine, -hosts, -hostfile, -f, -gvenode, -perhost, -ppn, -N, -npernode, -nnp, -ve, -nve, -venode, -ve_nnp, -nnp_ve, -vennp, -host, -node, -nn are not available.
- The precedence of the options -hosts, -hostfile, -f, -host, -node, and -nn is
-hosts, -hostfile, -f, -host > -nn > -node.
- The following local options have higher priority than the following global options.
Local options : -env, -envall, -envlist, -envnone, -path, -umask, -wdir
Global options : -genv, -genvall, -genvlist, -genvnone, -gpath, -gumask, -gwdir
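A sketch of the precedence rule above, assuming ve.out is the executable: the global -genv value applies to the first set of processes, while the local -env value overrides it for the second set.
$ mpirun -genv OMP_NUM_THREADS 4 -np 4 ./ve.out : -env OMP_NUM_THREADS 8 -np 4 ./ve.out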
3.2.3   Specification of Hosts
Hosts corresponding to
MPI executable files are determined according to the specified runtime options as follows:
A host indicates a VH in this case. VHs are specified as shown in the following table.
Table 3-5 Specification of VHs

Interactive execution (Format: VH name)
- The hostname of a VH, which is a host computer.
NQSV request execution (Format: <first>[-<last>])
- <first> is the first logical VH number and <last> the last.
- To specify one VH, omit -<last>.
In particular, specify only <first> in the options -hosts, -hostfile, -f and -host.
- <last> must not be smaller than <first>.
A host indicates a VE in this case. VEs are specified as shown in
the following table.
Please note that the -ve option cannot be specified for the MPI
executable file for which the -venode option is specified.
Table 3-6 Specification of VEs (an example follows the table)

Interactive execution (Format: <first>[-<last>][@<VH>])
- <first> is the first VE number and <last> the last.
- <VH> is a VH name. When omitted, the VH on which the MPI execution command has been executed is selected.
- To specify one VE, omit -<last>.
In particular, specify only <first> in the options -hosts, -hostfile, -f and -host.
- <last> must not be smaller than <first>.
NQSV request execution (Format: <first>[-<last>][@<VH>])
- <first> is the first logical VE number and <last> the last.
- <VH> is a logical VH number. When omitted, hosts (VEs) are selected from the ones NQSV allocated.
- To specify one VE, omit -<last>.
In particular, specify only <first> in the options -hosts, -hostfile, -f and -host.
- <last> must not be smaller than <first>.
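A sketch of specifying VEs as hosts in an interactive execution, assuming ve.out is the executable: with -gvenode, the listed hosts 0 and 1 indicate VE numbers on the VH where the command is executed (a VH name can be appended as @<VH> to select another VH).
$ mpirun -gvenode -hosts 0,1 -np 8 ./ve.out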
3.2.4   Environment Variables
The following table shows the environment variables whose values users can set. The names of environment variables in NEC MPI start with NMPI_, and some of them also provide alternative names that start with MPI. Additionally, the behavior and output of MPI runtime performance information may vary depending on the environment variables unrelated to NEC MPI that start with VE_ and are described in the table, such as VE_PROGINF_USE_SIGNAL and VE_PERF_MODE. Environment variables that start with NMPI_ can be referred to in the help of the mpirun and mpiexec commands.
Environment Variable | Available Value | Meaning |
---|---|---|
NMPI_COMMINF | Control the display of MPI communication information. To use MPI communication information facility, you need to generate MPI program with the option -mpiprof, -mpitrace, -mpiverify or -ftrace. Please refer to this section for MPI communication facility. | NO | (Default) Not display the communication information. | YES | Display the communication information in the reduced format. | ALL | Display the communication information in the extended format. |
MPICOMMINF | The same as the environment variable NMPI_COMMINF | The same as the environment variable NMPI_COMMINF. If both are specified, the environment variable NMPI_COMMINF takes precedence. |
NMPI_COMMINF_VIEW | Specify the display format of the aggregated portion of MPI communication information. | VERTICAL | (Default) Aggregate vector processes and scalar processes separately and display them vertically. | HORIZONTAL | Aggregate vector processes and scalar processes separately and display them horizontally. | MERGED | Aggregate and display vector processes and scalar processes. |
NMPI_PROGINF | Control the display of runtime performance information of MPI program. Please refer to this section for runtime performance information of MPI program. | NO | (Default) Not display the performance information. | YES | Display the performance information in the reduced format. | ALL | Display the performance information in the extended format. | DETAIL | Display the detailed performance information in the reduced format. | ALL_DETAIL | Display the detailed performance information in the extended format. |
MPIPROGINF | The same as the environment variable NMPI_PROGINF | The same as the environment variable NMPI_PROGINF. If both are specified, the environment variable NMPI_PROGINF takes precedence. |
NMPI_PROGINF_VIEW | Specify the display format of the VE-related aggregated portion of the runtime performance information of the MPI program. | VE_SPLIT | Aggregate processes executed on VE30 and processes executed on VE10/VE10E/VE20 separately and display them. | VE_MERGED | (Default) Aggregate all processes executed on VE together as vector processes and display them. |
NMPI_PROGINF_COMPAT | 0 | (Default) The runtime performance information of MPI program is displayed in the latest format. | 1 | The runtime performance information of MPI program is displayed in old format. In this format, performance item "Non Swappable Memory Size Used", VE Card Data section and location information of VE where the MPI process is executed are not displayed. |
VE_PROGINF_USE_SIGNAL | YES | (Default) Signals are used for collecting performance information. | NO | Signals are not used for collecting performance information. See this section before using this option. |
VE_PERF_MODE | Control the HW performance counter set. MPI performance information outputs items corresponding to selected counters. | |
VECTOR-OP | (Default) Select the set of HW performance counters related to vector operation mainly. | |
VECTOR-MEM | Select the set of HW performance counters related to vector and memory access mainly. | |
NMPI_EXPORT | "<string>" | Space-separated list of the environment variables to be passed to MPI processes. |
MPIEXPORT | The same as the environment variable NMPI_EXPORT | The same as the environment variable NMPI_EXPORT. If both are specified, the environment variable NMPI_EXPORT takes precedence. |
NMPI_SEPSELECT | To enable this environment variable, the shell script mpisep.sh must also be used. Please refer to this section for details. | 1 | The standard output from each MPI process is saved in a separate file. | 2 | (Default) The standard error output from each MPI process is saved in a separate file. | 3 | The standard output and standard error output from each MPI process are saved in respective separate files. | 4 | The standard output and standard error output from each MPI process are saved in one separate file. |
MPISEPSELECT | The same as the environment variable NMPI_SEPSELECT | The same as the environment variable NMPI_SEPSELECT. If both are specified, the environment variable NMPI_SEPSELECT takes precedence. |
NMPI_VERIFY | Control error detection of the debug assist feature for MPI collective procedures. To use the feature for MPI collective procedures, you need to generate MPI program with the option -mpiverify. Please refer to this content for the feature. | 0 | Errors in invocations of MPI collective procedures are not detected. | 3 | (Default) Errors other than those in the argument assert of the procedure MPI_WIN_FENCE are detected. | 4 | Errors in the argument assert of the procedure MPI_WIN_FENCE are detected, in addition to the default errors. |
NMPI_VE_TRACEBACK | Controls format of traceback output by the VE MPI. | |
ON | Output traceback in the same format as NEC compiler when the environment variable VE_TRACEBACK is set to VERBOSE. | |
OFF | Output traceback in the same format as backtrace_symbols. (default) | |
NMPI_TRACEBACK_DEPTH | <integer> | Controls the maximum depth of traceback output by MPI. (default:50) 0 has special meaning: The maximum depth is unlimited in the case of VE MPI. The maximum depth is at least 50 in the case of VH MPI. |
NMPI_OUTPUT_COLLECT | Controls the output of MPI programs when the NEC MPI process manager in the queue settings is hydra when executing NQSV batch jobs. | |
ON | The output of the MPI program is set as the standard output and standard error output of the MPI execution command. This setting takes precedence over qsub -f. | |
OFF | The output of the MPI program is output for each logical node as in the case of mpd.(default) | |
NMPI_BLOCKLEN0 | OFF | (Default) Blocks with blocklength 0 are not included in the calculation of the values of the lower bound and upper bound of a datatype created by MPI procedures that create derived datatypes and have the argument blocklength. | ON | Blocks with blocklength 0 are also included in the calculation of the values of the lower bound and upper bound of a datatype created by MPI procedures that create derived datatypes and have the argument blocklength. |
MPIBLOCKLEN0 | The same as the environment variable NMPI_BLOCKLEN0 | The same as the environment variable NMPI_BLOCKLEN0. If both are specified, the environment variable NMPI_BLOCKLEN0 takes precedence. |
NMPI_COLLORDER | OFF | (Default)
|
ON | Canonical order, bracketing independent of process distribution, dependent only on the number of processes. | |
MPICOLLORDER | The same as the environment variable NMPI_COLLORDER | The same as the environment variable NMPI_COLLORDER. If both are specified, the environment variable NMPI_COLLORDER takes precedence. |
NMPI_PORT_RANGE |
|
The range of port numbers NEC MPI uses to accept TCP/IP
connections. The default value is 25257:25266. |
NMPI_INTERVAL_CONNECT |
|
Retry interval in seconds for establishing connections among
MPI daemons at the beginning of execution of MPI programs. The default value is 1. |
NMPI_RETRY_CONNECT |
|
The number of retries for establishing connections among
MPI daemons at the beginning of execution of MPI programs. The default value is 2. |
NMPI_LAUNCHER_EXEC |
|
Full path name of the remote shell that launches
MPI daemons. The default value is /usr/bin/ssh. This environment variable is available only in the interactive execution. |
NMPI_IB_ADAPTER_NAME |
|
Comma-or-Space separated list of InfiniBand adaptor names
NEC MPI uses. This environment variable is available only in the interactive execution. When omitted, NEC MPI automatically selects the optimal ones. |
NMPI_IB_DEFAULT_PKEY |
|
Partition key for InfiniBand Communication. The default value is 0. |
NMPI_IB_FAST_PATH | ON |
Use the InfiniBand RDMA fast path feature to transfer eager messages. (Default on Intel machines) Don't set this value if InfiniBand HCA Relaxed Ordering or Adaptive Routing is enabled. |
MTU |
MTU limits the message size of fast path feature to actual OFED mtu size. Don't set this value if InfiniBand HCA Relaxed Ordering is enabled. |
OFF |
Don't use the InfiniBand RDMA fast path feature. (Default on Non-Intel machines) |
NMPI_IB_VBUF_TOTAL_SIZE |
|
Size of each InfiniBand communication buffer in bytes. The default value is 12248. |
NMPI_IB_VH_MEMCPY_SEND | AUTO | Use sender side VH memory copy for InfiniBand communication
through Root Complex. (default for Intel machines) |
ON | Use sender side VH memory copy for InfiniBand communication
(independent on Root Complex). (default for non-Intel machines) |
OFF | Don't use sender side VH memory copy for InfiniBand communication. |
NMPI_IB_VH_MEMCPY_RECV | AUTO | Use receiver side VH memory copy for InfiniBand communication
through Root Complex. |
ON | Use receiver side VH memory copy for InfiniBand communication
(independent on Root Complex). (default for non-Intel machines) |
OFF | Don't use receiver side VH memory copy for InfiniBand communication. (default for Intel machines) |
NMPI_DMA_VH_MEMCPY | AUTO | Use VH memory copy for a communication between VEs in VH through Root Complex. (Default) |
ON | Use VH memory copy for a communication between VEs in VH. |
OFF | Don't use VH memory copy for a communication between VEs in VH. |
NMPI_VH_MEMCPY | AUTO | In the case of InfiniBand communication,
sender side VH memcpy is used
if the communication goes through Root Complex.
In the case of a communication between VEs in VH,
VH memory copy is used
if the communication goes through Root Complex. |
ON | VH memory copy is used. | OFF | VH memory copy is not used. |
Note: NMPI_IB_VH_MEMCPY_SEND, NMPI_IB_VH_MEMCPY_RECV, NMPI_DMA_VH_MEMCPY are higher priority than this environment variable. |
NMPI_DMA_RNDV_OVERLAP | ON | In the case of DMA communication, the communication and calculation can overlap when the buffer is contiguous, its transfer length is 200KB or more, and non-blocking point-to-point communication is selected. | OFF | (Default) In the case of DMA communication, the communication and calculation does not overlap even when the transfer length is 200KB or more and non-blocking point-to-point communication is selected. |
Note: Setting NMPI_DMA_RNDV_OVERLAP to ON disables the usage of VH memory copy. In this case, the values of environment variables NMPI_DMA_VH_MEMCPY is ignored. |
NMPI_IB_VH_MEMCPY_THRESHOLD |
|
Minimal message size to transfer an InfiniBand message to/from VE processes via VH memory. Smaller messages are sent directly without a copy to/from VH memory. The message size is given in bytes and must be greater than or equal to 0. The default value is 1048576. This value corresponds to the following item output by specifying the runtime option "-v": "Threshold" of "IB Parameters for message transfer via VH memory" |
NMPI_IB_VH_MEMCPY_BUFFER_SIZE |
|
Maximal size of a buffer located in VH memory to transfer (parts of) an InfiniBand message to/from VE processes. The buffer size is given in bytes and must be at least 8192 bytes. The default value is 1048576. This value corresponds to the following item output by specifying the runtime option "-v": "Buffer size" of "IB Parameters for message transfer via VH memory" |
NMPI_IB_VH_MEMCPY_SPLIT_THRESHOLD |
|
Minimal message size to split the transfer of InfiniBand messages to/from VE processes via VH memory. The messages are split into nearly equal parts in order to increase the transfer bandwidth. The message size is given in bytes and must be greater than or equal to 0. The default value is 1048576. This value corresponds to the following item output by specifying the runtime option "-v": "Split threshold" of "IB Parameters for message transfer via VH memory" |
NMPI_IB_VH_MEMCPY_SPLIT_NUM |
|
Maximal number of parts used to transfer InfiniBand messages to/from VE processes using VH memory. The number must be in the range [1:8]. The default value is 2. This value corresponds to the following item output by specifying the runtime option "-v": "Split number" of "IB Parameters for message transfer via VH memory" |
NMPI_IP_USAGE | TCP/IP usage if fast InfiniBand interconnect is not available on an InfiniBand system(for example, if InfiniBand ports are down or no HCA was assigned to a job). | |
ON | FALLBACK | Use TCP/IP as fallback for fast InfiniBand interconnect. | |
OFF | (Default) Terminate application if InfiniBand interconnect is not available on a InfiniBand system. | |
NMPI_EXEC_MODE | NECMPI | (Default) Work with NECMPI runtime option. | INTELMPI | Work with IntelMPI's basic runtime options (see below). | OPENMPI | Work with OPENMPI's basic runtime options (see below). | MPICH | Work with MPICH's basic runtime options (see below). | MPISX | Work with MPISX's runtime options. |
NMPI_SHARP_ENABLE | ON | To use SHARP |
OFF | Not to use SHARP (default) | |
NMPI_SHARP_NODES | <integer> | The minimal number of VE nodes to use SHARP if SHARP usage is enabled. (default: 4) |
NMPI_SHARP_ALLREDUCE_MAX | <integer> | Maximal data size (in bytes) in MPI_Allreduce for which the SHARP API used. (Default: 64) |
UNLIMITED | SHARP is always used. | |
NMPI_SHARP_REPORT | ON | Report on MPI Communicators using SHARP collective support. |
OFF | No report. (default) | |
NMPI_DCT_ENABLE |
Control the usage of InfiniBand DCT (Dynamically Connected Transport Service).
Using DCT, the memory usage for InfiniBand communication is reduced. (Note: DCT may affect the performance of InfiniBand communication) |
|
AUTOMATIC | DCT is used if the number of MPI processes is equal or greater than the number specified by NMPI_DCT_SELECT_NP environment variable. (default) | |
ON | DCT is always used. | |
OFF | DCT is not used. | |
NMPI_DCT_SELECT_NP | <integer> | The minimal number of MPI processes that DCT is used if the environment variable NMPI_DCT_ENABLE is set to AUTOMATIC. The default value is automatically decided by the number of cores in one VE and the number of VEs in one VH. (up to 2049) |
NMPI_DCT_NUM_CONNS | <integer> | The number of requested DCT connections. (default: 4) |
NMPI_COMM_PNODE | Control the automatic selection of communication type between logical nodes in the execution under NQSV. | |
OFF | Select the communication type automatically based on the logical node (default). | |
ON | Select the communication type automatically based on the physical node. | |
NMPI_EXEC_LNODE |
Control the logical node execution in the interactive execution. In the logical node execution,
the communication is selected automatically based on the logical node.
The format of the specified logical node is "hostname/string". The following example shows how to execute a program on 3 logical nodes using 1 physical node. $ mpirun -host HOST1 -ve 0 -host HOST1/A -ve 1 -host HOST1/B -ve 2 ve.out |
|
OFF | The logical node execution in interactive execution is not used (default). | |
ON | The logical node execution in interactive execution is used. | |
NMPI_LNODEON MPILNODEON |
The same as the environment variable NMPI_EXEC_LNODE. If both are specified, the environment variable NMPI_EXEC_LNODE takes precedence. | |
1 | The logical node execution in interactive execution is used. | |
NMPI_VH_MEMORY_USAGE | VH memory usage required for MPI application execution. | |
ON | (Default) VH Memory is required. If VH memory is requested and not available, the MPI application is aborted. | |
OFF | FALLBACK | If VH Memory is requested and not available, a possibly slower communication path is used. | |
NMPI_CUDA_ENABLE | Control the usage of CUDA memory transfer. | |
AUTO | CUDA memory transfer is used if it is available(default). | |
OFF | CUDA memory transfer is not used. | |
ON | CUDA memory transfer is used. The MPI application is aborted when CUDA memory transfer is not available. | |
NMPI_GDRCOPY_ENABLE | Control whether GDRCopy is used for data transfer between GPU and VH in a node. | |
AUTOMATIC | Transfer data using GDRCopy if it is available(default). | |
ON | Transfer data using GDRCopy. If GDRCopy is not available, MPI application will be aborted. | |
OFF | Do not transfer data using GDRCopy. | |
NMPI_GDRCOPY_LIB |
|
Path to the GDRCopy dynamic library. |
NMPI_GDRCOPY_FROM_DEVICE_LIMIT |
|
Maximal transfer size for usage of GDRCopy from GPU memory to Host memory. The default value is 8192. |
NMPI_GDRCOPY_TO_DEVICE_LIMIT |
|
Maximal transfer size for usage of GDRCopy from Host memory to GPU memory. The default value is 8192. |
NMPI_VE_AIO_METHOD | Controls asynchronous I/O method used by non-blocking MPI-IO procedures of VE MPI programs. | |
VEAIO | Use VE AIO (default) | |
POSIX | Use POSIX AIO | |
NMPI_SWAP_ON_HOLD | When the Partial Process Swapping function is used in order to suspend a regular request in an NQSV job, this controls the release of the Non Swappable Memory used by an MPI process. The default value depends on the system settings. You can check it in the "Swappable on hold" item which is displayed when you specify the runtime option -v. | |
ON | A part of the Non Swappable Memory used by the MPI process is released. | |
OFF | A Non Swappable Memory used by the MPI process is not released. | |
NMPI_AVEO_UDMA_ENABLE | Control the AVEO UserDMA feature. | |
ON | Enable the AVEO UserDMA feature (default) | |
OFF | Disable the AVEO UserDMA feature | |
NMPI_USE_COMMAND_SEARCH_PATH |
Controls whether the PATH environment variable is used
in order to search for the executable file specified in the MPI execution command. (*) If you specify a file path that includes path separators instead of the file name, it will not be affected by this environment variable. |
|
ON | Use the PATH environment variable. The file is searched for in order from the beginning of the directory specified in the PATH environment variable. |
|
OFF | Do not use the PATH environment variable. The file is searched for only in the current working directory. (default) |
|
NMPI_OUTPUT_RUNTIMEINFO | Control the frequency of output for runtime information displayed by -v option of MPI execution command when executing the NQSV batch job in the queue that mpd is selected as NEC MPI Process Manager. | |
ON | Output runtime information every runtime of MPI execution command with -v option in the job script.(default) | |
OFF | Output runtime information only at the runtime of the first MPI execution command even if there are many MPI execution commands with -v option in the job script. | |
NMPI_IB_CONNECT_IN_INIT | Control timing of establishing InfiniBand connection. The default can be changed to AUTO by specifying "ib_connect_in_init auto" in /etc/opt/nec/ve/mpi/necmpi.conf . | |
ON | All of connections are established in MPI_Init(). The performance of first collective communication may be improved. | |
OFF | Each of connections are established when first communication is issued. (default) | |
AUTO | This feature is enabled when the number of processes is 4096 or more. | |
NMPI_VH_THREAD_YIELD | Control the waiting method for a VH process. | |
0 | Do the busy wait.(default) | |
1 | Do the sched_yield(). | |
2 | Do the sleep. It is implemented by pselect(). | |
NMPI_VH_SPIN_COUNT | <integer> | Control the spin count value for a VH process. The value must be greater than 0. (Default: 100) |
NMPI_VH_THREAD_SLEEP | <integer> | Control the sleep microseconds timeout for a VH process. (Default: 100) |
NMPI_IB_MEDIUM_BUFFERING | Use buffering for reducing Non Swappable Memory when the transfer is issued over InfiniBand, and the transfer size is equal to or larger than NMPI_IB_VBUF_TOTAL_SIZE and less than NMPI_IB_VH_MEMCPY_THRESHOLD. | |
AUTO | Buffering is used when NMPI_SWAP_ON_HOLD=ON.(default) | |
ON | Buffering is used. | |
OFF | Buffering is not used. | |
NMPI_ALLOC_MEM_LOCAL | Controls whether local memory is allocated in the MPI procedures MPI_Alloc_mem,MPI_Win_allocate,
and MPI_Win_allocate_shared (only with a single process). Note: Local memory is not available for some high performance communications such as direct transfers of RMA. Global memory is Non Swappable Memory during Switch Over. |
|
ON | Local memory is allocated. | |
OFF | Global memory is allocated. (default) | |
NMPI_IB_GPUDIRECT_ENABLE | Controls the GPUDirect RDMA feature. | |
AUTO | Enable the GPUDirect RDMA feature if GPU and InfiniBand HCA are connected under the same PCIe Root Port. (default) | |
ON | Enable the GPUDirect RDMA feature regardless of the PCIe topology. | |
OFF | Disable the GPUDirect RDMA feature. | |
NMPI_GDRCOPY_GPUDIRECT_THRESHOLD | <integer> | The threshold transfer size to change GDRCopy to GPUDirect RDMA. (Default: 128) |
NMPI_VE_USE_256MB_MEM | Controls the usage of the memory managed in units of 256 MB. Note: This affects only MPI processes executed on VE |
|
ON | Use the memory managed in unit of 256MB | |
OFF | Don't use the memory managed in unit of 256MB | |
AUTO | (Default) The respective processes have the different value as follows:
|
|
NMPI_VE_ALLOC_MEM_BACKEND | Specify the function used by the memory management of MPI_Alloc_mem when it allocates memory. Note: This affects only MPI processes executed on VE |
|
MALLOC | Use malloc and variants | |
MMAP | Use mmap and variants | |
AUTO | (Default) The respective processes have the different value as follows:
|
|
NMPI_IB_RNDV_PROTOCOL | Specifies the type of InfiniBand Transfer that NEC MPI mainly uses for MPI communication. | |
PUT or RPUT | RDMA-WRITE is mainly used. | |
GET or RGET | RDMA-READ is mainly used. | |
AUTO | (Default) Either RDMA-WRITE or RDMA-READ is automatically selected according to the system configuration, distribution and layout of MPI processes in the program execution. | |
NMPI_IB_RMA_PUT_PROTOCOL | Specify the transfer type of InfiniBand communication with MPI_Put and MPI_Rput procedures. | |
RDMA | (Default) RDMA-WRITE is used if possible. | |
PT2PT | Point-to-Point communication is used. | |
PT2PT_DCT or PT2PT4DCT | Point-to-Point communication is used if a DCT connection between both processes is active. | |
REMOTE_GET | RDMA-READ is used if possible. | |
NMPI_IB_RMA_PUT_THRESHOLD | Specify the minimal transfer size at which the transfer type selected by the environment variable NMPI_IB_RMA_PUT_PROTOCOL is used in the MPI_Put and MPI_Rput procedures. When the transfer size is smaller than the value of NMPI_IB_RMA_PUT_THRESHOLD, RDMA is used. (Default: 0) |
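As a usage sketch, the performance and communication information facilities described above can be enabled before launching a program built with the -mpiprof option (ve.out is a placeholder executable):
$ export NMPI_PROGINF=YES
$ export NMPI_COMMINF=ALL
$ mpirun -np 8 ./ve.out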
3.2.5   Environment Variables for MPI Process Identification
NEC MPI provides the following environment variables, the values of which are automatically set by NEC MPI, for MPI process identification.
Environment Variable | Value |
---|---|
MPIUNIVERSE | The identification number of the predefined communication universe at the beginning of program execution corresponding to the communicator MPI_COMM_WORLD. |
MPIRANK | The rank of the executing process in the communicator MPI_COMM_WORLD at the beginning of program execution. |
MPISIZE | The total number of processes in the communicator MPI_COMM_WORLD at the beginning of program execution. |
MPINODEID | The logical node number of node where the MPI process is running. |
MPIVEID | The VE node number of VE where the MPI process is running. If the execution is under NQSV, this shows logical VE node number. If the MPI process is not running on VE, this variable is not set. |
NMPI_LOCAL_RANK | The relative rank of MPI process in MPI_COMM_WORLD on this node. |
NMPI_LOCAL_RANK_VHVE | The relative rank of MPI process in MPI_COMM_WORLD on host CPUs or VE cards of this node. |
NMPI_LOCAL_RANK_DEVICE | The relative rank of MPI process in MPI_COMM_WORLD on host CPUs or a VE card of this node. |
These environment variables can be referenced whenever MPI programs are running including before the invocation of the procedure MPI_INIT or MPI_INIT_THREAD.
When an MPI program is initiated, there is a predefined communication universe that includes all MPI processes and corresponds to a communicator MPI_COMM_WORLD. The predefined communication universe is assigned the identification number 0.
In a communication universe, each process is assigned an unique integer value called rank, which is in the range zero to one less than the number of processes.
If the dynamic process creation facility is used and a set of MPI processes
is dynamically created, a new communication universe corresponding to a new
communicator MPI_COMM_WORLD is created.
Communication universes created at runtime are assigned consecutive integer identification numbers starting at 1.
In such a case, two or more MPI_COMM_WORLDs can exist at the same time
in a single MPI application.
Therefore, an MPI process can be identified
using a pair of values of MPIUNIVERSE and MPIRANK.
In the SX-Aurora TSUBASA system, MPI processes run on host CPUs or VE cards that are components of a node. With the environment variables MPINODEID, MPIVEID, NMPI_LOCAL_RANK, NMPI_LOCAL_RANK_VHVE and NMPI_LOCAL_RANK_DEVICE, you can obtain the location where an MPI process runs, and the number that uniquely identifies the MPI process within the MPI process group of a node, of the CPU side, of the VE side, or of a single VE. The environment variable MPIRANK indicates the unique number within the MPI process group of MPI_COMM_WORLD, whereas the environment variables NMPI_LOCAL_RANK, NMPI_LOCAL_RANK_VHVE and NMPI_LOCAL_RANK_DEVICE indicate the unique number within the smaller groups into which MPI_COMM_WORLD is split, as follows.
mpirun \
  -host hostA -vh -np 2 ./vh.out : \
  -host hostA -ve 0-1 -np 4 ./ve.out : \
  -host hostB -ve 0-1 -np 6 ./ve.out : \
  -host hostB -vh -np 2 ./vh.out

MPIRANK                 0 1 2 3 4 5 6 7 8 9 10 11 12 13
NMPI_LOCAL_RANK         0 1 2 3 4 5 0 1 2 3 4  5  6  7
NMPI_LOCAL_RANK_VHVE    0 1 0 1 2 3 0 1 2 3 4  5  0  1
NMPI_LOCAL_RANK_DEVICE  0 1 0 1 0 1 0 1 2 0 1  2  0  1
MPINODEID               0 0 0 0 0 0 1 1 1 1 1  1  1  1
MPIVEID                 - - 0 0 1 1 0 0 0 1 1  1  -  -
If an MPI program is indirectly initiated with a shell script, these environment variables can also be referenced in the shell script and be used, for example, to allow different MPI processes to handle mutually different files. The following shell script makes each MPI process read data from a different input file and write data to a different output file; it is executed as shown below.
#!/bin/sh
INFILE=infile.$MPIUNIVERSE.$MPIRANK
OUTFILE=outfile.$MPIUNIVERSE.$MPIRANK
{MPIexec} < $INFILE > $OUTFILE    # Refer to this clause for {MPIexec}, the MPI-execution specification
exit $?
$ mpirun -np 8 /execdir/mpi.shell
3.2.6   Environment Variables for Other Processors
The environment variables supported by other processors such as the Fortran compiler (nfort), C compiler (ncc), or C++ compiler (nc++) are passed to MPI processes because the runtime option -genvall is enabled by default. In the following example, OMP_NUM_THREADS and VE_LD_LIBRARY_PATH are passed to MPI processes.
#!/bin/sh
#PBS -T necmpi
#PBS -b 2
OMP_NUM_THREADS=8 ; export OMP_NUM_THREADS
VE_LD_LIBRARY_PATH={your shared library path} ; export VE_LD_LIBRARY_PATH
mpirun -node 0-1 -np 2 a.out
3.2.7   Rank Assignment
Ranks are assigned to MPI processes in ascending order, following the order in which NEC MPI assigns them to hosts.
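For illustration (host1, host2, and ve.out are placeholder names), an assumed example of the resulting assignment:
$ mpirun -host host1 -np 4 -host host2 -np 4 ./ve.out
# ranks 0-3 are assigned to the processes on host1, ranks 4-7 to those on host2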
3.2.8   The Working Directory
The working directory is determined as follows:
3.2.9   Execution with the apptainer container
You can execute MPI programs in the apptainer, formerly known as singularity, container.
As in the following example, the apptainer command is specified as an argument of the mpirun command.
$ mpirun -ve 0 -np 8 /usr/bin/apptainer exec --bind /var/opt/nec/ve/veos ./nmpi.sif ./ve.out
3.2.10   Execution Examples
The following examples show how to launch MPI programs on the
SX-Aurora TSUBASA.
$ mpirun -ve 3 -np 4 ./ve.out
$ mpirun -ve 0-7 -np 16 ./ve.out
$ mpirun -hosts host1,host2 -ve 0-1 -np 32 ./ve.out
$ mpirun -host host1 -ve 0-1 -np 16 -host host2 -ve 2-3 -np 16 ./ve.out
$ mpirun -vh -host host1 -np 8 vh.out : -host host1 -ve 0-1 -np 16 ./ve.out
Under NQSV, the assignment of MPI processes to VEs and VHs is performed automatically, and users specify only their logical numbers.
The following examples show the contents of job script files for batch jobs; the commands in the scripts can also be used in interactive jobs.
#PBS -T necmpi
#PBS -b 2
#PBS --venum-lhost=4     # Number of VEs
source /opt/nec/ve/mpi/3.2.0/bin/necmpivars.sh
(For VE30: source /opt/nec/ve3/mpi/3.2.0/bin/necmpivars.sh)
mpirun -host 0 -ve 0-3 -np 32 ./ve.out

#PBS -T necmpi
#PBS -b 4                # Number of VHs
#PBS --venum-lhost=8     # Number of VEs
#PBS --use-hca=1         # Number of HCAs
source /opt/nec/ve/mpi/3.2.0/bin/necmpivars.sh
(For VE30: source /opt/nec/ve3/mpi/3.2.0/bin/necmpivars.sh)
mpirun -np 32 ./ve.out

#PBS -T necmpi
#PBS -b 4                # Number of VHs
#PBS --venum-lhost=8     # Number of VEs
#PBS --use-hca=1         # Number of HCAs
source /opt/nec/ve/mpi/3.2.0/bin/necmpivars.sh
(For VE30: source /opt/nec/ve3/mpi/3.2.0/bin/necmpivars.sh)
mpirun -vh -host 0 -np 1 vh.out : -np 32 ./ve.out
This section explains how to run programs for the SX-Aurora TSUBASA under PBS. The description assumes that PBS installed on the system has been configured for the SX-Aurora TSUBASA. Refer to the chapter "Support for NEC SX-Aurora TSUBASA" in "Altair PBS Professional Administrator's Guide" for the configuration. This section illustrates the most basic usage. Refer to the chapter "Submitting Jobs to NEC SX-Aurora TSUBASA" in "Altair PBS Professional User's Guide" for advanced usage.
In job script files, specify the resources you use in PBS directives starting with the prefix "#PBS ", as the following example shows, in which the resources nves and mpiprocs specify the number of VEs and the number of MPI processes, respectively, resulting in the execution of eight MPI processes on four VEs. The PBS directive starting with "-l select" is called a selection directive.
#PBS -l select=nves=4:mpiprocs=8

#PBS -l select=4:nves=1:mpiprocs=2
#!/bin/bash
#PBS -l select=4:nves=1:mpiprocs=8
source /opt/nec/ve/mpi/3.2.0/bin/necmpivars.sh
(For VE30: source /opt/nec/ve3/mpi/3.2.0/bin/necmpivars.sh)
mpirun -np 32 ./ve.out

#!/bin/bash
#PBS -l select=8:nves=1:mpiprocs=2:ompthreads=4
source /opt/nec/ve/mpi/3.2.0/bin/necmpivars.sh
(For VE30: source /opt/nec/ve3/mpi/3.2.0/bin/necmpivars.sh)
mpirun -np 16 ./ve.out

#!/bin/bash
#PBS -l select=ncpus=2:mpiprocs=2+8:nves=1:mpiprocs=4
#PBS -v NEC_PROCESS_DIST=s2+4
source /opt/nec/ve/mpi/3.2.0/bin/necmpivars.sh
(For VE30: source /opt/nec/ve3/mpi/3.2.0/bin/necmpivars.sh)
mpirun -vh -np 2 vh.out : -np 32 ./ve.out
$ mpirun -veo -np 8 ./mpi-veo

$ mpirun -cuda -np 8 ./mpi-cuda
3.3   Standard Output and Standard Error of MPI Programs
To separate output streams from MPI processes, NEC MPI provides the shell script mpisep.sh, which is placed in the path
/opt/nec/ve/bin/.
It is possible to redirect the output streams from MPI processes into separate files in the current working directory by specifying this script before the MPI-execution specification {MPIexec}, as shown in the following example. (Please refer to this clause for the MPI-execution specification {MPIexec}.)
$ mpirun -np 2 /opt/nec/ve/bin/mpisep.sh {MPIexec}
The destinations of output streams can be specified with the environment variable NMPI_SEPSELECT as shown in the following table, in which uuu is the identification number of the predefined communication universe corresponding to the communicator MPI_COMM_WORLD and rrr is the rank of the executing MPI process in the universe.
NMPI_SEPSELECT | Action |
---|---|
1 | Only the stdout stream from each process is put into the separate file stdout.uuu:rrr. |
2 | (Default) Only the stderr stream from each process is put into the separate file stderr.uuu:rrr. |
3 | The stdout and stderr streams from each process are put into the separate files stdout.uuu:rrr and stderr.uuu:rrr, respectively. |
4 | The stdout and stderr streams from each process are put into one separate file std.uuu:rrr. |
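For example, to put both the stdout and stderr streams of each process into one file per process (NMPI_SEPSELECT=4), the following command lines could be used; the executable name ./ve.out is an illustrative assumption, and the variable is passed to the MPI processes because the runtime option -genvall is enabled by default.

$ export NMPI_SEPSELECT=4                             # 4: stdout and stderr into one file std.uuu:rrr per process
$ mpirun -np 2 /opt/nec/ve/bin/mpisep.sh ./ve.out     # ./ve.out is an illustrative program name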
3.4   Runtime Performance of MPI Programs
The performance of MPI programs can be obtained
with the environment variable NMPI_PROGINF.
There are four formats of runtime performance information available in NEC MPI as
follows:
The format of displayed information can be specified by setting the environment variable NMPI_PROGINF at runtime as shown in the following table.
Format | Description |
---|---|
Reduced Format | This format consists of three parts: the Global Data section, in which the maximum, minimum and average performance of all MPI processes is displayed; the Overall Data section, in which the performance of all MPI processes as a whole is displayed; and the VE Card section, in which the maximum, minimum and average performance per VE card is displayed. The results of vector processes and scalar processes are output separately. |
Extended Format | The performance of each MPI process is displayed in the ascending order of their ranks in the communicator MPI_COMM_WORLD, after the information in the reduced format. |
Detailed Reduced Format | This format has the same three parts as the reduced format, but detailed maximum, minimum and average performance of all MPI processes is displayed in the Global Data section. |
Detailed Extended Format | Detailed performance of each MPI process is displayed in the ascending order of their ranks in the communicator MPI_COMM_WORLD, after the information in the detailed reduced format. |
NMPI_PROGINF | Displayed Information |
---|---|
NO | (Default) No Output |
YES | Reduced Format |
ALL | Extended Format |
DETAIL | Detailed Reduced Format |
ALL_DETAIL | Detailed Extended Format |
In addition, when processes run on both VE30 and VE10/VE10E/VE20, the environment variable NMPI_PROGINF_VIEW controls how the vector processes are aggregated, as shown in the following table.

NMPI_PROGINF_VIEW | Displayed Information |
---|---|
VE_SPLIT | Aggregate processes executed on VE30 and processes executed on VE10/VE10E/VE20 separately and display them. |
VE_MERGED | (Default) Aggregate all processes on VE together as vector processes and display them. |
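As a usage sketch, the detailed reduced format could be requested by setting NMPI_PROGINF before the run; the executable name and process count below are illustrative assumptions.

$ export NMPI_PROGINF=DETAIL       # detailed reduced format
$ mpirun -ve 0-1 -np 4 ./ve.out    # ./ve.out is an illustrative program name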
MPI Program Information: ======================== Note: It is measured from MPI_Init till MPI_Finalize. [U,R] specifies the Universe and the Process Rank in the Universe. Times are given in seconds. Global Data of 4 Vector processes : Min [U,R] Max [U,R] Average ================================= Real Time (sec) : 25.203 [0,3] 25.490 [0,2] 25.325 User Time (sec) : 199.534 [0,0] 201.477 [0,2] 200.473 Vector Time (sec) : 42.028 [0,2] 42.221 [0,1] 42.104 Inst. Count : 94658554061 [0,1] 96557454164 [0,2] 95606075636 V. Inst. Count : 11589795409 [0,3] 11593360015 [0,0] 11591613166 V. Element Count : 920130095790 [0,3] 920199971948 [0,0] 920161556564 V. Load Element Count : 306457838070 [0,1] 306470712295 [0,3] 306463228635 FLOP Count : 611061870735 [0,3] 611078144683 [0,0] 611070006844 MOPS : 6116.599 [0,2] 6167.214 [0,0] 6142.469 MOPS (Real) : 48346.004 [0,2] 48891.767 [0,3] 48624.070 MFLOPS : 3032.988 [0,2] 3062.528 [0,0] 3048.186 MFLOPS (Real) : 23972.934 [0,2] 24246.003 [0,3] 24129.581 A. V. Length : 79.372 [0,1] 79.391 [0,3] 79.382 V. Op. Ratio (%) : 93.105 [0,2] 93.249 [0,1] 93.177 L1 Cache Miss (sec) : 3.901 [0,0] 4.044 [0,2] 3.983 CPU Port Conf. (sec) : 3.486 [0,1] 3.486 [0,2] 3.486 V. Arith. Exec. (sec) : 15.628 [0,3] 15.646 [0,1] 15.637 V. Load Exec. (sec) : 23.156 [0,2] 23.294 [0,1] 23.225 VLD LLC Hit Element Ratio (%) : 90.954 [0,2] 90.965 [0,1] 90.959 FMA Element Count : 100000 [0,0] 100000 [0,0] 100000 Power Throttling (sec) : 0.000 [0,0] 0.000 [0,0] 0.000 Thermal Throttling (sec) : 0.000 [0,0] 0.000 [0,0] 0.000 Max Active Threads : 8 [0,0] 8 [0,0] 8 Available CPU Cores : 8 [0,0] 8 [0,0] 8 Average CPU Cores Used : 7.904 [0,2] 7.930 [0,3] 7.916 Memory Size Used (MB) : 1616.000 [0,0] 1616.000 [0,0] 1616.000 Non Swappable Memory Size Used (MB) : 115.000 [0,1] 179.000 [0,0] 131.000 Global Data of 8 Scalar processes : Min [U,R] Max [U,R] Average ================================= Real Time (sec) : 25.001 [0,7] 25.010 [0,8] 25.005 User Time (sec) : 199.916 [0,7] 199.920 [0,8] 199.918 Memory Size Used (MB) : 392.000 [0,7] 392.000 [0,8] 392.000 Overall Data of 4 Vector processes ================================== Real Time (sec) : 25.490 User Time (sec) : 801.893 Vector Time (sec) : 168.418 GOPS : 5.009 GOPS (Real) : 157.578 GFLOPS : 3.048 GFLOPS (Real) : 95.890 Memory Size Used (GB) : 6.313 Non Swappable Memory Size Used (GB) : 0.512 Overall Data of 8 Scalar processes ================================== Real Time (sec) : 25.010 User Time (sec) : 1599.344 Memory Size Used (GB) : 3.063 VE Card Data of 2 VEs ===================== Memory Size Used (MB) Min : 3232.000 [node=0,ve=0] Memory Size Used (MB) Max : 3232.000 [node=0,ve=0] Memory Size Used (MB) Avg : 3232.000 Non Swappable Memory Size Used (MB) Min : 230.000 [node=0,ve=1] Non Swappable Memory Size Used (MB) Max : 294.000 [node=0,ve=0] Non Swappable Memory Size Used (MB) Avg : 262.000 Data of Vector Process [0,0] [node=0,ve=0]: ------------------------------------------- Real Time (sec) : 25.216335 User Time (sec) : 199.533916 Vector Time (sec) : 42.127823 Inst. Count : 94780214417 V. Inst. Count : 11593360015 V. Element Count : 920199971948 V. Load Element Count : 306461345333 FLOP Count : 611078144683 MOPS : 6167.214211 MOPS (Real) : 48800.446081 MFLOPS : 3062.527699 MFLOPS (Real) : 24233.424158 A. V. Length : 79.373018 V. Op. Ratio (%) : 93.239965 L1 Cache Miss (sec) : 3.901453 CPU Port Conf. (sec) : 3.485787 V. Arith. Exec. (sec) : 15.642353 V. Load Exec. 
(sec) : 23.274564 VLD LLC Hit Element Ratio (%) : 90.957228 FMA Element Count : 100000 Power Throttling (sec) : 0.000000 Thermal Throttling (sec) : 0.000000 Max Active Threads : 8 Available CPU Cores : 8 Average CPU Cores Used : 7.912883 Memory Size Used (MB) : 1616.000000 Non Swappable Memory Size Used (MB) : 179.000000 ... |
When the environment variable NMPI_PROGINF_VIEW is set to VE_SPLIT, the reduced sections are aggregated and displayed separately for processes executed on VE30 and for processes executed on VE10/VE10E/VE20.
The following table shows the meanings of the items in the Global Data section and the Process section.
For a vector process, the header of the Process section shows, in addition to the MPI universe number and the MPI rank in MPI_COMM_WORLD, the hostname or logical node number and the logical VE number as the location of the VE on which the MPI process was executed.
(*1) Only these items are output for scalar processes.
(*2) These items are output only in the detailed reduced format or the detailed extended format.
(*3) These items are output only in the detailed reduced format or the detailed extended format in multi-threaded execution.
(*4) This item is output only for processes executed on VE30. It is output in the Global Data section only when all processes in the aggregation range are executed on such VEs.
(*5) The smaller of Max Active Threads and Available CPU Cores is the upper limit of this value.
Item | Unit | Description |
---|---|---|
Real Time (sec) | second | Elapsed time(*1) |
User Time (sec) | second | User CPU time(*1) |
Vector Time (sec) | second | Vector instruction execution time |
Inst. Count | The number of executed instructions | |
V.Inst. Count | The number of executed vector instructions | |
V.Element Count | The number of elements processed with vector instructions | |
V.Load Element Count | The number of vector-loaded elements | |
FLOP Count | The number of elements processed with floating-point operations | |
MOPS | The number of million operations divided by the user CPU time | |
MOPS (Real) | The number of million operations divided by the real time | |
MFLOPS | The number of million floating-point operations divided by the user CPU time | |
MFLOPS (Real) | The number of million floating-point operations divided by the real time | |
A.V.Length | Average Vector Length | |
V.OP.RATIO | percent | Vector operation ratio |
L1 Cache Miss (sec) | second | L1 cache miss time |
CPU Port Conf. | second | CPU port conflict time (*2) |
V. Arith Exec. | second | Vector operation execution time (*2) |
V. Load Exec. | second | Vector load instruction execution time (*2) |
| Ratio of the number of elements loaded from L3 cache to the number of elements loaded with load instructions (*4) | |
VLD LLC Hit Element Ratio (%) | percent | Ratio of the number of elements loaded from LLC to the number of elements loaded with vector load instructions |
FMA Element Count | Number of FMA execution elements (*2) | |
Power Throttling | second | Duration of time the hardware was throttled due to the power consumption (*2) |
Thermal Throttling | second | Duration of time the hardware was throttled due to the temperature (*2) |
Max Active Threads | The maximum number of threads that were active at the same time (*3) | |
Available CPU Cores | The number of CPU cores a process was allowed to use (*3) | |
Average CPU Cores Used | The average number of CPU cores used (*3) (*5) | |
Memory Size Used (MB) | megabyte (using base 1024) | Peak usage of memory(*1) |
Non Swappable Memory Size Used (MB) | megabyte (using base 1024) | Peak usage of memory that cannot be swapped out by Partial Process Swapping function |
The following table shows the meanings of the items in the Overall Data section in the figure above. For scalar processes, only the items marked (*1) are output.
Item | Unit | Description |
---|---|---|
Real Time (sec) | second | The maximum elapsed time of all MPI processes(*1) |
User Time (sec) | second | The sum of the user CPU time of all MPI processes(*1) |
Vector Time (sec) | second | The sum of the vector time of all MPI processes |
GOPS | The total number of giga operations executed on all MPI processes divided by the total user CPU time of all MPI processes | |
GOPS (Real) | The total number of giga operations executed on all MPI processes divided by the maximum real time of all MPI processes | |
GFLOPS | The total number of giga floating-point operations executed on all MPI processes divided by the total user CPU time of all MPI processes | |
GFLOPS (Real) | The total number of giga floating-point operations executed on all MPI processes divided by the maximum real time of all MPI processes | |
Memory Size Used (GB) | gigabyte (using base 1024) | The sum of peak usage of memory of all MPI processes(*1) |
Non Swappable Memory Size Used (GB) | gigabyte (using base 1024) | The sum of peak usage of memory that cannot be swapped out by Partial Process Swapping function of all MPI processes |
The following table shows the meanings of the items in the VE Card Data section.

Item | Unit | Description |
---|---|---|
Memory Size Used (MB) Min | megabyte (using base 1024) | The minimum of peak usage of memory aggregated for each VE card |
Memory Size Used (MB) Max | megabyte (using base 1024) | The maximum of peak usage of memory aggregated for each VE card |
Memory Size Used (MB) Avg | megabyte (using base 1024) | The average of peak usage of memory aggregated for each VE card |
Non Swappable Memory Size Used (MB) Min | megabyte (using base 1024) | The minimum of peak usage of memory that cannot be swapped out by Partial Process Swapping function aggregated for each VE card |
Non Swappable Memory Size Used (MB) Max | megabyte (using base 1024) | The maximum of peak usage of memory that cannot be swapped out by Partial Process Swapping function aggregated for each VE card |
Non Swappable Memory Size Used (MB) Avg | megabyte (using base 1024) | The average of peak usage of memory that cannot be swapped out by Partial Process Swapping function aggregated for each VE card |
Global Data of 16 Vector processes : Min [U,R] Max [U,R] Average ================================== Real Time (sec) : 123.871 [0,12] 123.875 [0,10] 123.873 User Time (sec) : 123.695 [0,0] 123.770 [0,4] 123.753 Vector Time (sec) : 33.675 [0,8] 40.252 [0,14] 36.871 Inst. Count : 94783046343 [0,8] 120981685418 [0,5] 109351879970 V. Inst. Count : 2341570533 [0,8] 3423410840 [0,0] 2479317774 V. Element Count : 487920413405 [0,15] 762755268183 [0,0] 507278230562 V. Load Element Count : 47201569500 [0,8] 69707680610 [0,0] 49406464759 FLOP Count : 277294180692 [0,15] 434459800790 [0,0] 287678800758 MOPS : 5558.515 [0,8] 8301.366 [0,0] 5863.352 MOPS (Real) : 5546.927 [0,8] 8276.103 [0,0] 5850.278 MFLOPS : 2243.220 [0,15] 3518.072 [0,0] 2327.606 MFLOPS (Real) : 2238.588 [0,13] 3507.366 [0,0] 2322.405 A. V. Length : 197.901 [0,5] 222.806 [0,0] 204.169 V. Op. Ratio (%) : 83.423 [0,5] 90.639 [0,0] 85.109 L1 I-Cache Miss (sec) : 4.009 [0,5] 8.310 [0,0] 5.322 L1 O-Cache Miss (sec) : 11.951 [0,5] 17.844 [0,9] 14.826 L2 Cache Miss (sec) : 7.396 [0,5] 15.794 [0,0] 9.872 FMA Element Count : 106583464050 [0,8] 166445323660 [0,0] 110529497704 Required B/F : 2.258 [0,0] 3.150 [0,5] 2.948 Required Store B/F : 0.914 [0,0] 1.292 [0,5] 1.202 Required Load B/F : 1.344 [0,0] 1.866 [0,6] 1.746 Actual V. Load B/F : 0.223 [0,0] 0.349 [0,14] 0.322 Power Throttling (sec) : 0.000 [0,0] 0.000 [0,0] 0.000 Thermal Throttling (sec) : 0.000 [0,0] 0.000 [0,0] 0.000 Memory Size Used (MB) : 598.000 [0,0] 598.000 [0,0] 598.000 Non Swappable Memory Size Used (MB) : 115.000 [0,1] 179.000 [0,0] 131.000 |
When the environment variable VE_PERF_MODE is set to VECTOR-MEM, the MPI performance information contains the following items instead of L1 Cache Miss, CPU Port Conf., V. Arith Exec., V. Load Exec. and VLD LLC Hit Element Ratio, which are output when VE_PERF_MODE is set to VECTOR-OP or is unset.
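As a sketch, this mode could be selected together with the performance display as follows; the executable name and process count are illustrative assumptions, and VE_PERF_MODE is passed to the MPI processes because the runtime option -genvall is enabled by default.

$ export VE_PERF_MODE=VECTOR-MEM    # select the VECTOR-MEM counter mode
$ export NMPI_PROGINF=DETAIL
$ mpirun -np 16 ./ve.out            # ./ve.out is an illustrative program name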
(*1) These items are output only in the detailed reduced format or the detailed extended format.
(*2) For these items, values exceeding 100 are truncated.
(*3) This item is output only for processes executed on VE30. It is output in the Global Data section only when all processes in the aggregation range are executed on such VEs.
(*4) This item is output only for processes executed on VE10/VE10E/VE20. It is output in the Global Data section only when all processes in the aggregation range are executed on such VEs.
Item | Unit | Description |
---|---|---|
L1 I-Cache Miss (sec) | second | L1 instruction cache miss time |
L1 O-Cache Miss (sec) | second | L1 operand cache miss time |
L2 Cache Miss (sec) | second | L2 cache miss time |
| Ratio of the number of elements loaded from L3 cache to the number of elements loaded with load instructions (*3) | |
| Ratio of the number of elements loaded from LLC to the number of elements loaded with vector load instructions (*3) | |
Required B/F | B/F calculated from bytes specified by load and store instructions (*1) (*2) | |
Required Store B/F | B/F calculated from bytes specified by store instructions (*1) (*2) | |
Required Load B/F | B/F calculated from bytes specified by load instructions (*1) (*2) | |
Actual Load B/F | B/F calculated from bytes of actual memory access by load instructions (*1) (*2) (*3) | |
Actual V. Load B/F | B/F calculated from bytes of actual memory access by vector load instructions (*1) (*2) (*4) |
3.5   MPI Communication Information
NEC MPI provides a facility for displaying MPI communication information.
To use this facility, you need to generate the MPI program with the option -mpiprof, -mpitrace, -mpiverify or -ftrace.
There are two formats of MPI communication information, as follows:

Reduced Format: The maximum, minimum, and average values of the MPI communication information of all MPI processes are displayed.

Extended Format: The MPI communication information of each MPI process is displayed in the ascending order of their ranks in the communicator MPI_COMM_WORLD, after the information in the reduced format.

The format to be displayed can be specified by setting the environment variable NMPI_COMMINF at runtime as shown in the following table.
NMPI_COMMINF | Displayed Information |
---|---|
NO | (Default) No Output |
YES | Reduced Format |
ALL | Extended Format |
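For example, the extended format could be obtained by building with -mpiprof and setting NMPI_COMMINF at runtime; the source file name, executable name and process count are illustrative assumptions.

$ mpincc -mpiprof mpi.c -o mpi.out   # mpi.c and mpi.out are illustrative names
$ export NMPI_COMMINF=ALL            # extended format
$ mpirun -np 4 ./mpi.out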
Also, you can change the view of the reduced format by setting the environment variable NMPI_COMMINF_VIEW as shown in the following table.
NMPI_COMMINF_VIEW | Displayed Information |
---|---|
VERTICAL | (Default) Aggregate vector processes and scalar processes separately and arrange the two parts vertically. Items that apply only to vector processes are not output in the scalar process part. |
HORIZONTAL | Aggregate vector processes and scalar processes separately and arrange the two parts horizontally. N/A is output in the scalar process part for items that apply only to vector processes. |
MERGED | Aggregate vector processes and scalar processes together. For items that apply only to vector processes, (V) is output at the end of the line and only vector processes are aggregated for that item. |
The following figure is an example of the extended format.
MPI Communication Information of 4 Vector processes --------------------------------------------------- Min [U,R] Max [U,R] Average Real MPI Idle Time (sec) : 9.732 [0,1] 10.178 [0,3] 9.936 User MPI Idle Time (sec) : 9.699 [0,1] 10.153 [0,3] 9.904 Total real MPI Time (sec) : 13.301 [0,0] 13.405 [0,3] 13.374 Send count : 1535 [0,2] 2547 [0,1] 2037 Memory Transfer : 506 [0,3] 2024 [0,0] 1269 DMA Transfer : 0 [0,0] 1012 [0,1] 388 Recv count : 1518 [0,2] 2717 [0,0] 2071 Memory Transfer : 506 [0,2] 2024 [0,1] 1269 DMA Transfer : 0 [0,3] 1012 [0,2] 388 Barrier count : 8361 [0,2] 8653 [0,0] 8507 Bcast count : 818 [0,2] 866 [0,0] 842 Reduce count : 443 [0,0] 443 [0,0] 443 Allreduce count : 1815 [0,2] 1959 [0,0] 1887 Scan count : 0 [0,0] 0 [0,0] 0 Exscan count : 0 [0,0] 0 [0,0] 0 Redscat count : 464 [0,0] 464 [0,0] 464 Redscat_block count : 0 [0,0] 0 [0,0] 0 Gather count : 864 [0,0] 864 [0,0] 864 Gatherv count : 506 [0,0] 506 [0,0] 506 Allgather count : 485 [0,0] 485 [0,0] 485 Allgatherv count : 506 [0,0] 506 [0,0] 506 Scatter count : 485 [0,0] 485 [0,0] 485 Scatterv count : 506 [0,0] 506 [0,0] 506 Alltoall count : 506 [0,0] 506 [0,0] 506 Alltoallv count : 506 [0,0] 506 [0,0] 506 Alltoallw count : 0 [0,0] 0 [0,0] 0 Neighbor Allgather count : 0 [0,0] 0 [0,0] 0 Neighbor Allgatherv count : 0 [0,0] 0 [0,0] 0 Neighbor Alltoall count : 0 [0,0] 0 [0,0] 0 Neighbor Alltoallv count : 0 [0,0] 0 [0,0] 0 Neighbor Alltoallw count : 0 [0,0] 0 [0,0] 0 Number of bytes sent : 528482333 [0,2] 880803843 [0,1] 704643071 Memory Transfer : 176160755 [0,3] 704643020 [0,0] 440401904 DMA Transfer : 0 [0,0] 352321510 [0,1] 132120600 Number of bytes recvd : 528482265 [0,2] 880804523 [0,0] 704643207 Memory Transfer : 176160755 [0,2] 704643020 [0,1] 440401904 DMA Transfer : 0 [0,3] 352321510 [0,2] 132120600 Put count : 0 [0,0] 0 [0,0] 0 Get count : 0 [0,0] 0 [0,0] 0 Accumulate count : 0 [0,0] 0 [0,0] 0 Number of bytes put : 0 [0,0] 0 [0,0] 0 Number of bytes got : 0 [0,0] 0 [0,0] 0 Number of bytes accum : 0 [0,0] 0 [0,0] 0 MPI Communication Information of 8 Scalar processes --------------------------------------------------- Min [U,R] Max [U,R] Average Real MPI Idle Time (sec) : 4.837 [0,6] 5.367 [0,11] 5.002 User MPI Idle Time (sec) : 4.825 [0,6] 5.363 [0,11] 4.992 Total real MPI Time (sec) : 12.336 [0,11] 12.344 [0,5] 12.340 Send count : 1535 [0,4] 1535 [0,4] 1535 Memory Transfer : 506 [0,11] 1518 [0,5] 1328 Recv count : 1518 [0,4] 1518 [0,4] 1518 Memory Transfer : 506 [0,4] 1518 [0,5] 1328 ... Number of bytes accum : 0 [0,0] 0 [0,0] 0 Data of Vector Process [0,0] [node=0,ve=0]: ------------------------------------------- Real MPI Idle Time (sec) : 10.071094 User MPI Idle Time (sec) : 10.032894 Total real MPI Time (sec) : 13.301340 ... |
The following figure is an example of the reduced format with NMPI_COMMINF_VIEW=MERGED.
MPI Communication Information of 4 Vector and 8 Scalar processes ---------------------------------------------------------------- Min [U,R] Max [U,R] Average Real MPI Idle Time (sec) : 4.860 [0,10] 10.193 [0,3] 6.651 User MPI Idle Time (sec) : 4.853 [0,10] 10.167 [0,3] 6.635 Total real MPI Time (sec) : 12.327 [0,4] 13.396 [0,3] 12.679 Send count : 1535 [0,2] 2547 [0,1] 1702 Memory Transfer : 506 [0,3] 2024 [0,0] 1309 DMA Transfer : 0 [0,0] 1012 [0,1] 388 (V) Recv count : 1518 [0,2] 2717 [0,0] 1702 Memory Transfer : 506 [0,2] 2024 [0,1] 1309 DMA Transfer : 0 [0,3] 1012 [0,2] 388 (V) ... Number of bytes accum : 0 [0,0] 0 [0,0] 0 |
The following table shows the meanings of the items in the MPI communication information. The item "DMA Transfer" is only supported for a vector process.
Item | Unit | Description |
---|---|---|
Real MPI Idle Time | second | Elapsed time for waiting for messages |
User MPI Idle Time | second | User CPU time for waiting for messages |
Total real MPI Time | second | Elapsed time for executing MPI procedures |
Send count | The number of invocations of point-to-point send procedures | |
Memory Transfer | The number of invocations of point-to-point send procedures that use memory copy | |
DMA Transfer | The number of invocations of point-to-point send procedures that use DMA transfer | |
Recv count | The number of invocations of point-to-point receive procedures | |
Memory Transfer | The number of invocations of point-to-point receive procedures that use memory copy | |
DMA Transfer | The number of invocations of point-to-point receive procedures that use DMA transfer | |
Barrier count | The number of invocations of the procedures MPI_BARRIER and MPI_IBARRIER | |
Bcast count | The number of invocations of the procedures MPI_BCAST and MPI_IBCAST | |
Reduce count | The number of invocations of the procedures MPI_REDUCE and MPI_IREDUCE | |
Allreduce count | The number of invocations of the procedures MPI_ALLREDUCE and MPI_IALLREDUCE | |
Scan count | The number of invocations of the procedures MPI_SCAN and MPI_ISCAN | |
Exscan count | The number of invocations of the procedures MPI_EXSCAN and MPI_IEXSCAN | |
Redscat count | The number of invocations of the procedures MPI_REDUCE_SCATTER and MPI_IREDUCE_SCATTER | |
Redscat_block count | The number of invocations of the procedures MPI_REDUCE_SCATTER_BLOCK and MPI_IREDUCE_SCATTER_BLOCK | |
Gather count | The number of invocations of the procedures MPI_GATHER and MPI_IGATHER | |
Gatherv count | The number of invocations of the procedures MPI_GATHERV and MPI_IGATHERV | |
Allgather count | The number of invocations of the procedures MPI_ALLGATHER and MPI_IALLGATHER | |
Allgatherv count | The number of invocations of the procedures MPI_ALLGATHERV and MPI_IALLGATHERV | |
Scatter count | The number of invocations of the procedures MPI_SCATTER and MPI_ISCATTER | |
Scatterv count | The number of invocations of the procedures MPI_SCATTERV and MPI_ISCATTERV | |
Alltoall count | The number of invocations of the procedures MPI_ALLTOALL and MPI_IALLTOALL | |
Alltoallv count | The number of invocations of the procedures MPI_ALLTOALLV and MPI_IALLTOALLV | |
Alltoallw count | The number of invocations of the procedures MPI_ALLTOALLW and MPI_IALLTOALLW | |
Neighbor Allgather count | The number of invocations of the procedures MPI_NEIGHBOR_ALLGATHER and MPI_INEIGHBOR_ALLGATHER | |
Neighbor Allgatherv count | The number of invocations of the procedures MPI_NEIGHBOR_ALLGATHERV and MPI_INEIGHBOR_ALLGATHERV | |
Neighbor Alltoall count | The number of invocations of the procedures MPI_NEIGHBOR_ALLTOALL and MPI_INEIGHBOR_ALLTOALL | |
Neighbor Alltoallv count | The number of invocations of the procedures MPI_NEIGHBOR_ALLTOALLV and MPI_INEIGHBOR_ALLTOALLV | |
Neighbor Alltoallw count | The number of invocations of the procedures MPI_NEIGHBOR_ALLTOALLW and MPI_INEIGHBOR_ALLTOALLW | |
Number of bytes sent | byte | The number of bytes sent by point-to-point send procedures |
Memory Transfer | byte | The number of bytes sent using memory copy by point-to-point send procedures |
DMA Transfer | byte | The number of bytes sent using DMA transfer by point-to-point send procedures |
Number of bytes recvd | byte | The number of bytes received by point-to-point receive procedures |
Memory Transfer | byte | The number of bytes received using memory copy by point-to-point receive procedures |
DMA Transfer | byte | The number of bytes received using DMA transfer by point-to-point receive procedures |
Put count | The number of invocations of the procedures MPI_PUT and MPI_RPUT | |
Memory Transfer | The number of invocations of the procedures MPI_PUT and MPI_RPUT that use memory copy | |
DMA Transfer | The number of invocations of the procedures MPI_PUT and MPI_RPUT that use DMA transfer | |
Get count | The number of invocations of the procedures MPI_GET and MPI_RGET | |
Memory Transfer | The number of invocations of the procedures MPI_GET and MPI_RGET that use memory copy | |
DMA Transfer | The number of invocations of the procedures MPI_GET and MPI_RGET that use DMA transfer | |
Accumulate count | The number of invocations of the procedures MPI_ACCUMULATE, MPI_RACCUMULATE, MPI_GET_ACCUMULATE, MPI_RGET_ACCUMULATE, MPI_FETCH_AND_OP and MPI_COMPARE_AND_SWAP | |
Memory Transfer | The number of invocations of the procedures MPI_ACCUMULATE, MPI_RACCUMULATE, MPI_GET_ACCUMULATE, MPI_RGET_ACCUMULATE, MPI_FETCH_AND_OP and MPI_COMPARE_AND_SWAP that use memory copy | |
DMA Transfer | The number of invocations of the procedures MPI_ACCUMULATE, MPI_RACCUMULATE, MPI_GET_ACCUMULATE, MPI_RGET_ACCUMULATE, MPI_FETCH_AND_OP and MPI_COMPARE_AND_SWAP that use DMA transfer | |
Number of bytes put | byte | The number of bytes put by the procedures MPI_PUT and MPI_RPUT |
Memory Transfer | byte | The number of bytes put using memory copy by the procedures MPI_PUT and MPI_RPUT |
DMA Transfer | byte | The number of bytes put using DMA transfer by the procedures MPI_PUT and MPI_RPUT |
Number of bytes got | byte | The number of bytes got by the procedures MPI_GET and MPI_RGET |
Memory Transfer | byte | The number of bytes got using memory copy by the procedures MPI_GET and MPI_RGET |
DMA Transfer | byte | The number of bytes got using DMA transfer by the procedures MPI_GET and MPI_RGET |
Number of bytes accum | byte | The number of bytes accumulated by the procedures MPI_ACCUMULATE, MPI_RACCUMULATE, MPI_GET_ACCUMULATE, MPI_RGET_ACCUMULATE, MPI_FETCH_AND_OP and MPI_COMPARE_AND_SWAP |
Memory Transfer | byte | The number of bytes accumulated using memory copy by the procedures MPI_ACCUMULATE, MPI_RACCUMULATE, MPI_GET_ACCUMULATE, MPI_RGET_ACCUMULATE, MPI_FETCH_AND_OP and MPI_COMPARE_AND_SWAP |
DMA Transfer | byte | The number of bytes accumulated using DMA transfer by the procedures MPI_ACCUMULATE, MPI_RACCUMULATE, MPI_GET_ACCUMULATE, MPI_RGET_ACCUMULATE, MPI_FETCH_AND_OP and MPI_COMPARE_AND_SWAP |
3.6   FTRACE Facility
The FTRACE facility enables users to
obtain detailed performance information
of each procedure and specified execution region of a program on each MPI
process, including MPI
communication information.
Please refer to
"PROGINF / FTRACE User's Guide" for details.
Note: FTRACE is only available in the program executed on VE.
The following table shows the MPI communication information displayed with the FTRACE facility.
Table 3-15 MPI Communication Information Displayed with the FTRACE Facility

Item | Unit | Meaning |
---|---|---|
ELAPSE | second | Elapsed time |
COMM.TIME | second | Elapsed time for executing MPI procedures |
COMM.TIME / ELAPSE | | The ratio of the elapsed time for executing MPI procedures to the elapsed time of each process |
IDLE TIME | second | Elapsed time for waiting for messages |
IDLE TIME / ELAPSE | | The ratio of the elapsed time for waiting for messages to the elapsed time of each process |
AVER.LEN | byte | Average amount of communication per MPI procedure (the unit uses base 1024) |
COUNT | | Total number of transfers by MPI procedures |
TOTAL LEN | byte | Total amount of communication by MPI procedures (the unit uses base 1024) |
The steps for using the FTRACE facility are as follows:
$ mpincc -ftrace mpi.c
$ mpinfort -ftrace mpifort.f90
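After the program has been built with -ftrace, it is executed as usual, which generates the analysis files (ftrace.out.*); these files are then processed with the ftrace command shown next. The process count and executable name below are illustrative assumptions.

$ mpirun -ve 0 -np 4 ./a.out    # generates ftrace.out.* files for the MPI processes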
$ ftrace -all -f ftrace.out.0.0 ftrace.out.0.1
$ ftrace -f ftrace.out.*
The following figure shows an example
displayed by the FTRACE facility.
*----------------------* FTRACE ANALYSIS LIST *----------------------* Execution Date : Sat Feb 17 12:44:49 2018 JST Total CPU Time : 0:03'24"569 (204.569 sec.) FREQUENCY EXCLUSIVE AVER.TIME MOPS MFLOPS V.OP AVER. VECTOR L1CACHE .... PROC.NAME TIME[sec]( % ) [msec] RATIO V.LEN TIME MISS 1012 49.093( 24.0) 48.511 23317.2 14001.4 96.97 83.2 42.132 5.511 funcA 160640 37.475( 18.3) 0.233 17874.6 9985.9 95.22 52.2 34.223 1.973 funcB 160640 30.515( 14.9) 0.190 22141.8 12263.7 95.50 52.8 29.272 0.191 funcC 160640 23.434( 11.5) 0.146 44919.9 22923.2 97.75 98.5 21.869 0.741 funcD 160640 22.462( 11.0) 0.140 42924.5 21989.6 97.73 99.4 20.951 1.212 funcE 53562928 15.371( 7.5) 0.000 1819.0 742.2 0.00 0.0 0.000 1.253 funcG 8 14.266( 7.0) 1783.201 1077.3 55.7 0.00 0.0 0.000 4.480 funcH 642560 5.641( 2.8) 0.009 487.7 0.2 46.45 35.1 1.833 1.609 funcF 2032 2.477( 1.2) 1.219 667.1 0.0 89.97 28.5 2.218 0.041 funcI 8 1.971( 1.0) 246.398 21586.7 7823.4 96.21 79.6 1.650 0.271 funcJ ------------------------------------------------------------------------------------- .... ----------- 54851346 204.569(100.0) 0.004 22508.5 12210.7 95.64 76.5 154.524 17.740 total ELAPSED COMM.TIME COMM.TIME IDLE TIME IDLE TIME AVER.LEN COUNT TOTAL LEN PROC.NAME TIME[sec] [sec] / ELAPSED [sec] / ELAPSED [byte] [byte] 12.444 0.000 0.000 0.0 0 0.0 funcA 9.420 0.000 0.000 0.0 0 0.0 funcB 7.946 0.000 0.000 0.0 0 0.0 funcG 7.688 0.000 0.000 0.0 0 0.0 funcC 7.372 0.000 0.000 0.0 0 0.0 funcH 5.897 0.000 0.000 0.0 0 0.0 funcD 5.653 0.000 0.000 0.0 0 0.0 funcE 1.699 1.475 0.756 3.1K 642560 1.9G funcF 1.073 1.054 0.987 1.0M 4064 4.0G funcI 0.704 0.045 0.045 80.0 4 320.0 funcK ------------------------------------------------------------------------------------------------------ FREQUENCY EXCLUSIVE AVER.TIME MOPS MFLOPS V.OP AVER. VECTOR L1CACHE .... PROC.NAME TIME[sec]( % ) [msec] RATIO V.LEN TIME MISS 1012 49.093( 24.0) 48.511 23317.2 14001.4 96.97 83.2 42.132 5.511 funcA 253 12.089 47.784 23666.9 14215.9 97.00 83.2 10.431 1.352 0.0 253 12.442 49.177 23009.2 13811.8 96.93 83.2 10.617 1.406 0.1 253 12.118 47.899 23607.4 14180.5 97.00 83.2 10.463 1.349 0.2 253 12.444 49.185 23002.8 13808.2 96.93 83.2 10.622 1.404 0.3 ... ------------------------------------------------------------------------------------- .... ---------- 54851346 204.569(100.0) 0.004 22508.5 12210.7 95.64 76.5 154.524 17.740 total ELAPSED COMM.TIME COMM.TIME IDLE TIME IDLE TIME AVER.LEN COUNT TOTAL LEN PROC.NAME TIME[sec] [sec] / ELAPSED [sec] / ELAPSED [byte] [byte] 12.444 0.000 0.000 0.0 0 0.0 funcA 12.090 0.000 0.000 0.000 0.000 0.0 0 0.0 0.0 12.442 0.000 0.000 0.000 0.000 0.0 0 0.0 0.1 12.119 0.000 0.000 0.000 0.000 0.0 0 0.0 0.2 12.444 0.000 0.000 0.000 0.000 0.0 0 0.0 0.3 |
3.7   MPI Procedures Tracing Facility
NEC MPI provides the facility to trace invocations of and returns from MPI procedures, and the progress of each MPI process is output to the standard output.
The following information is displayed.
The tracing facility makes it easy to see how far a program has progressed and helps with debugging.
In order to use this facility, please generate the MPI program with the -mpitrace option.
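As a usage sketch, the program is rebuilt with -mpitrace and executed as usual; the source file name and process count are illustrative assumptions.

$ mpincc -mpitrace mpi.c    # mpi.c is an illustrative source file name
$ mpirun -np 2 ./a.out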
Note that the amount of trace output can be huge if a program calls MPI procedures many times.
Traceback information can be obtained by calling MPI_Abort.
The following are examples of the output:
[0,0] MPI Abort by user Aborting program ! [0,0] Obtained 5 stack frames. [0,0] aborttest() [0x60000003eb18] [0,0] aborttest() [0x600000006ad0] [0,0] aborttest() [0x600000005b48] [0,0] aborttest() [0x600000005cf8] [0,0] /opt/nec/ve/lib/libc.so.6(__libc_start_main+0x340) [0x600c01c407b0] [0,0] aborttest() [0x600000005a08] [0,0] Aborting program! |
[0,0] MPI Abort by user Aborting program ! [0,0] [ 0] 0x600000001718 abort_test abort.c:33 [0,0] [ 1] 0x600000001600 out out.c:9 [0,0] [ 2] 0x600000001460 hey hey.c:9 [0,0] [ 3] 0x600000001530 main main.c:13 [0,0] [ 4] 0x600c01c407a8 ? ?:? [0,0] [ 5] 0x600000000b00 ? ?:? [0,0] Aborting program! |
3.9   Debug Assist Feature for MPI Collective Procedures
The debug assist feature for MPI
collective procedures assists users in
debugging invocations of MPI collective procedures by detecting incorrect uses
across processes and outputting detected errors in detail to the
standard error output.
The incorrect uses include the cases listed in Table 3-17 at the end of this section.
To use this feature, please generate the MPI program with the -mpiverify option as follows:
$ mpinfort -mpiverify f.f90
When an error is detected, a message including the following information is output to the standard error output.
VERIFY MPI_Bcast(3): root 2 inconsistent with root 1 of 0
The errors to be detected can be specified by setting the environment variable NMPI_VERIFY at runtime as shown in the following table.
NMPI_VERIFY | Detected Errors |
---|---|
0 | No errors are detected. |
3 | (Default) Errors other than those in the argument assert of the procedure MPI_WIN_FENCE |
4 | Errors in the argument assert of the procedure MPI_WIN_FENCE, in addition to the errors detected by default |
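For example, to also check the argument assert of MPI_WIN_FENCE, NMPI_VERIFY could be set to 4 for a program built with -mpiverify; the executable name and process count are illustrative assumptions.

$ export NMPI_VERIFY=4      # also detect errors in the assert argument of MPI_WIN_FENCE
$ mpirun -np 4 ./a.out      # ./a.out is an illustrative program name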
The following table shows the errors that can be detected by the debug assist feature.
Note that this feature involves overhead for checking invocations of MPI collective procedures and can result in lower performance. Therefore, please re-generate the MPI program without the -mpiverify option once the correctness of the collective procedure invocations has been verified.
Table 3-17 Errors Detected by the Debug Assist Feature

Procedure | Target of Checking | Condition |
---|---|---|
All collective procedures | Order of invocations | Processes in the same communicator, or corresponding to the same window or file handle, invoked different MPI collective procedures at the same time. |
Procedures with the argument root | Argument root | The values of the argument root were not the same across processes. |
Collective communication procedures | Message length (extent of an element * the number of elements transferred) | The length of a sent message was not the same as that of the corresponding received message. |
Collective communication procedures that perform reduction operations | Argument op | The values of the argument op (reduction operator) were not the same across processes. |
Topology collective procedures | Graph information and dimensional information | Information of a graph or dimensions specified with arguments was inconsistent across processes. |
MPI_COMM_CREATE | Argument group | The groups specified with the argument group were not the same across processes. |
MPI_INTERCOMM_CREATE | Arguments local_leader and tag | The values of the argument local_leader were not the same across processes in the local communicator, or the values of the argument tag were not the same across the processes corresponding to the argument local_leader or remote_leader. |
MPI_INTERCOMM_MERGE | Argument high | The values of the argument high were not the same across processes. |
MPI_FILE_SET_VIEW | Arguments etype and datarep | The datatypes specified with the argument etype or the data representation specified with the argument datarep were not the same across processes. |
MPI_WIN_FENCE | Argument assert | The values of the argument assert were inconsistent across processes. |
3.10   Exit Status of an MPI Program
NEC MPI watches the exit statuses of MPI processes to determine whether the program execution terminated normally or with an error.
Normal termination occurs if and only if every MPI process returns 0 as its exit status; otherwise, error termination occurs.
Therefore, the exit status of the program execution should be specified as follows so that NEC MPI recognizes the termination status correctly.
#!/bin/sh
{MPIexec}      # MPI-execution specification (launch of MPI processes: refer to this clause)
RC=$?          # holds the exit status
command        # non-MPI program/command
exit $RC       # specify the exit code
3.11   Miscellaneous
This section describes additional notes on NEC MPI.
$ /opt/nec/ve/bin/nreadelf -W -d a.out | grep RUNPATH
0x000000000000001d (RUNPATH) Library runpath: [/opt/nec/ve/mpi/2.2.0/lib64/ve:...]
$ /usr/bin/strings a.out | /bin/grep "library version"
NEC MPI: library Version 2.2.0 (17. April 2019): Copyright (c) NEC Corporation 2018-2019
The MPI memory management library is always dynamically linked, even if the other MPI libraries are statically linked. In this case, it is fine if a newer version of the MPI memory management library is dynamically linked at runtime, as long as its major version matches that of the statically linked MPI libraries.
The MPI compile commands dynamically link MPI programs even if the compiler option -static is specified, so using -static with the MPI compile commands is not recommended. MPI programs require shared system libraries and the shared MPI memory management library to execute, so the MPI compile commands append -Wl,-Bdynamic to the end of the command line to force dynamic linking. Mixing the -Wl,-Bdynamic appended by the MPI compile commands with -static may lead to unexpected behavior.
If you want to link an MPI program against static libraries, you can use the linker option -Bstatic, and compiler options to link against static compiler libraries, instead of the compiler option -static. When you use the linker option -Bstatic, surround the libraries with -Wl,-Bstatic and -Wl,-Bdynamic; the surrounded libraries are linked statically. In the following example, libww and libxx are linked statically.
mpincc a.c -lvv -Wl,-Bstatic -lww -lxx -Wl,-Bdynamic -lyy
For the compiler options used to link a program against static compiler libraries, please refer to the compiler's manual.
mkstemp: Permission denied
MPI uses HugePages to optimize MPI communications. If MPI cannot allocate HugePages on a host, the following warning message is output and the MPI program may terminate abnormally. The configuration of HugePages requires system administrator privileges. If the message is output, please refer to "SX-Aurora TSUBASA Installation Guide" or contact the system administrator for details.
mpid(0): Allocate_system_v_shared_memory: key = 0x420bf67e, len = 16777216
shmget allocation: Cannot allocate memory
The memlock resource limit needs to be set to "unlimited" for MPI to use InfiniBand communication and HugePages. Because this setting is applied automatically, do not change the memlock resource limit from "unlimited" with the ulimit command or similar means. If the memlock resource limit is not "unlimited", MPI execution may abort or MPI communication may slow down with the following messages.
libibverbs: Warning: RLIMIT_MEMLOCK is 0 bytes.
This will severely limit memory registrations.
[0] MPID_OFED_Open_hca: open device failed ib_dev 0x60100002ead0 name mlx5_0
[0] Error in Infiniband/OFED initialization. Execution aborts
Even if the memlock resource limit is set to "unlimited", the following messages may be output to the system log. These messages are not a problem, and the MPI execution works correctly.
mpid(0): Allocate_system_v_shared_memory: key = 0xd34d79c0, len = 16777216
shmget allocation: Operation not permitted
kernel: mpid (20934): Using mlock ulimits for SHM_HUGETLB is deprecated
If the process terminates abnormally during the application execution, information related to the cause of the abnormal termination (error details, termination status, etc.) is output with the universe number and rank number. However, depending on the timing of abnormal termination, many messages such as the following may be output, making it difficult to refer to the information related to the cause of the abnormal termination.
In this case, it may be easier to refer to that information by excluding messages such as the ones shown below; an example command is also shown below.
[3] mpisx_sendx: left (abnormally) (rc=-1), sock = -1 len 0 (12)
Error in send () called by mpisx_sendx: Bad filedescriptor
$ grep -v mpisx_sendx <outputfile>
When an MPI program is executed on Model A412-8, B401-8 or C401-8 using an NQSV request that requests multiple logical nodes, the NQSV option --use-hca needs to be set to the number of available HCAs so that NEC MPI can select appropriate HCAs. Otherwise, the following error may occur at the end of the MPI execution.
mpid(0): accept_process_answer: Application 0: No valid IB device found which is requested by environment variable NMPI_IP_USAGE=OFF. Specify NMPI_IP_USAGE=FALLBACK if TCP/IP should be used in this case !
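For example, on such a model with two HCAs available per logical node, the NQSV request could specify the option as follows; the value 2 and the other resource values are illustrative assumptions.

#PBS -T necmpi
#PBS -b 2                # Number of VHs (illustrative)
#PBS --venum-lhost=8     # Number of VEs (illustrative)
#PBS --use-hca=2         # Number of available HCAs (assumed to be 2 here)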
When using VEO, VE memory passed directly to MPI procedures must be allocated with veo_alloc_hmem.
An MPI process cannot execute the following system calls and library functions:
processes on VE: fork, popen, posix_spawn
processes on VH or scalar host: fork, system, popen, posix_spawn
Additionally, if a process on a VE uses non-blocking MPI-IO and VE AIO (the default) is selected as the asynchronous I/O method, the process cannot execute system() until the MPI-IO completes.
If any of these system calls or library functions is executed, the MPI program may result in problems, such as program stall or abnormal termination.
The malloc_info function cannot be used in MPI programs; if it is called, it may return incorrect values. MPI programs ignore the M_PERTURB, M_ARENA_MAX and M_ARENA_TEST arguments of the mallopt function and the MALLOC_PERTURB_, MALLOC_ARENA_MAX and MALLOC_ARENA_TEST environment variables. (Note: in the case of a VE program, VE_ is prefixed to these environment variables.)
If you source the setup script "necmpivars.sh", "necmpivars.csh", "necmpivars-runtime.sh" or "necmpivars-runtime.csh" without explicit parameters in a shell script, parameters specified to the shell script may be passed to the setup script. If invalid parameters are passed to the setup script, the following message is output and LD_LIBRARY_PATH is not updated.
necmpivars.sh: Warning: invalid argument. LD_LIBRARY_PATH is not updated.
When the AVEO UserDMA feature is enabled, available VE memory may not increase even if veo_free_hmem is called to free VE memory or veo_proc_destroy is called to terminate a VEO process.
When the AVEO UserDMA feature is enabled, users cannot call veo_proc_create or similar functions to create a new VEO process after calling veo_proc_destroy. Doing so may result in abnormal termination or incorrect results.
When an MPI program is executed through NQSV and all of the following conditions are fulfilled, NEC MPI uses SIGSTOP, SIGCONT and SIGUSR2. Therefore, user programs must not handle (trap, hold or ignore) those signals, and processes must not be controlled by a debugger such as gdb; otherwise, the MPI program may be stopped or terminated abnormally.
When using CUDA, GPU memory passed directly to MPI procedures must be allocated with cudaMalloc, cudaMallocPitch or cudaMalloc3D.
When invoking MPI processes on VE30 and MPI processes on VE10/VE10E/VE20 at the same time, you cannot source the MPI setup scripts (necmpivars.sh and so on). In this case, the mpirun command needs to be specified as /opt/nec/ve/bin/mpirun or /opt/nec/ve3/{version}/bin/runtime/mpirun ({version} is the directory name corresponding to the version of NEC MPI you use).
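As a sketch, such a mixed execution might specify the full path of mpirun as follows; the host names, VE numbers, process counts and executable names are illustrative assumptions.

$ /opt/nec/ve/bin/mpirun -host hostA -ve 0 -np 4 ./ve30.out : -host hostB -ve 0 -np 4 ./ve10.out
# hostA is assumed to have VE30 cards and hostB VE10/VE10E/VE20 cards; ve30.out and ve10.out are illustrative names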
When using CUDA with NVIDIA CUDA Toolkit 11.2 or earlier, the following message may be displayed. It means that the API needed to enable the GPUDirect RDMA feature is not available. Please ignore this message if you do not use the GPUDirect RDMA feature. You can suppress it by specifying the environment variable NMPI_IB_GPUDIRECT_ENABLE=OFF, which disables the GPUDirect RDMA feature.
MPID_CUDA_Init_GPUDirect: Cannot dynamically load CUDA symbol cuFlushGPUDirectRDMAWrites MPID_CUDA_Init_GPUDirect: Error message /lib64/libcuda.so: undefined symbol: cuFlushGPUDirectRDMAWrites
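If the GPUDirect RDMA feature is not used, the message could be suppressed as follows; the executable name is taken from the earlier -cuda example and is illustrative.

$ export NMPI_IB_GPUDIRECT_ENABLE=OFF    # disable the GPUDirect RDMA feature and suppress the message
$ mpirun -cuda -np 8 ./mpi-cuda          # ./mpi-cuda is an illustrative program name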
The MPI compile commands dynamically link the MPI memory management library so that calls to functions such as malloc and free in a program execute the functions provided by that library. For this reason, programs linked with the MPI compile commands should not call malloc, free and similar functions provided by other libraries; doing so can cause memory corruption. If you want to implement wrappers for functions such as malloc and free, obtain the functions provided by the MPI memory management library with dlsym(RTLD_NEXT) and call them from the wrappers.
Even if you try to reduce non-swappable memory during Switch Over by specifying the environment variable NMPI_SWAP_ON_HOLD=ON, the non-swappable memory may not be reduced as expected. The memory targeted by direct InfiniBand transfers and the global memory returned by MPI procedures such as MPI_Alloc_mem become non-swappable memory, but because performance may change, NMPI_SWAP_ON_HOLD=ON does not reduce this memory. If you want to prioritize reducing non-swappable memory, please specify additional environment variables depending on the result of mpirun -v.