Chapter 2


Overview of MPI features

2.1   Introduction

This chapter explains the major features of MPI.

Please refer to the MPI standard documents published by the MPI Forum for further details.


2.2   Execution Management and Environment Querying

2.2.1   Initialization and Finalization of MPI Environment

Programs that use MPI must call the procedure MPI_INIT or MPI_INIT_THREAD to initialize the MPI environment, and the procedure MPI_FINALIZE to finalize it.

Users can call various MPI procedures, such as point-to-point and collective communication procedures, between initialization and finalization. The procedures MPI_INITIALIZED, MPI_FINALIZED, and MPI_GET_VERSION can be called at any time, even before the procedure MPI_INIT or MPI_INIT_THREAD is invoked and after the procedure MPI_FINALIZE is invoked. The procedure MPI_INITIALIZED determines whether the procedure MPI_INIT or MPI_INIT_THREAD has already been called, and the procedure MPI_FINALIZED determines whether the procedure MPI_FINALIZE has already been called.

The following are examples of Fortran and C programs that call the MPI procedures for initialization and finalization.


PROGRAM MAIN
use mpi
INTEGER IERROR
! Establish the MPI environment
CALL MPI_INIT(IERROR)

!  Terminate the MPI environment
CALL MPI_FINALIZE(IERROR)

STOP
END


#include <mpi.h>

int main(int argc, char **argv){
/* Establish the MPI environment */
  MPI_Init( &argc, &argv );
/* Terminate the MPI environment */
  MPI_Finalize();
  return 0;
}


2.2.2   Opaque Object and Handle

Objects managed by MPI, such as communicators and MPI datatypes, are called opaque objects. MPI defines data called handles that correspond to the respective opaque objects. Users operate on opaque objects indirectly by passing the corresponding handles to MPI procedures; they cannot access the objects directly.


2.2.3   Error Handling

MPI defines an error-handling mechanism, which enables users to handle errors detected in MPI procedures, such as invalid arguments. With this mechanism, when an error is detected, a user-defined function called an error handler is invoked. Please refer to Section 2.8 for details.


2.3   Process Management

2.3.1   MPI Process

MPI processes are units of processing that are executed independently of each other, each with its own CPU resources and memory space. Execution of an MPI program starts at the beginning of the program. However, MPI procedures can be invoked only after the procedure MPI_INIT or MPI_INIT_THREAD has been invoked (with the exceptions described in Section 2.2.1). The dynamic process creation facility enables creation of additional MPI processes during the execution of an MPI program. This facility is explained in Section 2.4.


2.3.2   Communicator

Ordered sets of MPI processes are called groups. Processes in a group have unique identification numbers of integer value starting at zero, which are called ranks. A communicator indicates a communication scope corresponding to a group of MPI processes that can communicate with each other. A communicator specifies a group, and MPI processes are identified by their ranks in the group. This identification is needed to specify the processes to or from which messages are transferred.

Users can define a new communicator corresponding to a new group. MPI defines the predefined communicator MPI_COMM_WORLD as an MPI reserved name. The communicator MPI_COMM_WORLD is available after the procedure MPI_INIT or MPI_INIT_THREAD is called, and the group corresponding to the communicator contains all MPI processes that are started with the MPI execution commands.

Communicators are classified into intra-communicators and inter-communicators. The former consists of only one group of MPI processes; in communication on an intra-communicator, messages are transferred within the group. The latter consists of two disjoint groups. For a process in an inter-communicator, one group is called the local group, to which the process belongs, and the other is the remote group, to which the process does not belong. In communication on an inter-communicator, messages are always transferred between the two groups.


2.3.3   Determining the Number of MPI Processes

The total number of MPI processes at the beginning of a program can be obtained by calling the procedure MPI_COMM_SIZE with the communicator MPI_COMM_WORLD as an argument.


2.3.4   Ranking and Identification of MPI Processes

MPI processes can be identified using a communicator and ranks, which are the identification numbers of processes in a communicator. Ranks have integer values in the range from 0 to one less than the number of MPI processes in the communicator. The rank of the executing process can be obtained by calling the procedure MPI_COMM_RANK using the communicator as an argument.

Users can construct a group of specified MPI processes and assign a communicator to the group. When a communicator is defined, ranks in the communicator are assigned to the processes. Thus, if an MPI process belongs to two communicator groups, the MPI process generally has different ranks in the respective communicators.
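
As a minimal C sketch of these procedures (not specific to NEC MPI), the following program obtains the number of processes and the rank of the calling process in MPI_COMM_WORLD, and then uses the procedure MPI_COMM_SPLIT to create a new communicator in which the same process generally has a different rank.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv){
  int world_size, world_rank, new_rank;
  MPI_Comm newcomm;

  MPI_Init(&argc, &argv);
  /* total number of MPI processes and the rank of this process in MPI_COMM_WORLD */
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

  /* split MPI_COMM_WORLD into two new communicators (even and odd world ranks);
     ranks in each new communicator are assigned independently, starting at 0 */
  MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &newcomm);
  MPI_Comm_rank(newcomm, &new_rank);

  printf("rank %d of %d in MPI_COMM_WORLD has rank %d in the new communicator\n",
         world_rank, world_size, new_rank);

  MPI_Comm_free(&newcomm);
  MPI_Finalize();
  return 0;
}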


2.4   Dynamic Process Management

The dynamic process management facility consists of dynamic process creation and dynamic process connection.

Through the dynamic process creation facility, new MPI processes can be started dynamically during the execution of a program. Two procedures are provided for this purpose: MPI_COMM_SPAWN and MPI_COMM_SPAWN_MULTIPLE. The former spawns MPI processes from a single executable file with the same arguments. The latter spawns MPI processes from two or more executable files; it must also be used if a single executable file is started with different arguments for each process.

As a result of dynamic process creation, a new communicator MPI_COMM_WORLD, which is completely different from the existing communicator MPI_COMM_WORLD, is generated and contains the spawned MPI processes. Furthermore, an inter-communicator that consists of the existing communicator MPI_COMM_WORLD and the new MPI_COMM_WORLD is generated.

MPI processes started by the procedure MPI_COMM_SPAWN or MPI_COMM_SPAWN_MULTIPLE run as if they had been started with the MPI execution commands. The spawned child processes can obtain an inter-communicator that is necessary to communicate with their parent processes.
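
The following C sketch illustrates both sides of dynamic process creation; the executable name "child" and the number of spawned processes are illustrative assumptions. The parent processes call MPI_Comm_spawn collectively, and a spawned process recognizes that it was spawned because MPI_Comm_get_parent returns an inter-communicator instead of MPI_COMM_NULL.

#include <mpi.h>

int main(int argc, char **argv){
  MPI_Comm parent, intercomm;

  MPI_Init(&argc, &argv);
  MPI_Comm_get_parent(&parent);

  if (parent == MPI_COMM_NULL) {
    /* This process was started by the MPI execution command:
       spawn 2 additional MPI processes from the executable "child"
       (a hypothetical program name) and obtain an inter-communicator */
    MPI_Comm_spawn("child", MPI_ARGV_NULL, 2, MPI_INFO_NULL, 0,
                   MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
  } else {
    /* This process was spawned dynamically: the parent processes are
       reachable through the inter-communicator returned above */
    intercomm = parent;
  }

  MPI_Finalize();
  return 0;
}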

The functions for dynamic process connection can be used to establish communication between two groups of MPI processes that do not share a communicator. The procedures MPI_OPEN_PORT, MPI_COMM_ACCEPT, and MPI_CLOSE_PORT are used on the server side, and the procedure MPI_COMM_CONNECT on the client side. The two groups are connected by an inter-communicator. The communication between the two groups is performed by TCP/IP function calls within NEC MPI. Note that only MPI applications that use the same numerical storage width can be connected by dynamic process connection.


2.5   Inter-Process Communication

Inter-process communication functionality can be classified into point-to-point communication, collective operations, and one-sided communication.


2.5.1   Point-to-Point Communication

In point-to-point communication, communication between two MPI processes is established when one MPI process calls the send procedure and the other MPI process calls the receive procedure.

The arguments used to call either a send procedure or a receive procedure include the user data that is the object of the communication, its type and size, the communicator and the rank that identify the communication partner, and a tag that identifies a communication request. A tag is attached to a communication message in a send operation; in a receive operation, it is used to identify a communication message. A communication message includes not only the user data but also its type and size, and the rank and tag information.

Specifying the MPI reserved name MPI_ANY_TAG as the tag in the receive call makes it possible to receive a message sent by the party with the specified rank, regardless of the value of its tag. Moreover, specifying the MPI reserved name MPI_ANY_SOURCE as the rank makes it possible to receive a message with the specified tag value without specifying the party. When MPI_ANY_TAG and MPI_ANY_SOURCE are specified together, a matching message is received without restricting either the party or the tag. Neither MPI_ANY_TAG nor MPI_ANY_SOURCE can be specified in a send procedure.
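
The following C sketch (assuming at least two MPI processes) shows a standard-mode send from rank 1 and a receive on rank 0 that specifies MPI_ANY_SOURCE and MPI_ANY_TAG; the actual sender and tag can afterwards be read from the status argument.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv){
  int rank, data = 0;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 1) {
    data = 100;
    /* standard-mode blocking send to rank 0 with tag 5 */
    MPI_Send(&data, 1, MPI_INT, 0, 5, MPI_COMM_WORLD);
  } else if (rank == 0) {
    /* receive from any source with any tag */
    MPI_Recv(&data, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
             MPI_COMM_WORLD, &status);
    /* the actual sender and tag are recorded in the status object */
    printf("received %d from rank %d with tag %d\n",
           data, status.MPI_SOURCE, status.MPI_TAG);
  }

  MPI_Finalize();
  return 0;
}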

Communication Modes

Point-to-point communication has four communication modes: synchronous, buffered, ready, and standard.

The name of the send procedure determines the mode used in a particular communication: MPI_SSEND for synchronous mode, MPI_BSEND for buffered mode, MPI_RSEND for ready mode, and MPI_SEND for standard mode. For all modes, the name of the receive procedure is MPI_RECV.

This section discusses the four communication modes.

Nonblocking Communication

The four communication modes described above are also called blocking communication modes. In point-to-point communication, a nonblocking variant also exists for each of the four modes.

Whereas blocking communication creates and sends communication messages within the send procedure and selects and receives communication messages within the receive procedure, nonblocking communication requires the execution of a communication startup procedure and a communication termination confirmation procedure. The communication startup procedure returns control to the user after receiving its arguments and generating a communication request. The actual communication processing, such as generating and sending communication messages and selecting and receiving communication messages, is performed by the MPI implementation concurrently with the processing of the user's program.

A communication request establishes the association between a communication startup procedure and a communication termination confirmation procedure. The completion of a communication is ensured by calling the communication termination confirmation procedure with the communication request as an argument.

When using nonblocking communication, the user must ensure that the sending process does not update the user data being transmitted, and that the receiving process does not update or reference the user data being received, from the time the communication startup procedure is called until the corresponding communication termination confirmation procedure completes.

The procedure name for nonblocking communication has the letter I attached to the prefix MPI_ of the blocking communication procedure name. Thus, the name of the begin-to-receive procedure is MPI_IRECV and the name of the begin-to-send procedure is MPI_ISSEND for synchronous mode, MPI_IBSEND for buffered mode, MPI_IRSEND for ready mode, and MPI_ISEND for standard mode. The name of the procedure to wait for the completion of communication is MPI_WAIT. This name is common among all send and receive procedures in any mode.

It is not necessary that sending in blocking communication be associated with reception in blocking communication or that sending in nonblocking communication be associated with reception in nonblocking communication. Data transmitted in nonblocking mode can be received in blocking mode, or vice versa.

The following are examples of Fortran and C programs that perform nonblocking communication in standard mode for both send and receive.

Example:

SUBROUTINE SUB(DATA1, N1, DATA2, N2)
use mpi
INTEGER MYRANK, IERROR, IRANK, ITAG
INTEGER ISTAT(MPI_STATUS_SIZE)
! Define the type of two handles, IREQ0 and IREQ1
INTEGER IREQ0, IREQ1
DOUBLE PRECISION DATA1(N1), DATA2(N2)
CALL MPI_COMM_RANK(MPI_COMM_WORLD, MYRANK,IERROR)

ITAG=10
IF(MYRANK.EQ.0) THEN
  IRANK=1
! nonblocking standard mode send operation of DATA1 to the process with rank=1 
  CALL MPI_ISEND(DATA1,N1,MPI_DOUBLE_PRECISION,IRANK,ITAG,MPI_COMM_WORLD,IREQ0,IERROR)
ENDIF

IF(MYRANK.EQ.1) THEN
  IRANK=0
! nonblocking standard mode receive operation of DATA1 from the process with rank=0 
  CALL MPI_IRECV(DATA1,N1,MPI_DOUBLE_PRECISION,IRANK,ITAG,MPI_COMM_WORLD,IREQ1,IERROR)
ENDIF

DO I=1,N2
  DATA2(I)=SQRT(REAL(I))
ENDDO

IF(MYRANK.EQ.0) THEN
!  wait until the send operation completes
  CALL MPI_WAIT(IREQ0,ISTAT,IERROR)
ENDIF

IF(MYRANK.EQ.1) THEN
! wait until the receive operation completes 
  CALL MPI_WAIT(IREQ1,ISTAT,IERROR)
ENDIF
RETURN
END

#include <mpi.h>
#include <math.h>

int sub(double *data1, int n1, double *data2, int n2){
int myrank, irank, itag;
/* Define the type of two handles, ireq0 and ireq1 */
MPI_Request ireq0, ireq1; 
MPI_Status status;

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

itag=10;
if(myrank==0){
  irank=1;
/* nonblocking standard mode send operation of data1 to the process with rank=1 */
  MPI_Isend(data1,n1,MPI_DOUBLE,irank, itag, MPI_COMM_WORLD, &ireq0);
  }
if(myrank==1){
  irank=0;
/* nonblocking standard mode receive operation of data1 from the process with rank=0 */
  MPI_Irecv(data1,n1,MPI_DOUBLE,irank, itag, MPI_COMM_WORLD, &ireq1);
}
for(int i=0;n2>i;i++)data2[i]=sqrt((double) (i+1));
/* wait until the send operation completes */
if(myrank == 0)MPI_Wait(&ireq0,&status);
/* wait until the receive operation completes */ 
if(myrank == 1)MPI_Wait(&ireq1,&status);

return MPI_SUCCESS;
}

Prior Confirmation of Communication Messages

In point-to-point communication, before calling the receive procedure, the program can verify whether a target communication message has already been sent and is ready to be received. The procedures MPI_PROBE and MPI_IPROBE are used to interrogate communication messages. The status of a target communication message can be ascertained by specifying as arguments the communicator, the rank of the sending MPI process, and the tag value of the communication message. The procedure MPI_PROBE waits until the target communication message can be received. The procedure MPI_IPROBE, on the other hand, only returns whether or not the message is ready to be received, without waiting for it.
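
The following C sketch (assuming at least two MPI processes; the tag value is illustrative) uses the procedure MPI_PROBE together with MPI_GET_COUNT so that the receiver can determine the size of a pending message before posting the matching receive.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv){
  int rank, count;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 1) {
    int data[8] = {0};
    MPI_Send(data, 8, MPI_INT, 0, 7, MPI_COMM_WORLD);
  } else if (rank == 0) {
    /* wait until a message with tag 7 from rank 1 can be received */
    MPI_Probe(1, 7, MPI_COMM_WORLD, &status);
    /* determine the number of MPI_INT elements in the pending message */
    MPI_Get_count(&status, MPI_INT, &count);
    int *buf = malloc(count * sizeof(int));
    MPI_Recv(buf, count, MPI_INT, 1, 7, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    free(buf);
  }

  MPI_Finalize();
  return 0;
}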


2.5.2   Collective Operations

In a collective operation, all MPI processes belonging to the group associated with a given communicator call a collective operation procedure in order to perform communication, computation, and/or execution control. All collective routines take a communicator as an argument.

Collective operations are applicable to both intra-communicators and inter-communicators.

This section explains synchronization control, broadcast communication, and GATHER/SCATTER communication, which are the principal functions among the collective operations.

Synchronization

The name of the procedure for synchronization is MPI_BARRIER. Each MPI process that calls this procedure waits until all other MPI processes in the communicator specified in the argument list have also invoked the procedure. When an inter-communicator is specified as the communicator argument, the procedure MPI_BARRIER returns only after all MPI processes in both the local and remote groups have called the procedure.

The following are examples of Fortran and C programs that use synchronization control (barrier). In these examples, the intra-communicator MPI_COMM_WORLD is used.

Example:

 SUBROUTINE SUB(DATA, N)
use mpi
INTEGER MYRANK,IERROR,IRANK,ITAG
INTEGER DATA(N)
INTEGER IREQ
INTEGER STATUS(MPI_STATUS_SIZE)

CALL MPI_COMM_RANK(MPI_COMM_WORLD,MYRANK,IERROR)

ITAG=10

IF(MYRANK.EQ.1) THEN
  IRANK=0
  CALL MPI_IRECV(DATA,N,MPI_INTEGER,IRANK,ITAG,MPI_COMM_WORLD,IREQ,IERROR)
ENDIF
! Ensure that all processes in the communicator have posted
! the receive operation (required for the ready-mode send below)

CALL MPI_BARRIER(MPI_COMM_WORLD,IERROR)

IF(MYRANK.EQ.0) THEN
  IRANK=1
  CALL MPI_RSEND(DATA,N,MPI_INTEGER,IRANK,ITAG,MPI_COMM_WORLD,IERROR)
ENDIF

IF(MYRANK.EQ.1) THEN
 CALL MPI_WAIT(IREQ,STATUS,IERROR)
ENDIF

RETURN
END

#include <mpi.h>

int sub(int *data, int n){
int myrank, irank, itag;
MPI_Request ireq; 
MPI_Status status;

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

itag=10;

if(myrank==1){
  irank=0;
  MPI_Irecv(data,n,MPI_INT,irank, itag, MPI_COMM_WORLD, &ireq);
}
/* Ensure that all processes in the communicator have posted the receive operation
   (required for the ready-mode send below) */
MPI_Barrier(MPI_COMM_WORLD);

if(myrank==0){
  irank=1;
  MPI_Rsend(data,n,MPI_INT,irank, itag, MPI_COMM_WORLD);
}

if(myrank == 1)MPI_Wait(&ireq,&status);

return MPI_SUCCESS;
}

Broadcast

The name of the procedure for broadcast communication is MPI_BCAST. This procedure takes a data area, a datatype, a data count, and a root rank as arguments in addition to the communicator argument. All MPI processes that call the procedure must specify the same value for the root rank argument. In broadcast communication, the MPI process that has the specified rank becomes the root process, which sends the data; all other MPI processes receive the data.

The root process is the unique sender or receiver in a collective operation. A root process is used not only in broadcast communication, but also in the GATHER and SCATTER communication described in the next section.

When an inter-communicator is specified as the communicator argument, the root process is located in one of the two groups of the inter-communicator. The data of the root process is transferred to all MPI processes in the remote group.

The following are examples of Fortran and C programs that use broadcast communication. In these examples, the rank-0 MPI process specified by the root rank argument is the root process that sends DATA, and all other MPI processes receive DATA.

Example:

 SUBROUTINE SUB(DATA, N)
use mpi
INTEGER IERROR,IROOT,MYRANK
INTEGER DATA(N)

CALL MPI_COMM_RANK(MPI_COMM_WORLD,MYRANK,IERROR)

IF(MYRANK.EQ.0) THEN
  DO I = 1,N
    DATA(I) = I
  ENDDO
ENDIF
! broadcast send operation of data from the process with rank=0 to 
! all other processes in MPI_COMM_WORLD
IROOT = 0
CALL MPI_BCAST(DATA,N,MPI_INTEGER,IROOT,MPI_COMM_WORLD,IERROR)

RETURN
END

 
#include <mpi.h>

int sub(int *data, int n){
 int myrank, iroot;
 MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

 if(myrank == 0){
  for(int i=0;n>i;i++)data[i]=i;
 }
 iroot = 0;
 /* broadcast send operation of data from the process
  with rank=0 to all other processes in MPI_COMM_WORLD */
  MPI_Bcast(data, n, MPI_INT, iroot, MPI_COMM_WORLD);

  return MPI_SUCCESS;
}

Gather Communication and Scatter Communication

The procedure MPI_GATHER gathers data from all MPI processes into the root process. The procedure MPI_SCATTER distributes data from the root process to all MPI processes. SCATTER communication differs from broadcast communication in that broadcast sends the same data to all MPI processes in the communicator, whereas SCATTER transmits a different part of the send buffer to each MPI process.

When an inter-communicator is specified as the communicator argument, the data-gathering operation of the procedure MPI_GATHER collects data from the group that does not contain the root process, and the result is transferred to the root process in the remote group. In the procedure MPI_SCATTER, the data of the root process is distributed to the MPI processes in the remote group.

The following are examples of Fortran and C programs that use the procedure MPI_GATHER, where N denotes the number of elements in the array ALLDATA and M denotes the number of elements in the array PARTDATA. The value of N must be greater than or equal to M * (the number of processes in the communicator MPI_COMM_WORLD).

Example:

SUBROUTINE SUB(ALLDATA, N, PARTDATA, M)
use mpi
INTEGER IERROR,IROOT
INTEGER ALLDATA(N),PARTDATA(M)

! gather operation of the buffer PARTDATA from all processes in MPI_COMM_WORLD
! to the buffer ALLDATA on the root process with rank=0
IROOT=0
CALL MPI_GATHER(PARTDATA,M,MPI_INTEGER,ALLDATA,M,MPI_INTEGER, IROOT,  & 
       MPI_COMM_WORLD,IERROR)

RETURN
END

#include <mpi.h>

int sub(int *alldata, int n, int *partdata, int m){
int iroot;
iroot = 0;
/* gather operation of the buffer partdata from all processes in MPI_COMM_WORLD
to the buffer alldata on the root process with rank=0 */
MPI_Gather(partdata, m, MPI_INT, alldata, m, MPI_INT,iroot, MPI_COMM_WORLD);

return MPI_SUCCESS;
}

The following are examples of Fortran and C programs that use the procedure MPI_SCATTER, where N denotes the number of elements in the array ALLDATA and M denotes the number of elements in the array PARTDATA. The value of N must be greater than or equal to M * (the number of processes in the communicator MPI_COMM_WORLD).

Example:

 SUBROUTINE SUB(ALLDATA, N, PARTDATA, M)
use mpi
INTEGER IERROR,IROOT
INTEGER ALLDATA(N),PARTDATA(M)

! scatter operation of the buffer ALLDATA from the root process with rank=0  
! to the buffers PARTDATA on all processes in MPI_COMM_WORLD
IROOT=0
CALL MPI_SCATTER(ALLDATA,M,MPI_INTEGER,PARTDATA,M,MPI_INTEGER,   &
IROOT, MPI_COMM_WORLD,IERROR)

RETURN
END

#include <mpi.h>

int sub(int *alldata, int n, int *partdata, int m){
int iroot;

iroot = 0;
/* scatter operation of the buffer alldata from the root process with rank=0 to the buffers partdata on all processes in MPI_COMM_WORLD */
MPI_Scatter(alldata, m, MPI_INT, partdata, m, MPI_INT, iroot, MPI_COMM_WORLD);

return MPI_SUCCESS;
}

Nonblocking Collective Operations

The collective operations explained in the previous sections are blocking operations, which perform communication and calculations and return control to the user program only after these operations are completed. Like point-to-point communication, collective operations also have nonblocking variants. The procedure names for nonblocking collective operations have the letter I attached immediately after the prefix MPI_ of the blocking procedure name; for example, the procedure MPI_IBCAST performs a nonblocking broadcast.

A nonblocking call initiates a collective operation and may return control to the user program before the operation completes. The nonblocking call returns a communication request, which is used to wait for completion of the operation. Nonblocking collective operations can be used in the same way as nonblocking point-to-point communication, but there are some differences.
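
A minimal C sketch of a nonblocking collective operation: MPI_IBCAST starts the broadcast and returns a request, other work that does not touch the buffer can be overlapped, and MPI_WAIT completes the operation.

#include <mpi.h>

int main(int argc, char **argv){
  int rank, data[4] = {0};
  MPI_Request request;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 0)
    for (int i = 0; i < 4; i++) data[i] = i;

  /* start a nonblocking broadcast from rank 0 */
  MPI_Ibcast(data, 4, MPI_INT, 0, MPI_COMM_WORLD, &request);

  /* ... computation that does not touch "data" can overlap here ... */

  /* complete the broadcast */
  MPI_Wait(&request, MPI_STATUS_IGNORE);

  MPI_Finalize();
  return 0;
}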

HPC-X (SHARP) Support

NEC MPI supports offloading MPI collective operations to the InfiniBand network by using the SHARP feature provided by the HPC-X software package from NVIDIA Corporation. This offloading feature improves the performance and scalability of the MPI collective communication procedures listed below, if MPI_COMM_WORLD, or a communicator composed of the same MPI process group as MPI_COMM_WORLD, is used.

The environment variables described in Section 3.2.4 can be used to control SHARP usage in NEC MPI. Note that this offloading feature can only be used if InfiniBand and the SHARP environment are properly configured and all MPI processes are running on VEs.

2.5.3   One-Sided Communication

One-sided communication is a set of operations in which explicit cooperation with another process is not required for the transfer of data; that is, an MPI process can transfer data to or from another process on its own.

Window

In order to perform one-sided communication, an MPI process has to register the memory region that it allows other processes to access. This region is called a window. The procedures MPI_WIN_CREATE, MPI_WIN_ALLOCATE, MPI_WIN_ALLOCATE_SHARED, and MPI_WIN_CREATE_DYNAMIC are used to register a window. If the procedure MPI_WIN_CREATE_DYNAMIC is used, a separate call to the procedure MPI_WIN_ATTACH is required to attach a memory region to the window. The procedure MPI_WIN_FREE frees the window. These procedures are all collective operations.

One-Sided Communication Procedures

The following procedures perform one-sided communication:

The procedure MPI_GET reads data from a window on a target process. The procedure MPI_PUT writes data to a window on a target process. The procedure MPI_ACCUMULATE is similar to the procedure MPI_PUT, but performs a specified operation instead of a simple update. The procedure MPI_GET_ACCUMULATE also performs the specified operation and updates the target window, but the contents of the target window before the update are returned to the origin process. The procedure MPI_FETCH_AND_OP performs the same operation as the procedure MPI_GET_ACCUMULATE, but only one element can be specified. The procedure MPI_COMPARE_AND_SWAP replaces the contents of the target window with the origin buffer if the comparison buffer is the same as the target buffer; the target buffer is always returned to the origin process.

The procedures MPI_RGET, MPI_RPUT, MPI_RACCUMULATE, and MPI_RGET_ACCUMULATE perform the same operations as the procedures MPI_GET, MPI_PUT, MPI_ACCUMULATE, and MPI_GET_ACCUMULATE, respectively, and additionally return MPI communication requests. A completion call, such as the procedure MPI_WAIT or MPI_TEST, can be used to complete the operation. A synchronization call for one-sided communication, described below, also completes the communication initiated by these procedures, but a completion call on the request is still required.
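
The following C sketch (assuming at least two MPI processes) shows a request-based operation: MPI_RPUT is issued inside a passive-target epoch opened with MPI_WIN_LOCK_ALL, and the returned request is completed with MPI_WAIT.

#include <mpi.h>

int main(int argc, char **argv){
  int rank, size, value = 0;
  MPI_Win win;
  MPI_Request req;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  /* every process exposes one integer through a window */
  MPI_Win_create(&value, sizeof(int), sizeof(int),
                 MPI_INFO_NULL, MPI_COMM_WORLD, &win);

  /* request-based operations must be used inside a passive-target epoch */
  MPI_Win_lock_all(0, win);
  if (rank == 0 && size > 1) {
    int src = 42;
    /* request-based put into the window of rank 1 */
    MPI_Rput(&src, 1, MPI_INT, 1, 0, 1, MPI_INT, win, &req);
    /* the completion call completes the operation at the origin */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
  }
  MPI_Win_unlock_all(win);

  MPI_Win_free(&win);
  MPI_Finalize();
  return 0;
}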

Synchronization Calls

Three kinds of synchronization mechanisms are provided to synchronize one-sided communication: fence synchronization with the procedure MPI_WIN_FENCE, general active target synchronization with the procedures MPI_WIN_POST, MPI_WIN_START, MPI_WIN_COMPLETE, and MPI_WIN_WAIT, and passive target synchronization with the procedures MPI_WIN_LOCK, MPI_WIN_LOCK_ALL, MPI_WIN_UNLOCK, and MPI_WIN_UNLOCK_ALL.
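
As an illustration of general active target synchronization, the following C sketch (assuming at least two MPI processes) lets rank 0 open an access epoch with MPI_WIN_START and write into the window of rank 1, which opens the matching exposure epoch with MPI_WIN_POST and waits for its end with MPI_WIN_WAIT.

#include <mpi.h>

int main(int argc, char **argv){
  int rank, size, buf = 0;
  MPI_Win win;
  MPI_Group world_group, peer_group;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Comm_group(MPI_COMM_WORLD, &world_group);

  /* every process exposes one integer through a window */
  MPI_Win_create(&buf, sizeof(int), sizeof(int),
                 MPI_INFO_NULL, MPI_COMM_WORLD, &win);

  if (rank == 0 && size > 1) {
    /* origin side: access epoch on the window of rank 1 */
    int target = 1, value = 99;
    MPI_Group_incl(world_group, 1, &target, &peer_group);
    MPI_Win_start(peer_group, 0, win);
    MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    MPI_Win_complete(win);
    MPI_Group_free(&peer_group);
  } else if (rank == 1) {
    /* target side: exposure epoch for accesses from rank 0 */
    int origin = 0;
    MPI_Group_incl(world_group, 1, &origin, &peer_group);
    MPI_Win_post(peer_group, 0, win);
    MPI_Win_wait(win);
    MPI_Group_free(&peer_group);
  }

  MPI_Group_free(&world_group);
  MPI_Win_free(&win);
  MPI_Finalize();
  return 0;
}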

Assertions

The procedures MPI_WIN_POST, MPI_WIN_START, MPI_WIN_FENCE, MPI_WIN_LOCK and MPI_WIN_LOCK_ALL can take assertion arguments, which tell NEC MPI the context in which they are called. Valid assertions are as follows:

These assertions never change the program semantics, and the value 0 can be specified to assert no conditions. Table 2-1 shows the meaning of each assertion and whether NEC MPI uses it.

Table 2-1   Assertions for One-Sided Communication

MPI_WIN_POST
  MPI_MODE_NOCHECK: The corresponding calls to the procedure MPI_WIN_START have not yet occurred. MPI_MODE_NOCHECK has to be specified for both procedures MPI_WIN_POST and MPI_WIN_START. (NEC MPI: used)
  MPI_MODE_NOSTORE: After the previous synchronization, the local window has not been updated by local stores. (NEC MPI: ignored)
  MPI_MODE_NOPUT: After the post call, the local window will not be updated by the procedures MPI_PUT and MPI_ACCUMULATE. (NEC MPI: used)

MPI_WIN_START
  MPI_MODE_NOCHECK: The corresponding calls to the procedure MPI_WIN_POST have already completed. MPI_MODE_NOCHECK has to be specified for both procedures MPI_WIN_POST and MPI_WIN_START. (NEC MPI: used)

MPI_WIN_FENCE
  MPI_MODE_NOSTORE: After the previous synchronization, the local window has not been updated by local stores. (NEC MPI: ignored)
  MPI_MODE_NOPUT: The local window will not be updated by the procedures MPI_PUT and MPI_ACCUMULATE until the next synchronization. (NEC MPI: used)
  MPI_MODE_NOPRECEDE: The synchronization does not complete any locally issued communication operations. MPI_MODE_NOPRECEDE has to be specified by all processes. (NEC MPI: used)
  MPI_MODE_NOSUCCEED: After the synchronization, no communication operations will be issued. MPI_MODE_NOSUCCEED has to be specified by all processes. (NEC MPI: used)
  MPI_MODE_NEC_GETPUTALIGNED: Asserts that all one-sided communication operations are performed via direct memory copy; see the description. (NEC MPI: used)

MPI_WIN_LOCK, MPI_WIN_LOCK_ALL
  MPI_MODE_NOCHECK: No other process holds or will attempt to acquire a conflicting lock while the calling process holds the lock. (NEC MPI: used)

Example of One-Sided Communication

The following Fortran and C program fragments show examples of one-sided communication. In these examples, the data of the root process is transferred to the other processes in the communicator. The procedure MPI_GET is used for the data transfer, and the procedure MPI_WIN_FENCE is used for the synchronization.

Example:

SUBROUTINE PROPAGATE(BUFF,N)
use mpi
INTEGER ROOT, IERROR, RANK, DISP
! Define the type of a window object, WIN
INTEGER WIN
INTEGER BUFF(N)

INTEGER(KIND=MPI_ADDRESS_KIND) OFFSET, WINSIZE
DISP=4              ! displacement unit: size of a default INTEGER in bytes
WINSIZE=DISP*N      ! window size in bytes
OFFSET=0            ! displacement of the target data in the window
ROOT=0

CALL MPI_COMM_RANK(MPI_COMM_WORLD, RANK, IERROR)
! Create memory window for process root and all other processes in the communicator
IF(RANK==ROOT) THEN
  CALL MPI_WIN_CREATE(BUFF, WINSIZE, DISP, MPI_INFO_NULL, MPI_COMM_WORLD,  &
  WIN, IERROR)
ELSE
  CALL MPI_WIN_CREATE(MPI_BOTTOM, 0_MPI_ADDRESS_KIND, DISP, MPI_INFO_NULL, MPI_COMM_WORLD, &
WIN, IERROR)
ENDIF

! All other processes get data from process root
CALL MPI_WIN_FENCE(MPI_MODE_NOPRECEDE, WIN, IERROR)
IF(RANK/=ROOT) THEN
  CALL MPI_GET(BUFF, N, MPI_INTEGER, ROOT, OFFSET, N, MPI_INTEGER, WIN, IERROR)
ENDIF
! Complete the one-sided communication operation
CALL MPI_WIN_FENCE(MPI_MODE_NOSUCCEED, WIN, IERROR)
! Free the window
CALL MPI_WIN_FREE(WIN, IERROR)

RETURN
END SUBROUTINE PROPAGATE

#include <mpi.h>

int propagate(int *buf, int n, int root, MPI_Comm comm){
/* Define the type of a window object, win */  
MPI_Win win;
int unit = sizeof(int); /* displacement unit of the window */
int rank;

MPI_Comm_rank(comm, &rank);
/* Create memory window for process root and all other processes in the communicator */
if (rank == root)
MPI_Win_create(buf, unit * n, unit, MPI_INFO_NULL, comm, &win);
else
MPI_Win_create(NULL, 0, unit, MPI_INFO_NULL, comm, &win);

/* All other processes get data from process root */
MPI_Win_fence(MPI_MODE_NOPRECEDE, win);
if (rank != root)
MPI_Get(buf, n, MPI_INT, root, 0, n, MPI_INT, win);
/* Complete the one-sided communication operation */
MPI_Win_fence(MPI_MODE_NOSUCCEED, win);
/* Free the window */
MPI_Win_free(&win);
return MPI_SUCCESS;
}


2.6   Parallel I/O

2.6.1   Overview

MPI defines a parallel I/O facility, called MPI-IO, so that MPI processes can access the same file concurrently. MPI-IO enables highly optimized I/O operations by, for example, expressing the file access pattern of each MPI process with MPI datatypes and providing collective I/O operations.

In addition, even if the data to be accessed by an MPI process is actually stored non-contiguously in a file, the data can be viewed as a sequential stream during I/O operations, without any awareness of the actual data location, just as in the case of MPI communication.


2.6.2   View

When file I/O operations are performed using MPI-IO, each MPI process has to declare in advance which segments of a file it will access. The segments to be accessed are called a view. A view is expressed by an MPI datatype, either basic or derived, and can therefore express various access patterns easily.


2.6.3   Principal Procedures

In order to use the MPI-IO facility, the following operations must be performed.

  1. Open
  2. Set a view
  3. I/O operations
  4. Close
Setting a view is in general performed just after a file is opened. It is also possible to change the view of an opened file later during execution. The procedure MPI_FILE_OPEN is used to open a file, the procedure MPI_FILE_CLOSE to close a file, and the procedure MPI_FILE_SET_VIEW to set a view. These procedures are all collective operations. If a derived MPI datatype instead of a basic MPI datatype is used for a view, the datatype has to be constructed and committed beforehand.
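
A minimal C sketch of the four steps above, assuming a hypothetical file name "view.dat": each process sets a view that starts at its own displacement, so that writes with the individual file pointer land in disjoint regions of the common file.

#include <mpi.h>

int main(int argc, char **argv){
  int rank, buf[4];
  MPI_File fh;
  MPI_Offset disp;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  for (int i = 0; i < 4; i++) buf[i] = rank;

  /* 1. Open (collective): "view.dat" is a hypothetical file name */
  MPI_File_open(MPI_COMM_WORLD, "view.dat",
                MPI_MODE_WRONLY | MPI_MODE_CREATE, MPI_INFO_NULL, &fh);

  /* 2. Set a view: each process sees the file starting at its own
        displacement, so the processes write to disjoint regions */
  disp = (MPI_Offset)rank * 4 * sizeof(int);
  MPI_File_set_view(fh, disp, MPI_INT, MPI_INT, "native", MPI_INFO_NULL);

  /* 3. I/O operation with the individual file pointer (collective variant) */
  MPI_File_write_all(fh, buf, 4, MPI_INT, MPI_STATUS_IGNORE);

  /* 4. Close (collective) */
  MPI_File_close(&fh);

  MPI_Finalize();
  return 0;
}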

MPI-IO provides various data access procedures; an independent procedure exists for every combination of the following aspects.

  1. Input or Output
  2. Positioning
  3. Synchronism
  4. Coordination
Table 2-2 shows the data access procedures for all the combinations.

Table 2-2   Data Access Procedures

Positioning: Explicit Offsets
  Blocking
    Non-Collective: MPI_FILE_READ_AT, MPI_FILE_WRITE_AT
    Collective:     MPI_FILE_READ_AT_ALL, MPI_FILE_WRITE_AT_ALL
  Nonblocking
    Non-Collective: MPI_FILE_IREAD_AT, MPI_FILE_IWRITE_AT
    Collective:     MPI_FILE_IREAD_AT_ALL, MPI_FILE_IWRITE_AT_ALL
  Split Collective
    Non-Collective: N/A
    Collective:     MPI_FILE_READ_AT_ALL_BEGIN, MPI_FILE_READ_AT_ALL_END,
                    MPI_FILE_WRITE_AT_ALL_BEGIN, MPI_FILE_WRITE_AT_ALL_END

Positioning: Individual File Pointers
  Blocking
    Non-Collective: MPI_FILE_READ, MPI_FILE_WRITE
    Collective:     MPI_FILE_READ_ALL, MPI_FILE_WRITE_ALL
  Nonblocking
    Non-Collective: MPI_FILE_IREAD, MPI_FILE_IWRITE
    Collective:     MPI_FILE_IREAD_ALL, MPI_FILE_IWRITE_ALL
  Split Collective
    Non-Collective: N/A
    Collective:     MPI_FILE_READ_ALL_BEGIN, MPI_FILE_READ_ALL_END,
                    MPI_FILE_WRITE_ALL_BEGIN, MPI_FILE_WRITE_ALL_END

Positioning: Shared File Pointer
  Blocking
    Non-Collective: MPI_FILE_READ_SHARED, MPI_FILE_WRITE_SHARED
    Collective:     MPI_FILE_READ_ORDERED, MPI_FILE_WRITE_ORDERED
  Nonblocking
    Non-Collective: MPI_FILE_IREAD_SHARED, MPI_FILE_IWRITE_SHARED
    Collective:     N/A
  Split Collective
    Non-Collective: N/A
    Collective:     MPI_FILE_READ_ORDERED_BEGIN, MPI_FILE_READ_ORDERED_END,
                    MPI_FILE_WRITE_ORDERED_BEGIN, MPI_FILE_WRITE_ORDERED_END


2.6.4   File Name Prefixes in the Procedure MPI_FILE_OPEN

NEC MPI determines file system types using standard C library functions by default. Application programs can explicitly specify a file system type using a prefix in the file name given to the procedure MPI_FILE_OPEN. NEC MPI currently accepts the prefixes "nfs:" and "stfs:", which indicate NFS and ScaTeFS, respectively.
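
For example, the following C sketch (with a hypothetical file path) forces the file to be treated as residing on NFS regardless of the automatic detection.

#include <mpi.h>

int main(int argc, char **argv){
  MPI_File fh;

  MPI_Init(&argc, &argv);
  /* the "nfs:" prefix selects the NFS handling; the file path is hypothetical */
  MPI_File_open(MPI_COMM_WORLD, "nfs:/home/user/output.dat",
                MPI_MODE_RDWR | MPI_MODE_CREATE, MPI_INFO_NULL, &fh);
  MPI_File_close(&fh);
  MPI_Finalize();
  return 0;
}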


2.6.5   Example Program

The following Fortran and C programs show examples of using MPI-IO. Each process writes its data to a common file named output.dat through the shared file pointer with the procedure MPI_FILE_WRITE_SHARED, and then the rank-0 process reads the written data back with an individual file pointer.

Example:

PROGRAM MAIN
use mpi
IMPLICIT NONE
INTEGER :: IERR
INTEGER :: SBUF (4)
INTEGER  :: RBUF(16)
INTEGER :: N
INTEGER :: MYRANK
INTEGER :: NPROCS
! Define the type of a file handle, FH
INTEGER :: FH
INTEGER(KIND=MPI_OFFSET_KIND) :: DISP = 0
INTEGER(KIND=MPI_OFFSET_KIND) :: OFFSET = 0

CALL MPI_INIT(IERR)
CALL MPI_COMM_RANK(MPI_COMM_WORLD, MYRANK, IERR)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NPROCS, IERR)
SBUF = MYRANK
RBUF = -1
N = 4

print *, MYRANK, ": SBUF =>", SBUF
! Open a common file with a name "output.dat"
CALL MPI_FILE_OPEN(MPI_COMM_WORLD, "output.dat", MPI_MODE_RDWR+MPI_MODE_CREATE, & 
MPI_INFO_NULL, FH, IERR)
! Set a process's file view
CALL MPI_FILE_SET_VIEW(FH, DISP, MPI_INTEGER, MPI_INTEGER, "native", MPI_INFO_NULL, IERR)

! Write data starting from the current location of the shared file pointer
CALL MPI_FILE_WRITE_SHARED(FH, SBUF, N, MPI_INTEGER, MPI_STATUS_IGNORE, IERR)
! Cause all previous writes to be transferred to the storage device 
CALL MPI_FILE_SYNC(FH, IERR)
IF(MYRANK == 0) THEN
! Move an individual file pointer to the location in the file from which 
! a process needs to read data 
  CALL MPI_FILE_SEEK(FH, OFFSET, MPI_SEEK_SET, IERR)
! Read data from the current location of the individual file pointer in the file
  CALL MPI_FILE_READ(FH, RBUF, N*min(NPROCS,4), MPI_INTEGER, MPI_STATUS_IGNORE, IERR)
  print *, MYRANK, ": RBUF =>", RBUF
ENDIF
! Close the common file
CALL MPI_FILE_CLOSE(FH, IERR)
CALL MPI_FINALIZE(IERR)

END PROGRAM  

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
  int sbuf[4];
  int rbuf[16];
  int n;
  int myrank;
  int nprocess;
  int disp = 0;
  int offset = 0;
  int tmp;
  /* Define the type of a file handle, fh */
  MPI_File fh;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocess);

  for(int i=0; i<4; i++) sbuf[i] = myrank;
  for(int i=0; i<16; i++) rbuf[i] = -1;
  n = 4;

  printf("%d:sbuf=>", myrank);
  for(int i=0; i<4; i++) printf("%d ", sbuf[i]);
  printf("\n");

  /* Open a common file with a name "output.dat" */
  MPI_File_open(MPI_COMM_WORLD, "output.dat",
      MPI_MODE_RDWR|MPI_MODE_CREATE, MPI_INFO_NULL, &fh);
  /* Set a process's file view */
  MPI_File_set_view(fh, disp, MPI_INT, MPI_INT, "native", MPI_INFO_NULL);

  /* Write data starting from the current location of the shared file pointer */ 
  MPI_File_write_shared(fh, sbuf, n, MPI_INT, MPI_STATUS_IGNORE);
  /* Cause all previous writes to be transferred to the storage device  */
  MPI_File_sync(fh);

  if (myrank==0) {
    printf("%d READ ALL FROM FILE\n", myrank);
    /* Move an individual file pointer to the location in the file from which
       a process needs to read data */
    MPI_File_seek(fh, offset, MPI_SEEK_SET);
    if (nprocess < 4) {
      tmp = nprocess;
    } else {
      tmp = 4;
    }
    /* Read data from the current location of the individual file pointer in the file */
    MPI_File_read(fh, rbuf, n*tmp, MPI_INT, MPI_STATUS_IGNORE);
    printf("%d: rbuf=>", myrank);
    for(int i=0; i<16; i++) printf("%d ", rbuf[i]);
    printf("\n");
  }
  /* Close the common file */
  MPI_File_close(&fh);
  MPI_Finalize();
  return 0;
}


2.6.6   Acceleration of Parallel I/O

The parallel I/O performance can be improved by using ScaTeFS VE direct IB Library or Accelerated I/O function when executing MPI programs on VE. The usage conditions and usage examples are as follows.


2.7   Thread Execution Environment

MPI defines the following levels of thread support: MPI_THREAD_SINGLE, MPI_THREAD_FUNNELED, MPI_THREAD_SERIALIZED, and MPI_THREAD_MULTIPLE.

If processes are multi-threaded, they use the procedure MPI_INIT_THREAD instead of the procedure MPI_INIT for the initialization of MPI. In addition to the arguments of the procedure MPI_INIT, the procedure MPI_INIT_THREAD takes an argument that specifies the requested thread support level and an argument in which the provided thread support level is returned.

The thread support level in NEC MPI is MPI_THREAD_SERIALIZED. Therefore, all threads in an MPI process can make calls to MPI procedures, but users must ensure that two or more threads never call MPI procedures at the same time. In addition, if synchronization among threads is performed with, for example, a lock file or a shared variable instead of the standard synchronization mechanisms provided by shared-memory parallel processing models such as OpenMP or POSIX threads, cache coherency among threads may not be guaranteed.
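
A minimal C sketch of initialization with the procedure MPI_INIT_THREAD: the program requests MPI_THREAD_SERIALIZED, the level supported by NEC MPI, and checks the level actually provided.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv){
  int required = MPI_THREAD_SERIALIZED;
  int provided;

  /* request the desired thread support level and obtain the provided one */
  MPI_Init_thread(&argc, &argv, required, &provided);

  if (provided < required)
    printf("warning: requested thread support level is not available\n");

  /* ... threads may call MPI procedures, but not concurrently ... */

  MPI_Finalize();
  return 0;
}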


2.8   Error Handler

MPI defines an error handling mechanism that invokes a specified procedure (error handler) when an MPI procedure detects an error such as "invalid arguments".

An error handler can be attached to each communicator, to each file handle used for the MPI-IO, and to each window used for the one-sided communication, and the error handler is called when an error occurs in MPI communication.

In order to attach and use error handlers, they first have to be registered. The procedure MPI_XXX_CREATE_ERRHANDLER registers an error handler, where XXX is COMM for a communicator, WIN for a window, and FILE for a file handle. The procedure MPI_XXX_SET_ERRHANDLER then attaches the registered error handler, and the procedure MPI_XXX_GET_ERRHANDLER is used to obtain the error handler currently attached.
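
The following C sketch registers a user-defined error handler on the communicator MPI_COMM_WORLD; the handler name and its behavior (printing the error string and aborting) are illustrative only.

#include <mpi.h>
#include <stdio.h>

/* user-defined error handler (illustrative): print the error and abort */
static void my_handler(MPI_Comm *comm, int *errcode, ...)
{
  char msg[MPI_MAX_ERROR_STRING];
  int len;
  MPI_Error_string(*errcode, msg, &len);
  fprintf(stderr, "MPI error: %s\n", msg);
  MPI_Abort(*comm, *errcode);
}

int main(int argc, char **argv){
  MPI_Errhandler errh;

  MPI_Init(&argc, &argv);

  /* register the error handler and attach it to the communicator */
  MPI_Comm_create_errhandler(my_handler, &errh);
  MPI_Comm_set_errhandler(MPI_COMM_WORLD, errh);
  MPI_Errhandler_free(&errh);

  /* errors detected in communication on MPI_COMM_WORLD now invoke my_handler */

  MPI_Finalize();
  return 0;
}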


2.9   Attribute Caching

MPI provides a caching facility that allows MPI processes to attach arbitrary attributes to communicators, windows, and datatypes.

Before using attributes, it is necessary to create and register attribute keys with the procedure MPI_XXX_CREATE_KEYVAL, where XXX is COMM for a communicator, WIN for a window, and TYPE for a datatype. The procedure MPI_XXX_SET_ATTR attaches information, the procedure MPI_XXX_GET_ATTR retrieves the cached information, and the procedure MPI_XXX_DELETE_ATTR removes the cached information.
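
A minimal C sketch of attribute caching on a communicator, using the predefined null copy and delete callbacks; the key and the cached value are illustrative.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv){
  int keyval, flag;
  int value = 123;           /* illustrative data to cache */
  void *attr;

  MPI_Init(&argc, &argv);

  /* create and register an attribute key for communicators */
  MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, MPI_COMM_NULL_DELETE_FN,
                         &keyval, NULL);

  /* attach the address of "value" to MPI_COMM_WORLD under that key */
  MPI_Comm_set_attr(MPI_COMM_WORLD, keyval, &value);

  /* retrieve the cached information again */
  MPI_Comm_get_attr(MPI_COMM_WORLD, keyval, &attr, &flag);
  if (flag) printf("cached value: %d\n", *(int *)attr);

  /* remove the cached information and free the key */
  MPI_Comm_delete_attr(MPI_COMM_WORLD, keyval);
  MPI_Comm_free_keyval(&keyval);

  MPI_Finalize();
  return 0;
}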


2.10   Fortran Support

NEC MPI provides the following Fortran support methods: the include file mpif.h, the module mpi, and the module mpi_f08.

The support status of the module mpi_f08 depends on the compiler used and its version, as follows.


2.11   Info Object

The info object, MPI_INFO, is used, for example, to give optimization information to an MPI procedure. An MPI_INFO object consists of pairs of character strings; in each pair, one string is a key and the other is the value corresponding to the key. An MPI_INFO object is an opaque object and is accessed via a handle.

The procedure MPI_INFO_CREATE creates a new, empty MPI_INFO object and returns a handle to it. The procedure MPI_INFO_SET adds a (key, value) pair to the object, and the procedure MPI_INFO_GET retrieves the value associated with a key. The procedure MPI_INFO_DELETE removes a (key, value) pair from the object. Other procedures handling MPI_INFO are as follows: the procedure MPI_INFO_GET_VALUELEN returns the length of a value string, the procedure MPI_INFO_GET_NKEYS returns the number of currently defined keys, the procedure MPI_INFO_GET_NTHKEY returns the n-th key, the procedure MPI_INFO_DUP duplicates an object, and the procedure MPI_INFO_FREE frees an object and sets its handle to MPI_INFO_NULL.

One example of MPI_INFO object use is to provide optimization information for MPI-IO operations. In MPI-IO, an MPI_INFO object may be used to inform the MPI-IO system of the file access pattern, the characteristics of the underlying file system or storage device, or the internal buffer size suitable for an application, a storage device, or a file system.

The following key values of the MPI_INFO object are currently interpreted by NEC MPI:

cb_buffer_size
specifies the maximum size of the buffer that is used by the collective MPI I/O functions. If the key is not defined, the buffer size is set to 4,194,304 bytes. The value must be the same for all processes that belong to the file communicator.

cb_nodes
specifies the number of processes that actually perform collective MPI I/O such as the procedures MPI_FILE_WRITE_ALL and MPI_FILE_READ_ALL. If the key is not defined, the number of processes for the file communicator is used. The value must be the same for all processes that belong to the file communicator.

host
contains the host on which new processes are spawned. This value is used by the procedures MPI_COMM_SPAWN and MPI_COMM_SPAWN_MULTIPLE. If host is not specified, the new processes are spawned on the host on which the root process of a group of spawning processes is located. The rules for specifying the host are the same as those in the description.

vh
sh
specifies 1 as the value to spawn new processes on the VH or the scalar host. This value is used by the procedures MPI_COMM_SPAWN and MPI_COMM_SPAWN_MULTIPLE.

ve
contains the VE node number of the VE on which new processes are spawned. This value is used by the procedures MPI_COMM_SPAWN and MPI_COMM_SPAWN_MULTIPLE.

ind_rd_buffer_size
specifies the maximum size of the read buffer that is used by the MPI I/O functions in order to read non-contiguous data. If the key is not defined, the buffer size is set to 4,194,304 bytes.

ind_wr_buffer_size
specifies the maximum size of the write buffer that is used by the MPI I/O functions in order to write non-contiguous data. If the key is not defined, the buffer size is set to 4,194,304 bytes.

ip_port
contains the IP port number to establish a port. This value is interpreted by the procedure MPI_OPEN_PORT. If ip_port is not specified, the port number is selected by the system.

Note: ip_address and ip_port are used in the same manner as the bind(2-B) function.
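
As an illustration of these keys, the following C sketch creates an MPI_INFO object, sets the cb_buffer_size and cb_nodes keys described above, and passes the object to the procedure MPI_FILE_OPEN; the file name and the hint values are illustrative.

#include <mpi.h>

int main(int argc, char **argv){
  MPI_Info info;
  MPI_File fh;

  MPI_Init(&argc, &argv);

  /* create an empty info object and set (key, value) pairs as hints */
  MPI_Info_create(&info);
  MPI_Info_set(info, "cb_buffer_size", "8388608");
  MPI_Info_set(info, "cb_nodes", "4");

  /* pass the hints to the MPI-IO layer when the file is opened */
  MPI_File_open(MPI_COMM_WORLD, "output.dat",
                MPI_MODE_RDWR | MPI_MODE_CREATE, info, &fh);

  MPI_Info_free(&info);
  MPI_File_close(&fh);
  MPI_Finalize();
  return 0;
}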


2.12   Language Interoperability

When an MPI program is implemented in both Fortran and C/C++ (mixed-language programming), a handle for an MPI object (see Table 2-3) must be translated if the MPI object is shared by Fortran routines and C/C++ routines. C binding routines are provided to translate handles for MPI objects; there are no corresponding Fortran bindings.

For example, the following procedure translates a handle for an MPI communicator used in C language into a handle that can be used in Fortran language.

MPI_Fint MPI_COMM_C2F(MPI_Comm c_comm)

where c_comm refers to a communicator handle used in C, and the return value of the procedure MPI_COMM_C2F can be used as a handle in Fortran (MPI_Fint refers to the C datatype corresponding to the Fortran INTEGER type). The following procedure translates a handle for Fortran into one for C/C++.

MPI_Comm MPI_COMM_F2C(MPI_Fint f_comm)
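
A minimal C sketch of handle translation (MPI_Comm_c2f and MPI_Comm_f2c are the C spellings of the procedures MPI_COMM_C2F and MPI_COMM_F2C): the C handle of MPI_COMM_WORLD is converted to a Fortran handle, which could be passed to a Fortran routine, and then converted back.

#include <mpi.h>

int main(int argc, char **argv){
  MPI_Fint f_comm;
  MPI_Comm c_comm;

  MPI_Init(&argc, &argv);

  /* translate the C handle into a Fortran handle (a Fortran INTEGER) */
  f_comm = MPI_Comm_c2f(MPI_COMM_WORLD);

  /* a Fortran routine could now be called with f_comm as its communicator
     argument; the reverse translation recovers a C handle to the same object */
  c_comm = MPI_Comm_f2c(f_comm);
  (void)c_comm;

  MPI_Finalize();
  return 0;
}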

When an MPI communication status (MPI_Status) is translated, the following procedures are used; they take pointer arguments for the translation source and the result.

int MPI_STATUS_C2F(MPI_Status *c_status, MPI_Fint *f_status)
int MPI_STATUS_F2C(MPI_Fint *f_status, MPI_Status *c_status)

Table 2-3 shows a list of MPI objects and the corresponding procedures for translating handles.

Table 2-3   MPI Objects and Handle Translation Procedures

MPI Object           Datatype of MPI Object   From C to Fortran   From Fortran to C
Communicator         MPI_Comm                 MPI_COMM_C2F        MPI_COMM_F2C
Datatype             MPI_Datatype             MPI_TYPE_C2F        MPI_TYPE_F2C
Group                MPI_Group                MPI_GROUP_C2F       MPI_GROUP_F2C
Request              MPI_Request              MPI_REQUEST_C2F     MPI_REQUEST_F2C
File                 MPI_File                 MPI_FILE_C2F        MPI_FILE_F2C
Window               MPI_Win                  MPI_WIN_C2F         MPI_WIN_F2C
Reduction Operation  MPI_Op                   MPI_OP_C2F          MPI_OP_F2C
Info                 MPI_Info                 MPI_INFO_C2F        MPI_INFO_F2C
Status               MPI_Status               MPI_STATUS_C2F      MPI_STATUS_F2C
Message              MPI_Message              MPI_MESSAGE_C2F     MPI_MESSAGE_F2C


2.13   Nonblocking MPI Procedures in Fortran Programs

When an array argument that is not a simple contiguous array, such as an array expression or an array section, is passed as a communication buffer or I/O buffer to a nonblocking MPI procedure in a Fortran program, the Fortran compiler may allocate a work array, copy the argument into the work array, and pass the work array to the MPI procedure in order to guarantee the contiguity of array elements. In such a case, however, the data transfer or data I/O can be performed after the nonblocking MPI procedure has returned and the work array has been deallocated by the compiler, which can result in unexpected behavior such as lost data.

If such an array is passed to a nonblocking MPI procedure, users must explicitly allocate a temporary array and use it as the actual argument, as shown in the following examples.

  real :: A(1000)
  integer :: request

  call MPI_ISEND(A+10.0, ..., request, ...)   ! array expression,
    :                                         ! (not permitted here)
    :
  call MPI_WAIT(request, ...)

should be rewritten

  real :: A(1000)
  real,allocatable :: tmp(:)
  integer :: request

  allocate(tmp(1000))                       ! allocate user work array
  tmp = A + 10.0
  call MPI_ISEND(tmp, ..., request, ...)    ! MPI call with the user work array
    :
    :
  call MPI_WAIT(request, ...)
  deallocate(tmp)                           ! frees the user work array

Figure 2-1   An Example of User Work Array (for Send Operation)

  real :: A(1000)
  integer :: request

  call MPI_IRECV(A(1:1000:2), ...., request, ...) ! array section,
    :                                             ! (not permitted here)
    :
  call MPI_WAIT(request, ...)

should be rewritten

  real :: A(1000)
  real,allocatable :: tmp(:)
  integer :: request

  allocate(tmp(500))                      ! allocate user work array
  call MPI_IRECV(tmp, ..., request, ....) ! MPI call with the user work array
    :
    :
  call MPI_WAIT(request, ...)
  A(1:1000:2) = tmp                       ! copies the user work array
                                          ! to a target array
  deallocate(tmp)                         ! frees the user work array

Figure 2-2   An Example of User Work Array (for Receive Operation)
