VEDA (VE Driver API) and VERA (VE Runtime API) are a CUDA Driver and Runtime API-like APIs for hybrid programming. It is based on AVEO. Most of the functionality is identical to the CUDA Driver API and CUDA Runtime API.

Introduction

VEDA is a parallel computing platform and hybrid programming model. It enables Vector engine (VE) for general purpose computing in a simple and elegant manner. VEDA API's are inspired by the widely used CUDA Driver API. It builds upon AVEO and enables easy porting of existing CUDA (and other hybrid) applications to VE. VEDA uses CUDA's design principles and maps these onto the execution model of AVEO.

Using VEDA, a programmer can execute code on VE and can control the execution from VH main program.

Installation

Prerequisite

VEDA API are build upon AVEO, Hence as a prerequisite first please install AVEO.
For veda program execution, veoffload-aveorun package is required.
For veda program development, veoffload-aveo and veoffload-aveo-devel packages are required.
For installation of aveo packages refer link https://sxauroratsubasa.sakura.ne.jp/documents/veos/en/aveo/md_GettingStarted.html

Installing runtime package

To run programs, please install veoffload-veda and the runtime packages of the compiler (2.2.2 or later).

To install the packages to run programs by yum, execute the following command as root: On ve1

# yum install veoffload-veda-ve1

On ve3

# yum install veoffload-veda-ve3

Installing development package

To develop programs, veoffload-veda-devel and the development packages of the compiler (2.2.2 or later) are also required.

To install the packages to develop programs by yum, execute the following command as root:

# yum install veoffload-veda-devel

libveda.vso is installed and it is linked with the libnc++.so(VE1)/libc++.so(VE3). Shared objects to be loaded with VEDA must be linked with libnc++.so(VE1)/libc++.so(VE3). Please confirm /opt/nec/ve/ncc/<version>/lib/libnc++.so(VE1) / /opt/nec/ve/ncc/<version>/lib/libc++.so(VE3) is installed when you build shared objects using nc++. <version> is the version of nc++ you are using to build your shared objects. Please specify -stdlib=compat for VE1 to link libnc++.so. Please specify -stdlib=libc++ for VE3 to link libc++.

Hello World

First, let's try a "Hello, World!" program on VE.

The required number of hugepages for VEDA

VEDA requires HugePages for data transfer. The required number of HugePages 32 per VEDA thread context.

VE code

Code to run on VE is shown below. Standard C functions are available, hence, you can use printf(3).

#include <stdio.h>
void hello_world(void) {
        printf("Hello World!\n");
}

Save the above code as libvehello.vc.

A function on VE called via VEDA needs to return a 64-bit unsigned integer. A function on VE called via VEDA can have arguments as mentioned later.

Compile VE code5

VEDA supports a function in a shared library.

To execute a function on VE using VEDA, compile and link a source file into a binary for VE.

To build a shared library with the functions for dynamic loading, execute as follows:

/opt/nec/ve/bin/ncc -x c -fpic -I/opt/nec/ve/share/veda/include -o libvehello.o -c libvehello.vc

/opt/nec/ve/bin/ncc -shared -o libvehello.vso libvehello.o

VH main program

Main routine on VH side to run VE program is shown here.

A program using VEDA needs to include "veda.h" and "vera.h" for VERA. In the header, the prototypes of functions and constants for VEDA and VERA API are defined.

The example VH program to call a VE function in a dynamic library with VEDA:

#include <stdio.h>
#include <stdlib.h>
#include <veda.h>
int main(int argc, char** argv) {
        printf("Hello World from Host!\n");
        vedaInit(0);
        VEDAcontext ctx;
        vedaDevicePrimaryCtxRetain(&ctx, 0);
        vedaCtxPushCurrent(ctx);
        VEDAmodule mod;
        vedaModuleLoad(&mod, "./libvehello.vso");
        VEDAfunction func;
        vedaModuleGetFunction(&func, mod, "hello_world");
        VEDAargs args;
        vedaArgsCreate(&args);
        vedaLaunchKernelEx(func, 0, args, 1, NULL);
        vedaCtxSynchronize();
        vedaExit();
        return 0;
}

Save the above code as hello.c

API vedaInit() is called to initialized VE devices, the initialized VE devices may be termed as VEDA device.
VE process is created on the VEDA device and the handle to the VE process is returned to the VH process. The returned handle to the VE process is termed as the VEDA context in the VEDA hybrid program.
vedaDevicePrimaryCtxRetain() retain the primary Context on the VEDA device or create a new context incase of no context and vedaCtxPushCurrent() pushes a context on the current CPU thread.
Internally VEDA context instantiate the VE threads on the VE device to execute the VE functions. These VE threads are termed as VEDA Streaming Multiprocessor(SM) throughout this mechanism.
VE compiled device code(libvehello.vc) is loaded into the VE memory by calling vedaModuleLoad(), which loads a VE shared library.
VE device address of the VE device function “hello_world” is retrieved by calling vedaModuleGetFunction() function. Note VE device function is termed as VEDA device function in VEDA hybrid program.
By calling vedaLaunchKernelEx(), VEDA hybrid program submits the request for the execution of the VEDA device function to VEDA streaming Multiprocessor(SM). It may be termed as launching of the VEDA device function. VEDA arguement would be destroyed after VEDA device function is called as fourth argument is set to 1.
Execution of the VH programs is blocked by calling vedaCtxSynchronize() since the execution of the “hello_world” program is finished, this is termed as VEDA streaming multiprocessor(SM) synchronization.
API vedaExit() releases the VEDA driver API library gracefully and perform the proper cleanup of the VEDA driver library.

Compile VH main program

Compile source code on VH side as shown below.

$ gcc -o hello hello.c -I/opt/nec/ve/share/veda/include -L/opt/nec/ve/veos/lib64 \

-rdynamic -Wl,-rpath,/opt/nec/ve/veos/lib64 -lveda

The headers for VEDA and VERA are installed in /opt/nec/ve/veos/include. libveda and libvera, the shared library of VEDA and VERA, is in /opt/nec/ve/veos/lib64.

Run a program with VEDA

Execute the compiled VEDA program.

$ ./hello

Hello, world!

Various arguments for a VE function

You can pass one or more arguments to a function on VE. To specify arguments, VEDA arguments object is used. A VEDA argument object is created by vedaArgsCreate(). When a VEDA argument object is created, the VEDA argument object is empty, without any arguments passed. Even if a VE function has no arguments, a VEDA arguments object is still necessary.

VEDA provides functions to set an argument in various types.

Basic Types

To pass an integer value, the following functions are used.

VEDAresult vedaArgsSetI64 (VEDAargs args, const int idx, const int64_t value);
VEDAresult vedaArgsSetU16 (VEDAargs args, const int idx, const uint16_t value);
VEDAresult vedaArgsSetI32 (VEDAargs args, const int idx, const int32_t value);
VEDAresult vedaArgsSetU32 (VEDAargs args, const int idx, const uint32_t value);
VEDAresult vedaArgsSetI16 (VEDAargs args, const int idx, const int16_t value);
VEDAresult vedaArgsSetU16 (VEDAargs args, const int idx, const int16_t value);
VEDAresult vedaArgsSetU8 (VEDAargs args, const int idx, const uint8_t value);
VEDAresult vedaArgsSetI8 (VEDAargs args, const int idx, const int8_t value);

You can pass also a floating point number argument.

VEDAresult vedaArgsSetF32 (VEDAargs args, const int idx, const float value);

VEDAresult vedaArgsSetF64 (VEDAargs args, const int idx, const double value);

For instance: suppose that veda device is initialized and func(int, double) is defined in a VE library whose handle is func1.

VEDAargs args;
vedaArgsCreate(&args);
vedaArgsSetI32(args, 0, 1);
vedaArgsSetF64(args, 1, 2.0);
VEDAmodule mod;
vedaModuleLoad(&mod, "./libvehello.vso"));
VEDAfunction func1;
vedaModuleGetFunction(&func1, mod, "func");
vedaLaunchKernel(func1, 0, args);

In this case, func(1, 2.0) is called on VE.

Stack Arguments

Non basic typed arguments and arguments by reference are put on a stack. VEDA supports an argument on a stack.

To set a stack argument to a VEDA arguments object, call vedaArgsSetStack().

VEDAresult vedaArgsSetStack (VEDAargs args, const int idx, void* ptr,

VEDAargs_intent intent, const size_t size);

The fourth argument specifies the argument is for input and/or output.

VEDA_ARGS_INTENT_IN: the argument is for input; data is copied into a VE stack on call.
VEDA_ARGS_INTENT_OUT: the argument is for output; a VE stack area is allocated without copy-in and data is copied out to VH memory on completion.
VEDA_ARGS_INTENT_INOUT: the argument is for both input and output; data is copied into and out from a VE stack area.

Fortran program

VE code

Code written by Fortran to run on VE is shown below.

SUBROUTINE sub1(x, ret)
  implicit none
  INTEGER, INTENT(IN) :: x
  INTEGER, INTENT(OUT) :: ret
  ret = x + 1
END SUBROUTINE SUB1

Save the above code as libvefortran.vf90.

Compile VE code

To build a shared library with the functions for dynamic loading, execute as follows:

$/opt/nec/ve/bin/nfort -x f95 -I/opt/nec/ve/share/veda/include -c -o libvefortran.o libvefortran.vf90

$/opt/nec/ve/bin/nfort -shared -fpic -o libvefortran.vso libvefortran.o

VH main program

Main routine on VH side to run VE program written by Fortran is shown here.

The example VH program to call a VE Fortran function in a dynamically linked executable:

#include <stdio.h>
#include <stdlib.h>
#include <veda.h>
#define VEDA(err) check(err, __FILE__, __LINE__)
void check(VEDAresult err, const char* file, const int line) {
        if(err != VEDA_SUCCESS) {
                const char *name, *str;
                vedaGetErrorName        (err, &name);
                vedaGetErrorString      (err, &str);
                printf("%s: %s @ %s:%i\n", name, str, file, line);
                exit(1);
        }
}
int main(int argc, char** argv) {
        printf("Hello World from Host!\n");
        VEDA(vedaInit(0));
        VEDAcontext ctx;
        VEDA(vedaDevicePrimaryCtxRetain(&ctx, 0));
        VEDA(vedaCtxPushCurrent(ctx));
        VEDAmodule mod;
        VEDA(vedaModuleLoad(&mod, "./libvefortran.vso"));
        VEDAfunction func;
        VEDA(vedaModuleGetFunction(&func, mod, "sub1_"));
        VEDAargs args;
        VEDA(vedaArgsCreate(&args));
        size_t x = 42;
        size_t y = 1;
        VEDA(vedaArgsSetStack(args, 0, &x, VEDA_ARGS_INTENT_IN, sizeof(x)));
        VEDA(vedaArgsSetStack(args, 1, &y, VEDA_ARGS_INTENT_OUT, sizeof(y)));
        VEDA(vedaLaunchKernel(func, 0, args));
        VEDA(vedaArgsDestroy(args));
        VEDA(vedaCtxSynchronize());
        printf("SUB1 return %lu\n",y);
        VEDA(vedaExit());
        return 0;
}

Save the above code as fortran.c.

API vedaInit() is called to initialized VE devices, the initialized VE devices may be termed as VEDA device.
VE process is created on the VEDA device and the handle to the VE process is returned to the VH process. The returned handle to the VE process is termed as the VEDA context in the VEDA hybrid program.
vedaDevicePrimaryCtxRetain() retain the primary Context on the VEDA device or create a new context incase of no context and vedaCtxPushCurrent() pushes a context on the current CPU thread.
Internally VEDA context instantiate the VE threads on the VE device to execute the VE functions. These VE threads are termed as VEDA Streaming Multiprocessor(SM) throughout this mechanism.
VE compiled device code(libvefortran.vf90) is loaded into the VE memory by calling vedaModuleLoad(), which loads a VE shared library.
VE device address of the VE device function “sub1_” is retrieved by calling vedaModuleGetFunction() function. Note VE device function is termed as VEDA device function in VEDA hybrid program.
API vedaArgsCreate() instantiate the VEDA function argument handler.
API vedaArgsSetStack() initialize the VEDA function argument to point the buffer on stack. Where x is treated as the Input buffer to the VEDA device function.
API vedaArgsSetStack() initialize the VEDA function argument to point the buffer on stack. Where y is treated as the Output buffer, some output is expected.
By calling vedaLaunchKernel(), VEDA hybrid program submits the request for the execution of the VEDA device function to VEDA streaming Multiprocessor(SM). It may be termed as launching of the VEDA device function.
Execution of the VH programs is blocked by calling vedaCtxSynchronize() since the execution of the “sub1_” program is finished, this is termed as VEDA streaming multiprocessor(SM) synchronization.
API vedaExit() releases the VEDA driver API library gracefully and perform the proper clean-up of the VEDA driver library.

For passing arguments to VE Fortran function, please use vedaArgsSetStack() to pass arguments as stack arguments. However, for passing arguments to arguments with VALUE attribute in Fortran function, please pass arguments by value in the same way as VE C function.

When you want to load VE Fortran function by vedaModuleGetFunction() with the name of a Fortran function, please change the name of the Fortran function to lowercase, and add "_" at the end of the function name.

Taking libvefortran.vf90 and fortran.c as an example, pass "sub1_" as an argument to vedaModuleGetFunction() in fortran.c when calling the Fortran function named "SUB1" in libvefortran.f90.

The method of compiling and running VH main program are same as C program.

Compile VH main program

Compile source code on VH side as shown below. This is the same as the compilation method described above.

$ gcc -o fortran fortran.c -I/opt/nec/ve/share/veda/include -L/opt/nec/ve/veos/lib64 \

-rdynamic -Wl,-rpath,/opt/nec/ve/veos/lib64 -lveda

Run a program with VEDA

Execute the compiled VEDA program. This is also the same as the execution method described above.

$ ./fortran

SUB1 return 43

Parallelized program using OpenMP

OpenMP in C

VE code using OpenMP in C

The following is an example of VE code using OpenMP written in C.

#include <stdio.h>
int omp_hello(void)
{
  int tid, nthreads = 0;
#pragma omp parallel private(nthreads, tid)
  {
    tid = omp_get_thread_num();
    printf("Hello, World! from thread = %d\n", tid);
    if (tid == 0)
    {
      nthreads = omp_get_num_threads();
      printf("Number of threads = %d\n", nthreads);
    }
  }  /* All threads join master thread and disband */
  fflush(stdout);
  return 0;
}

Save the above code in libomphello.vc.

How to build VE code

To use OpenMP parallelization, specify -fopenmp at compilation and linking.

Here is an example of building VE code written in C.

To build a shared library, execute as follows:

$/opt/nec/ve/bin/ncc -x c -fpic -I/opt/nec/ve/share/veda/include -o libomphello.o -c libomphello.vc -fopenmp

$/opt/nec/ve/bin/ncc -shared -o libomphello.vso libomphello.o

Compile VH main program

Example omphello.c which calls the above omp VE code,

Compile source code on VH side as shown below.

$ gcc -o omphello omphello.c -I/opt/nec/ve/share/veda/include -L/opt/nec/ve/veos/lib64 \

-rdynamic -Wl,-rpath,/opt/nec/ve/veos/lib64 -lveda

Execute the compiled omp VEDA program.

$ ./omphello
Hello World from Host!
Hello, World! from thread = 4
Hello, World! from thread = 2
Hello, World! from thread = 0
Hello, World! from thread = 1
Hello, World! from thread = 5
Hello, World! from thread = 3
Hello, World! from thread = 7
Hello, World! from thread = 6
Number of threads = 8

OpenMP in Fortran

VE code using OpenMP in Fortran

The following shows the example written in Fortran.

INTEGER FUNCTION OMP_HELLO()
  INTEGER :: TID = 0
  INTEGER :: NTHREADS = 0
!$OMP PARALLEL PRIVATE(TID, NTHREADS)
  TID = omp_get_thread_num()
  WRITE(*,*) "Hello, World! from thread = ", TID
  IF ( TID == 0 ) THEN
    NTHREADS = omp_get_num_threads()
    OMP_HELLO = NTHREADS
    WRITE(*,*) "Number of threads = ", NTHREADS
  END IF
!$OMP END PARALLEL
END FUNCTION OMP_HELLO

Save the above code in libompfortran.vf90.

How to build VE code

To use OpenMP parallelization, specify -fopenmp at compilation and linking.

Here is an example of building VE code written in Fortran.

To build a shared library, execute as follows:

$/opt/nec/ve/bin/nfort -x f95 -fpic -I/opt/nec/ve/share/veda/include -o libompfortran.o -c libompfortran.vf90 -fopenmp

$/opt/nec/ve/bin/nfort -shared -o libompfortran.vso libompfortran.o

To build code written in Fortran, change the compiler to nfort.

Compile VH main program

Example omphellofortran.c which calls the above omp VE code,

Compile source code on VH side as shown below.

$ gcc -o omphellofortran omphellofortran.c -I/opt/nec/ve/share/veda/include -L/opt/nec/ve/veos/lib64 \

-rdynamic -Wl,-rpath,/opt/nec/ve/veos/lib64 -lveda

Execute the compiled omp VEDA program.

$ ./omphellofortran
Hello World from Host!
Hello, World! from thread = 4
Hello, World! from thread = 2
Hello, World! from thread = 0
Hello, World! from thread = 1
Hello, World! from thread = 5
Hello, World! from thread = 3
Hello, World! from thread = 7
Hello, World! from thread = 6
Number of threads = 8

MPI program with VEDA

A MPI program on VH side can call VEDA API to call VE device functions.

VE code

Code to run on VE is shown below.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdint.h>
uint64_t
init( int *p, int v, int n )
{
    printf( "VE init : %d %d %p\n", v, n, p );
    fflush( stdout );
    int *ve;
    vedaHMemPtr(&ve, p);
    int i;
    for ( i = 0; i < n; i++ ) ve[i] = v;
    return (uint64_t)ve;
}
uint64_t
check( int *p, int v, int n )
{
    printf( "VE init : %d %d %p\n", v, n, p );
    fflush( stdout );
    int *ve;
    vedaHMemPtr(&ve, p);
    int i;
    for ( i = 0; i < n; i++ ) {
       if ( ve[i] != v ) break;
    }
    return i < n ? ~0ULL : 0ULL;
}

Save the above code as libmpive.vc.

Compile VE code

To execute a function on VE using VEDA, compile and link a source file into a binary for VE. To build a shared library with the functions for dynamic loading, execute as follows:

/opt/nec/ve/bin/ncc -x c -fpic -I/opt/nec/ve/share/veda/include -o libmpive.o -c libmpive.vc

/opt/nec/ve/bin/ncc -shared -o libmpive.vso libmpive.o

VH MPI program

Main routine on VH side to run VE program written by C is shown here. The example VH program to call a VE C function in a dynamically linked executable:

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>
#include <veda.h>
#define CHECK(err) check(err, __FILE__, __LINE__)
#define MAX_LEN (1024*1024)
void check(VEDAresult err, const char* file, const int line) {
        if(err != VEDA_SUCCESS) {
                const char* name = 0;
                vedaGetErrorName(err, &name);
                printf("Error: %i %s @ %s (%i)\n", err, name, file, line);
                assert(0);
                exit(1);
        }
}
int
main()
{
    int myrank;
    int nprocs;
    int nelems = MAX_LEN;
    int dev,cnt;
    uint64_t rc;
    uint64_t id;
    MPI_Init( 0, 0 );
    MPI_Comm_rank( MPI_COMM_WORLD, &myrank );
    MPI_Comm_size( MPI_COMM_WORLD, &nprocs );
    printf( "myrank = %d, nprocs = %d, started\n", myrank, nprocs );
    fflush( stdout );
    CHECK(vedaInit(0));
    VEDAcontext cont;
    CHECK(vedaDevicePrimaryCtxRetain(&cont, 0));
    CHECK(vedaCtxPushCurrent(cont));
    CHECK(vedaCtxStreamCnt(&cnt));
    printf( "myrank = %d, nprocs = %d, context = %16p, created\n", myrank, nprocs, cont );
    VEDAmodule mod;
    const char* modName = "./libmpive.vso";
    CHECK(vedaModuleLoad(&mod, modName));
    //
    // Allocate VE memory
    //
    VEDAhmemptr vebuf;
    CHECK(vedaHMemAlloc(&vebuf, sizeof(int) * nelems));
    printf( "myrank = %d, nprocs = %d, vebuf = %16p, allocated\n", myrank, nprocs, vebuf );
    //
    // Call VE function 'init' to initialize VE memory
    //
    VEDAargs args1;
    CHECK(vedaArgsCreate(&args1));
    CHECK(vedaArgsSetHMEM(args1, 0, vebuf));
    CHECK(vedaArgsSetI32(args1, 1, myrank));
    CHECK(vedaArgsSetI32(args1, 2, nelems));
    VEDAfunction func1;
    const char* funcName1 = "init";
    CHECK(vedaModuleGetFunction(&func1, mod, funcName1));
    CHECK(vedaLaunchKernel(func1, 0, args1));
    rc = vedaCtxSynchronize();
    printf( "myrank = %d, nprocs = %d, rc=%d, VE offload\n", myrank, nprocs, rc );
    int peer = myrank^1;
    if ( peer >= nprocs ) peer = MPI_PROC_NULL;
    //
    // Call VE function 'check' to check initialized VE memory
    //
    VEDAargs args2;
    CHECK(vedaArgsCreate(&args2));
    CHECK(vedaArgsSetHMEM(args2, 0, vebuf));
    CHECK(vedaArgsSetI32(args2, 1, myrank));
    CHECK(vedaArgsSetI32(args2, 2, nelems));
    VEDAfunction func2;
    const char* funcName2 = "check";
    CHECK(vedaModuleGetFunction(&func2, mod, funcName1));
    CHECK(vedaLaunchKernel(func2, 0, args2));
    rc = vedaCtxSynchronize();
    printf( "myrank = %d, nprocs = %d, rc=%d, VE offload\n", myrank, nprocs, rc );
    MPI_Status        status;
    MPI_Barrier(MPI_COMM_WORLD);
    //
    // MPI calls on VH with VE memory buffers
    //
    if ( myrank%2 == 0) {
        MPI_Send((void *)vebuf, MAX_LEN, MPI_INT, peer, 0, MPI_COMM_WORLD);
    }
    else {
        MPI_Recv((void *)vebuf, MAX_LEN, MPI_INT, peer, 0, MPI_COMM_WORLD, &status );
    }
    MPI_Barrier(MPI_COMM_WORLD);
    //
    // Call VE function 'check' to check VE memory after MPI calls
    //
    VEDAargs args3;
    CHECK(vedaArgsCreate(&args3));
    CHECK(vedaArgsSetHMEM(args3, 0, vebuf));
    CHECK(vedaArgsSetI32(args3, 1, myrank));
    CHECK(vedaArgsSetI32(args3, 2, nelems));
    CHECK(vedaLaunchKernel(func2, 0, args3)); //func2 is check()
    rc = vedaCtxSynchronize();
    printf( "myrank = %d, nprocs = %d, rc=%d, VE offload\n", myrank, nprocs, rc );
    // Verify
    uint64_t result;
    MPI_Reduce( &rc, &result, 1, MPI_LONG_LONG, MPI_SUM, 0, MPI_COMM_WORLD );
    if ( myrank == 0 ) {
        printf( "Result : %s\n", result ? "Failed" : "Success" );
        fflush( stdout );
    }
    vedaArgsDestroy(args1);
    vedaArgsDestroy(args2);
    vedaArgsDestroy(args3);
    CHECK(vedaHMemFree(vebuf));
    CHECK(vedaCtxDestroy(cont));
    MPI_Finalize();
    vedaExit();
    return 0;
}

Save the above code as mpi_veda.c

Compile VH MPI program

Compile source code on VH side as shown below.

source /opt/nec/ve/mpi/3.0.0/bin/necmpivars.sh gnu
mpincc -vh mpi_veda.c -o mpi_veda -I/opt/nec/ve/share/veda/include  -L/opt/nec/ve/veos/lib64 \
-rdynamic -Wl,-rpath,/opt/nec/ve/veos/lib64 -lveda

Run the program using MPI

Execute the compiled VEDA program.

VE_OMP_NUM_THREADS=1 mpirun -veo -np 4 ./mpi_veda
myrank = 2, nprocs = 4, started
myrank = 3, nprocs = 4, started
myrank = 1, nprocs = 4, started
myrank = 0, nprocs = 4, started
myrank = 1, nprocs = 4, context =         0xf0a5b8, created
myrank = 0, nprocs = 4, context =        0x1cab5b8, created
myrank = 3, nprocs = 4, context =        0x22b05b8, created
myrank = 2, nprocs = 4, context =        0x20ad5b8, created
myrank = 1, nprocs = 4, vebuf = 0x8000601000f47de0, allocated
myrank = 0, nprocs = 4, vebuf = 0x8000601000f47de0, allocated
myrank = 3, nprocs = 4, vebuf = 0x8000601000f47de0, allocated
myrank = 2, nprocs = 4, vebuf = 0x8000601000f47de0, allocated
VE init : 1 1048576 0x601000f47de0
VE init : 0 1048576 0x601000f47de0
VE init : 3 1048576 0x601000f47de0
VE init : 2 1048576 0x601000f47de0
myrank = 1, nprocs = 4, rc=0, VE offload
myrank = 0, nprocs = 4, rc=0, VE offload
myrank = 3, nprocs = 4, rc=0, VE offload
VE init : 0 1048576 0x601000f47de0
myrank = 2, nprocs = 4, rc=0, VE offload
VE init : 1 1048576 0x601000f47de0
myrank = 0, nprocs = 4, rc=0, VE offload
myrank = 1, nprocs = 4, rc=0, VE offload
VE init : 3 1048576 0x601000f47de0
VE init : 2 1048576 0x601000f47de0
myrank = 3, nprocs = 4, rc=0, VE offload
myrank = 2, nprocs = 4, rc=0, VE offload
VE init : 3 1048576 0x601000f47de0
VE init : 1 1048576 0x601000f47de0
VE init : 0 1048576 0x601000f47de0
VE init : 2 1048576 0x601000f47de0
myrank = 1, nprocs = 4, rc=0, VE offload
myrank = 3, nprocs = 4, rc=0, VE offload
myrank = 0, nprocs = 4, rc=0, VE offload
myrank = 2, nprocs = 4, rc=0, VE offload
Result : Success

Alternate way of building VEDA

VEDA Hybrid Offloading

(1) VEDA projects can be compiled including VEDA host and device code within a single CMake build system.
(2) We need to write separate files for host and device code.
(3) VEDA cmake project can be configured by setting CMAKE_MODULE_PATH to the VEDA cmake directory path.
(4) The languages VEDA_C, VEDA_CXX or VEDA_Fortran need to be enabled in cmake configuration file, by which it finds the scripts to compile the appropriate source code extension files of veda device.
(5) VEDA device source code file extensions need to be prefixed with v as filename.vc, filename.vcpp, filename.vf etc. afterwards, the entire compilation and linking process is automatically handled by CMake.

SET(CMAKE_MODULE_PATH /opt/nec/ve/share/veda/cmake)
FIND_PACKAGE(VEDA)
ENABLE_LANGUAGE(VEDA_C)
INCLUDE_DIRECTORIES(${VEDA_INCLUDE_DIRS})
ADD_EXECUTABLE          (hello_world host.c)
TARGET_LINK_LIBRARIES   (hello_world ${VEDA_LIBRARY})
ADD_LIBRARY             (hello_world_device SHARED device.vc)

Save the above code in CMakeLists.txt for C project

Sample C project directory:
CMakeLists.txt device.vc host.c

Sample CXX project directory:
CMakeLists.txt device.vcpp host.cpp

Sample Fortran project directory:
CMakeLists.txt device.vf host.c

How to build your project

To build a project, execute as follows:

mkdir build
cd build
cmake ../
make

Execute the build VEDA project binary:

$ ./hello_world
Hello World from Host!
Hello World from Device!

NUMA support

VEDA supports VE NUMA nodes since v0.10. To enable NUMA on your system you need to execute (set -N ? to specific device index):

VCMD="sudo /opt/nec/ve/bin/vecmd -N ?"
$VCMD vconfig set partitioning_mode on
$VCMD state set off
$VCMD state set mnt
$VCMD reset card

VEDA then recognizes each NUMA node as a separate device, i.e. with 2 physical devices in NUMA mode, VEDA would show 4 devices. You can use VEDAresult vedaDeviceDistance(float* distance, VEDAdevice devA, VEDAdevice devB) to determine the relationship of two VEDAdevices.

distance == 0.0; // same device
distance == 0.5; // same physical device, different NUMA node
distance == 1.0; // differeny physical device

Example program of NUMA node

#include <veda.h>
#include <cstdio>
#include <cstdlib>
#include <cassert>
#define CHECK(err) check(err, __FILE__, __LINE__)
void check(VEDAresult err, const char* file, const int line) {
        if(err != VEDA_SUCCESS) {
                const char* name = 0;
                vedaGetErrorName(err, &name);
                printf("Error: %i %s @ %s (%i)\n", err, name, file, line);
                assert(false);
                exit(1);
        }
}
int main(int argc, char** argv) {
        CHECK(vedaInit(0));
        int devcnt;
        CHECK(vedaDeviceGetCount(&devcnt));
        for(int devA = 0; devA < devcnt; devA++) {
                for(int devB = 0; devB < devcnt; devB++) {
                        float distance = 0;
                        CHECK(vedaDeviceDistance(&distance, devA, devB));
                        printf("Distance %i >> %i = %f (%s)\n", devA, devB, distance, distance == 0 ? "same device" : distance == 0.5f ? "different numa node" : "different device");
                }
        }
        CHECK(vedaExit());
        return 0;
}

Save the above code in numa_node.cpp.

Compile source code on VH side as shown below.

$ g++ -o numa_node numa_node.cpp -I/opt/nec/ve/share/veda/include -L/opt/nec/ve/veos/lib64 \

-rdynamic -Wl,-rpath,/opt/nec/ve/veos/lib64 -lveda

Execute the compiled code.

$ ./numa_node
Distance 0 >> 0 = 0.000000 (same device)
Distance 0 >> 1 = 0.500000 (different numa node)
Distance 1 >> 0 = 0.500000 (different numa node)
Distance 1 >> 1 = 0.000000 (same device)

VEDA-smi

The executable veda-smi displays available VEDA devices in your system. It uses the VEDA_VISIBLE_DEVICES env var and therefore only shows the devices that your VEDA application would be able to use. Use VEDA_VISIBLE_DEVICES= veda-smi to ensure that you see all installed devices.

╔ veda-smi ═════════════════════════════════════════════════════════════════════╗
║ VEDA Version: 2.10.0     AVEO Version: 2.8.2                                  ║
╚═══════════════════════════════════════════════════════════════════════════════╝
┌── #0  NEC SX-Aurora Tsubasa VE10B ────────────────────────────────────────────┐
  ┌ Physical: 1.0
  ├ AVEO:     0.0
  ├ Clock:    current: 1400 MHz, base: 800 MHz, memory: 1600 MHz
  ├ Firmware: 5399
  ├ Memory:   49152 MiB
  ├ Cache:    LLC: 8192kB, L2: 256kB, L1d: 32kB, L1i: 32kB
  ├ Temp:     56.4°C 56.4°C 57.0°C 56.1°C
  └ Power:    18.0W (11.9V, 1.5A)
└───────────────────────────────────────────────────────────────────────────────┘
┌── #1  NEC SX-Aurora Tsubasa VE10B ────────────────────────────────────────────┐
  ┌ Physical: 1.1
  ├ AVEO:     0.1
  ├ Clock:    current: 1400 MHz, base: 800 MHz, memory: 1600 MHz
  ├ Firmware: 5399
  ├ Memory:   49152 MiB
  ├ Cache:    LLC: 8192kB, L2: 256kB, L1d: 32kB, L1i: 32kB
  ├ Temp:     56.1°C 56.4°C 55.9°C 56.0°C
  └ Power:    18.0W (11.9V, 1.5A)
└───────────────────────────────────────────────────────────────────────────────┘
┌── #2  NEC SX-Aurora Tsubasa VE10B ────────────────────────────────────────────┐
  ┌ Physical: 0.0
  ├ AVEO:     1.0
  ├ Clock:    current: 1400 MHz, base: 800 MHz, memory: 1600 MHz
  ├ Firmware: 5399
  ├ Memory:   49152 MiB
  ├ Cache:    LLC: 16384kB, L2: 256kB, L1d: 32kB, L1i: 32kB
  ├ Temp:     53.8°C 53.5°C 54.1°C 53.8°C 53.8°C 54.1°C 53.2°C 53.5°C
  └ Power:    36.3W (11.9V, 3.1A)
└───────────────────────────────────────────────────────────────────────────────┘

Environment variables for VEDA

Environment variables	Brief	Default value
VE_LD_LIBRARY_PATH	Default library path of VE to check dynamic and shared libraries required for current VEDA program.	.(Current directory)
LD_LIBRARY_PATH	Default library path for VE to check for available dynamic and shared libraries of VEDA.	None
VEDA_VISIBLE_DEVICES	To restrict VEDA to only use those GPUs that have peer-to-peer support.	0

Environment variables to optimize data transfer

AVEO(2.7.5 or later) supports the environment variables to optimize the performance of data transfer.

Please refer below link to set AVEO related environment variables, https://sxauroratsubasa.sakura.ne.jp/documents/veos/en/aveo/md_GettingStarted.html