Execution
Parallel Execution
Python scripts using NLCPy can gain performance when they are executed in parallel by multithreading on a Vector Engine (VE).
The default number of parallel threads can be specified by the environment variable VE_OMP_NUM_THREADS
at the time of execution.
If VE_OMP_NUM_THREADS is undefined or has an invalid value, the number of threads is set to the maximum number of available CPU cores on the system.
Examples are shown below.
Interactive mode:
$ VE_OMP_NUM_THREADS=4 python
>>> import nlcpy
Subsequent computations will be performed by 4 parallel threads on a VE.
Non-interactive mode:
$ VE_OMP_NUM_THREADS=4 python example.py
Computations in example.py will be performed by 4 parallel threads on a VE.
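For reference, a minimal example.py might look like the sketch below (the array size and operations are illustrative, not from the original documentation); the element-wise operations and the reduction are dispatched to the VE, where they run with the configured number of threads.

import nlcpy

# Create input data on the VE (size is illustrative).
x = nlcpy.arange(10000000, dtype='f8')

# Element-wise operations and reductions run on the VE,
# parallelized over VE_OMP_NUM_THREADS threads.
y = nlcpy.sqrt(x) + nlcpy.sin(x)
print(y.sum())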
When you want to specify the VE node number, use the environment variable VE_NODE_NUMBER
as follows.
Interactive mode:
$ VE_NODE_NUMBER=1 VE_OMP_NUM_THREADS=4 python
>>> import nlcpy
Subsequent computations will be performed by 4 parallel threads on VE Node #1.
Non-interactive mode:
$ VE_NODE_NUMBER=1 VE_OMP_NUM_THREADS=4 python example.py
Computations in example.py will be performed by 4 parallel threads on VE Node #1.
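These variables can also be set from within Python using the standard os.environ dictionary (plain Python, not an NLCPy API). This is a sketch under the assumption that both variables are read when the VE process is created, so they must be set before import nlcpy.

import os

# Must be set before importing nlcpy, because NLCPy creates
# the VE process at import time.
os.environ['VE_NODE_NUMBER'] = '1'
os.environ['VE_OMP_NUM_THREADS'] = '4'

import nlcpy  # subsequent computations run with 4 threads on VE Node #1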
Note
If you execute two or more NLCPy programs on the same VE, context switching occurs and performance degrades significantly.
Using Multiple VEs
NLCPy provides two ways to use multiple VEs.
- mpi4py-ve is a Message Passing Interface (MPI) Python library for SX-Aurora TSUBASA. It provides point-to-point and collective communication operations among processes. The available objects for MPI communication are numpy.ndarray and nlcpy.ndarray. Note that mpi4py-ve requires the NEC MPI runtime packages.
For details, please refer to the mpi4py-ve project. A minimal sketch is shown after this list.
- You can select the execution VE device in a Python script by using the "with" context manager:
import nlcpy

with nlcpy.venode.VE(0):
    # do something on VE#0
    x_ve0 = nlcpy.arange(10)

with nlcpy.venode.VE(1):
    # do something on VE#1
    y_ve1 = nlcpy.arange(10)

# transfer x_ve0 to VE#1
x_ve1 = nlcpy.venode.transfer_array(x_ve0, nlcpy.venode.VE(1))

with nlcpy.venode.VE(1):
    # do something on VE#1
    z_ve1 = x_ve1 + y_ve1
Note that nlcpy.venode.transfer_array() cannot yet transfer data directly between VEs; in other words, data transfer between VEs has to go through the VH. If you need to transfer large data, we recommend using mpi4py-ve.

By default, NLCPy creates a VE process only on physical VE#0 when executing import nlcpy. Other VE processes will be created at the first call of nlcpy.venode.VENode.__enter__(), nlcpy.venode.VENode.use(), or nlcpy.venode.VENode.apply(). Please note that creating a VE process takes a few seconds. If you want to create processes on some VEs at import nlcpy, you can set the environment variable VE_NLCPY_NODELIST=0,1,.... NLCPy then creates processes at import nlcpy for the physical VEs corresponding to the IDs set by VE_NLCPY_NODELIST.

When the environment variable VE_NLCPY_NODELIST is set, the id argument of nlcpy.venode.VE() is a logical id, which may differ from the physical VE id. An example of the VE device mapping when VE_NLCPY_NODELIST=1,2 is set is the following:

$ VE_NLCPY_NODELIST=1,2 python
>>> import nlcpy
>>> nlcpy.venode.VE(0)
<VE node logical_id=0, physical_id=1>
>>> nlcpy.venode.VE(1)
<VE node logical_id=1, physical_id=2>
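For the first approach, the following is a minimal point-to-point sketch. It assumes the mpi4pyve module name used by the mpi4py-ve project and a launcher such as mpirun from the NEC MPI runtime; please check the project documentation for the exact API and launch options.

from mpi4pyve import MPI
import nlcpy

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Build an array on this process's VE and send it to rank 1.
    x = nlcpy.arange(10, dtype='f8')
    comm.Send(x, dest=1)
elif rank == 1:
    # Receive directly into an nlcpy.ndarray buffer.
    x = nlcpy.empty(10, dtype='f8')
    comm.Recv(x, source=0)
    print(x)

Launched with two processes (e.g., mpirun -np 2 python sample.py, with VE assignment handled by the NEC MPI launcher), large transfers avoid the VH round trip mentioned above.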
Optimization for Mathematical Functions
When the environment variable VE_NLCPY_FAST_MATH is set to yes or YES,
NLCPy uses the shared object libnlcpy_ve_kernel_fast_math.so on the VE.
By default, VE_NLCPY_FAST_MATH is not set.
This shared object has been compiled with the following optimization options of the NEC C/C++ compiler.
-ffast-math
Uses fast scalar versions of mathematical functions outside of vectorized loops.

-mno-vector-intrinsic-check
Disables the checks of the value ranges of arguments in vectorized mathematical functions. The target mathematical functions of this option are as follows: acos, acosh, asin, atan, atan2, atanh, cos, cosh, cotan, exp, exp10, exp2, expm1, log10, log2, log, pow, sin, sinh, sqrt, tan, tanh.

-freciprocal-math
Allows changing an expression x/y to x * (1/y).

-mvector-power-to-explog
Allows replacing pow(R1,R2) in a vectorized loop with exp(R2*log(R1)); powf() is replaced, too. This replacement can shorten the execution time, but may rarely introduce numerical error in the calculation.

-mvector-low-precise-divide-function
Allows using a low-precision version of the vector floating-point divide operation. It is faster than the normal precise version, but the result may include at most a one-bit numerical error in the mantissa.
These optimizations can cause side effects; for example, nan or inf might not be obtained correctly.
You can set VE_NLCPY_FAST_MATH
as follows:
Interactive mode:
$ VE_NLCPY_FAST_MATH=yes python
>>> import nlcpy
Non-interactive mode:
$ VE_NLCPY_FAST_MATH=yes python example.py
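As an illustration of such a side effect, the sketch below relies on NumPy-compatible semantics (an assumption for this example): without fast math, log returns nan and -inf for out-of-domain inputs, whereas with VE_NLCPY_FAST_MATH=yes these special values are not guaranteed.

import nlcpy

a = nlcpy.array([-1.0, 0.0, 1.0])

# Without fast math: log(-1.0) -> nan, log(0.0) -> -inf, log(1.0) -> 0.0.
# With VE_NLCPY_FAST_MATH=yes, the nan/inf results are not guaranteed.
print(nlcpy.log(a))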
Memory Pool Management
NLCPy reduces the overhead of VE memory allocation by reusing pre-allocated memory (a memory pool) instead of calling malloc and free for each request.
You can control the size of the memory pool with the environment variable VE_NLCPY_MEMPOOL_SIZE.
The default value is 1 GB.
For the usage of this variable, please refer to Environment Variables.
In some cases, setting this variable to a value larger than 1 GB may improve performance. However, memory fragmentation may then cause an out-of-memory error.
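For example, the pool could be enlarged at launch time as below; the 4G unit-suffix syntax here is an assumption, so please check Environment Variables for the exact value format.

$ VE_NLCPY_MEMPOOL_SIZE=4G python example.py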