Basic Usage
This section describes how to call the SCA interface in your program.
Overview
The procedure using the SCA interface consists of the following five steps.
Here is a simple example that adds adjacent elements of an one-dimensional array xin
, whose size is 10, and stores the result to xout
.
>>> import nlcpy
>>> xin = nlcpy.arange(10, dtype='f4')
>>> xout = nlcpy.zeros_like(xin)
>>> dxin, dxout = nlcpy.sca.create_descriptor((xin, xout)) # Creation of Stencil Descriptor
>>> desc_i = dxin[-1] + dxin[0] + dxin[1] # Definition of Stencil Description
>>> desc_o = dxout[0]
>>> kern = nlcpy.sca.create_kernel(desc_i, desc_o=desc_o) # Creation of Kernel
>>> res = kern.execute() # Execution of Kernel
>>> res
array([ 0., 3., 6., 9., 12., 15., 18., 21., 24., 0.], dtype=float32)
>>> nlcpy.sca.destroy_kernel(kern) # Destruction of Kernel
Creation of Stencil Descriptor
A stencil descriptor can be created from a nlcpy.ndarray
used in the stencil computation.
The stencil descriptor is a Python object that can represent a stencil shape and is associated with the nlcpy.ndarray
.
In [1]:
>>> xin = nlcpy.arange(10, dtype='f4')
>>> xout = nlcpy.zeros_like(xin)
>>> dxin, dxout = nlcpy.sca.create_descriptor((xin, xout))
Returns one or more stencil descriptors. |
Definition of Stencil Description
Elements of stencil descriptor described above can concretely define a stencil description, which means “stencil shape”. The stencil description can be denoted by relative indices of the stencil descriptor. The following example defines that adds adjacent elements for each element of a one-dimensional array:
In [2]:
>>> desc_i = dxin[-1] + dxin[0] + dxin[1]
>>> desc_i
Out[2]:
stencil description
in_0[0, 0, 0, -1] +
in_0[0, 0, 0, 0] +
in_0[0, 0, 0, 1]
assigned arrays
in_0: shape=(10,), dtype=float32 array
computation size
nx = 8, ny = 1, nz = 1, nw = 1
For details of how to set coefficients for the input description, please see Applying Coefficient.
You can also define the output description if you need. The output description is useful when you specify an array offset for the output. For details of the array offset, please see Offset Adjustment for Output Array.
In [3]:
>>> desc_o = dxout[0]
>>> desc_o
Out[3]:
stencil description
in_0[0, 0, 0, 0]
assigned arrays
in_0: shape=(10,), dtype=float32 array
computation size
nx = 10, ny = 1, nz = 1, nw = 1
Creation of Kernel
After defining the stencil description, you can create a SCA kernel, which is an instruction sequence required for computations defined by the stencil description. nlcpy.sca.create_kernel()
dynamically generates the instruction sequence, stores it into the memory on VE, and returns the object of the SCA kernel.
In [4]:
>>> kern = nlcpy.sca.create_kernel(desc_i, desc_o=desc_o)
Creates a SCA kernel. |
Execution of Kernel
After the creation of the SCA kernel, you can execute the SCA kernel.
In [5]:
>>> res = kern.execute()
In [6]:
>>> res
Out[6]:
array([ 0., 3., 6., 9., 12., 15., 18., 21., 24., 0.], dtype=float32)
If you specify desc_o as a keyword argument to nlcpy.sca.create_kernel()
, the nlcpy.ndarray
returned by nlcpy.sca.kernel.kernel.execute()
is identical to the nlcpy.ndarray
which is associated with desc_o
. The IDs of them are the same.
In [7]:
>>> id(res) == id(xout)
Out[7]:
True
Destruction of Kernel
The destruction of the SCA kernel can be done as follows:
In [8]:
>>> nlcpy.sca.destroy_kernel(kern)
Even if you do not explicitly destroy the SCA kernel, it will be automatically destroyed by the garbage collector when there are no more references to the SCA kernel. However, for programs where the reference to the SCA kernel remains to the end, it may squeeze memory, so it is recommended to destroy the SCA kernel properly when it is no longer used.
Destroy a SCA kernel. |
Speedup Method (TIPS)
Stride Adjustment
Please use nlcpy.sca.convert_optimized_array()
to gain maximal performance.
This function converts ndarrays into optimized ndarrays, whose strides are adjusted to improve performance.
It is highly recommended to use this function from a performance standpoint, although it is not necessary to use it.
Note that nlcpy.sca.convert_optimized_array()
returns a copy of the input nlcpy.ndarray
, not a view. So, memory area of the returned nlcpy.ndarray
is different from that of the input nlcpy.ndarray
.
In [9]:
>>> import nlcpy
>>> x = nlcpy.random.rand(1000, 1000)
>>> x_opt = nlcpy.sca.convert_optimized_array(x, dtype='f8')
>>> x.strides
Out[9]:
(8000, 8)
In [10]:
>>> x_opt.strides
Out[10]:
(8008, 8)
In [11]:
>>> nlcpy.all(x == x_opt)
Out[11]:
array(True)
Converts existing ndarrays into optimized ndarrays, whose strides are adjusted to improve perfomance, filled with zeros. |
|
Creates an optimized ndarray, whose strides are adjusted to improve perfomance, filled with zeros. |
Kernel Reuse
To gain maximal performance, it is strongly recommended to reuse the created SCA kernel if the stencil description or the coefficients of the stencil kernel is unchanged. If you repeat to create SCA kernels, your program will not be able to obtain sufficient performance because the cost of the creating a SCA kernel is not so small compared to executing the kernel.