Part of a program can be offloaded from the VE to the VH with VH Call. This is effective for operations that cannot be vectorized and therefore run poorly on the VE. A typical example is formatted file I/O.
The attached File:FMTIOwithVHCALL.tgz is a sample program that performs formatted file output with VH Call.
For details of VH Call, a feature of "libsysve", see the following.
https://sxauroratsubasa.sakura.ne.jp/documents/veos/en/libsysve/md_doc_VHCall.html
"FMTIOwithVHCALL.tgz" contains the following.
./FMTIOwithVHCALL/
|-- Makefile: Makefile of the main program.
| vhcall is utilized if "CPPFLAGS=-DUSEVHCALL" is specified.
|-- ./lib/: Build environment for the shared library "libvhcall.so", which runs on VH.
| |-- Makefile: Makefile of "libvhcall.so". gcc is used here.
| `-- libvhcall.c: Source of the function "vhfprintf", which runs on VH.
|-- run.sh: Execution script.
|-- vhcalltest.c: Source program of the main program.
|-- vhcalltest_before.c: Source of the original main program without vhcall.
`-- vhcalltest_simple.c: Source of a simplified main program. Return values are not checked, so the code is easier to read.
You can build and run it as follows.
$ tar -zxvf FMTIOwithVHCALL.tgz
$ cd ./FMTIOwithVHCALL/lib/
$ make # ".so" file is made with gcc.
$ cd ../
$ make # Main program is made with ncc.
$ ./run.sh
This program writes 2,097,152 double precision values (ARRAYSIZE = 524288*4) as formatted output into the file "out".
The following is the "proginf" output without and with vhcall.
                                      Without vhcall     With vhcall
Real Time (sec)                     :      14.606438        0.970042
User Time (sec)                     :      14.561287        0.000285
Vector Time (sec)                   :       0.000053        0.000053
Inst. Count                         :    13769140634          127989
V. Inst. Count                      :          16384           16384
V. Element Count                    :        4194304         4194304
V. Load Element Count               :              0               0
FLOP Count                          :              0               0
MOPS                                :     962.268177    25886.079776
MOPS (Real)                         :     942.962328        4.438896
MFLOPS                              :       0.000000        0.000000
MFLOPS (Real)                       :       0.000000        0.000000
A. V. Length                        :     256.000000      256.000000
V. Op. Ratio (%)                    :       0.030452       97.408097
L1 Cache Miss (sec)                 :       2.279239        0.000069
CPU Port Conf. (sec)                :       0.000000        0.000000
V. Arith. Exec. (sec)               :       0.000052        0.000052
V. Load Exec. (sec)                 :       0.000000        0.000000
VLD LLC Hit Element Ratio (%)       :       0.000000        0.000000
FMA Element Count                   :              0               0
Power Throttling (sec)              :       0.000000        0.000000
Thermal Throttling (sec)            :       0.000000        0.000000
Memory Size Used (MB)               :     298.000000      298.000000
Non Swappable Memory Size Used (MB) :      86.000000       86.000000
The real time drops from 14.6 seconds to 0.97 seconds, which is about 15 times faster.
The original program (without vhcall) is as follows (source file "vhcalltest_before.c").
1 #include <stdio.h>
2 #include <stdlib.h>
3
4 #define ARRAYSIZE (524288*4)
5
6 int main(void){
7 double a[ARRAYSIZE];
8 int i;
9 FILE *fp;
10
11 for(i=0; i<ARRAYSIZE; i++) a[i] = 999.0;
12
13 fp = fopen("out", "w");
14 for(i=0; i<ARRAYSIZE; i++) fprintf(fp, "%d %le\n", i, a[i]);
15 fclose(fp);
16
17 return EXIT_SUCCESS;
18 }
19
The operations on lines 13 to 15 are offloaded to the VH.
The source code added to use vhcall is the part of "vhcalltest.c" that is enabled when "USEVHCALL" is defined.
- Load the VH Call library "libvhcall.so" with "vhcall_install".
- Get a handle to the function that runs on VH with "vhcall_find".
- Allocate the structure that holds the arguments of the VH Call function with "vhcall_args_alloc".
- Set the arguments with "vhcall_args_set_pointer", "vhcall_args_set_i32", etc. Here, argument 0 is the file name, argument 1 is the number of elements, and argument 2 is the target array "a".
- After that, execute the offloaded function with "vhcall_invoke_with_args".
- Clean up with "vhcall_args_free" and "vhcall_uninstall".
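The steps above can be sketched on the VE side roughly as follows. This is not the code from "vhcalltest.c" but a minimal illustration written from the libsysve documentation linked above. The wrapper name "write_array", the library path "./lib/libvhcall.so", and the convention that "vhfprintf" returns 0 on success are assumptions made for this sketch. When "USEVHCALL" is not defined, it falls back to the plain fprintf loop of the original program.

```c
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#ifdef USEVHCALL
#include <libvhcall.h>
#endif

/* Hypothetical wrapper: write n doubles from a[] to filename as
 * "index value" lines. Returns 0 on success, -1 on failure. */
int write_array(const char *filename, int n, double *a)
{
#ifdef USEVHCALL
    int rc = -1;
    int64_t symid;
    uint64_t retval = 0;
    vhcall_args *ca = NULL;

    /* Load the VH-side shared library (path is an assumption). */
    vhcall_handle handle = vhcall_install("./lib/libvhcall.so");
    if (handle == (vhcall_handle)-1)
        return -1;

    /* Look up the function that will run on VH. */
    symid = vhcall_find(handle, "vhfprintf");
    if (symid == -1)
        goto out;

    /* Allocate the argument list. */
    ca = vhcall_args_alloc();
    if (ca == NULL)
        goto out;

    /* Argument 0: file name, 1: number of elements, 2: array "a".
     * VHCALL_INTENT_IN copies each buffer from VE to VH only. */
    if (vhcall_args_set_pointer(ca, VHCALL_INTENT_IN, 0,
                                (void *)filename, strlen(filename) + 1))
        goto out;
    if (vhcall_args_set_i32(ca, 1, n))
        goto out;
    if (vhcall_args_set_pointer(ca, VHCALL_INTENT_IN, 2,
                                a, (size_t)n * sizeof(double)))
        goto out;

    /* Run the function on VH; retval receives its return value.
     * Treating 0 as success is an assumption of this sketch. */
    if (vhcall_invoke_with_args(symid, ca, &retval) == 0 && retval == 0)
        rc = 0;

out:
    if (ca != NULL)
        vhcall_args_free(ca);
    if (vhcall_uninstall(handle))
        rc = -1;
    return rc;
#else
    /* Fallback without VH Call: formatted output directly on the VE. */
    FILE *fp = fopen(filename, "w");
    if (fp == NULL)
        return -1;
    for (int i = 0; i < n; i++)
        fprintf(fp, "%d %le\n", i, a[i]);
    return fclose(fp) == 0 ? 0 : -1;
#endif
}
```

Calling write_array("out", ARRAYSIZE, a) would then replace lines 13 to 15 of the original program. Passing the whole array in one call matters here: the data crosses from VE to VH once, instead of once per element.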
"./lib/" is the build environment of the shared library "libvhcall.so", whose source is "libvhcall.c".
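For illustration, the VH-side function could look like the sketch below. It is not the actual "libvhcall.c" from the archive. The signature follows the argument order described above (file name, number of elements, array), and the return convention (0 on success, 1 on failure) is an assumption. With "vhcall_invoke_with_args", the function receives its arguments as ordinary C arguments, and pointer arguments refer to buffers already copied to VH memory.

```c
#include <stdio.h>
#include <stdint.h>

/* Runs on VH as an ordinary x86 function inside the shared library.
 * Writes n "index value" lines to filename.
 * Returns 0 on success, 1 on failure (assumed convention). */
uint64_t vhfprintf(const char *filename, int32_t n, const double *a)
{
    FILE *fp = fopen(filename, "w");
    if (fp == NULL)
        return 1;

    for (int32_t i = 0; i < n; i++) {
        /* Same format string as the original VE program. */
        if (fprintf(fp, "%d %le\n", i, a[i]) < 0) {
            fclose(fp);
            return 1;
        }
    }

    return fclose(fp) == 0 ? 0 : 1;
}
```

A file like this would typically be built into a shared library with something like "gcc -shared -fPIC -o libvhcall.so libvhcall.c". The Makefile in "./lib/" may use different options.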
To use VH Call, enable "-DUSEVHCALL" in the Makefile of the main program.
Posted by Tkato on 26 September 2022 at 09:08. Edited by Tkato on 26 September 2022 at 09:20.