Hello,
I am currently investigating the relationship between vector length and memory bandwidth on the SX-Aurora. For that purpose I changed the for loops of the STREAM kernels to include the vector length.
The loops then look like this:
for (j = 0; j < STREAM_ARRAY_SIZE; j += VLEN) {
#pragma _NEC nounroll
#pragma _NEC vector_threshold(1)
    for (int i = 0; i < VLEN; ++i) {
        /* Copy, Scale, Add or Triad kernel here */
    }
}
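A minimal sketch of how such a blocked loop might look for the Triad kernel, assuming the array length is a multiple of the block length. The function name `triad_blocked` and its parameters are hypothetical and not taken from the STREAM source; the NEC pragmas are copied from the loop above, and compilers that do not recognize them ignore them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of STREAM: Triad with the loop split into
 * blocks of vlen elements. Assumes n is a multiple of vlen. */
void triad_blocked(double *a, const double *b, const double *c,
                   double scalar, size_t n, size_t vlen)
{
    assert(vlen > 0 && n % vlen == 0);

    /* Outer loop walks over the array one block at a time. */
    for (size_t j = 0; j < n; j += vlen) {
#pragma _NEC nounroll
#pragma _NEC vector_threshold(1)
        /* Inner loop has exactly vlen iterations, so its trip count
         * is the quantity being varied in the experiment. */
        for (size_t i = 0; i < vlen; ++i) {
            a[j + i] = b[j + i] + scalar * c[j + i];
        }
    }
}
```

Whether the compiler actually issues one vector instruction of length `vlen` per inner loop would still need to be checked in the compiler's vectorization report.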
What I initially expected was a linear relationship between bandwidth and vector length. That would mean that at a length of 256 I reach the official ~1229 GB/s or a value close to it, and that at a length of 64, for example, I reach 1/4 of that peak or of the value measured at 256.
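The linear expectation can be written down as a small helper, sketched here under the assumption that the peak of 1229 GB/s is reached exactly at a vector length of 256. The names `PEAK_GBS`, `FULL_VLEN` and `linear_expected_gbs` are made up for illustration.

```c
#include <assert.h>

#define PEAK_GBS  1229.0  /* official peak bandwidth quoted in the post, GB/s */
#define FULL_VLEN 256     /* vector length at which the peak is expected */

/* Hypothetical helper: bandwidth that a purely linear model predicts
 * for a given vector length. */
double linear_expected_gbs(int vlen)
{
    assert(vlen >= 1 && vlen <= FULL_VLEN);
    return PEAK_GBS * (double)vlen / FULL_VLEN;
}
```

For a vector length of 64 this gives 1229 * 64 / 256 = 307.25 GB/s, and for a length of 1 about 4.8 GB/s.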
But the results are very different. You can see them in the following table:
Stream Benchmark Results, Array Size of 2.2 GiB, LLC Hitrate <0.01%
| Vector Length | Copy (GiB/s) | Scale (GiB/s) | Add (GiB/s) | Triad (GiB/s) | Best result (GB/s) | Best as % of 1229 GB/s | Best as % of measured value at 256 | Expected under linear relationship (GB/s) |
|---|---|---|---|---|---|---|---|---|
| 1 | 21.0967 | 20.1343 | 24.5327 | 26.8802 | 28.8624 | 2.3484 | 2.7469 | 4.8 |
| 2 | 43.9901 | 45.0213 | 54.4394 | 54.7388 | 58.7753 | 4.7824 | 5.5938 | 9.6 |
| 4 | 94.2324 | 94.4371 | 106.0623 | 107.2386 | 115.1466 | 9.3691 | 10.9588 | 19.2 |
| 8 | 182.8993 | 179.9025 | 182.5055 | 191.794 | 204.9372 | 16.675 | 19.5997 | 38.4 |
| 16 | 309.3943 | 302.7845 | 324.9520 | 316.2976 | 348.9146 | 28.3901 | 33.2072 | 76.8 |
| 32 | 451.3887 | 448.9818 | 485.1928 | 495.0939 | 531.603 | 43.2549 | 50.5942 | 153.6 |
| 64 | 754.8973 | 730.8980 | 896.0194 | 896.5940 | 962.7105 | 78.3328 | 91.624 | 307.2 |
| 128 | 984.1308 | 983.8855 | 977.8357 | 980.4262 | 1056.7024 | 85.9867 | 100.5695 | 614.4 |
| 256 | 958.9995 | 959.8635 | 973.6647 | 978.5581 | 1050.718 | 65.4938 | 100 | 1229 |
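The derived columns (best result in GB/s and its percentage of a reference value) follow from a unit conversion and a ratio. A minimal sketch, assuming the benchmark reports binary GiB/s (2^30 bytes) and the peak is given in decimal GB/s (10^9 bytes); the helper names are hypothetical.

```c
#include <assert.h>

#define BYTES_PER_GIB 1073741824.0  /* 2^30 bytes */
#define BYTES_PER_GB  1000000000.0  /* 10^9 bytes */

/* Convert a bandwidth from GiB/s to GB/s. */
double gib_to_gb(double gib_per_s)
{
    return gib_per_s * BYTES_PER_GIB / BYTES_PER_GB;
}

/* Express value as a percentage of reference (same units for both). */
double percent_of(double value, double reference)
{
    return 100.0 * value / reference;
}
```

For the Triad result at vector length 1 (26.8802 GiB/s), `gib_to_gb` gives about 28.86 GB/s, and `percent_of` against 1229 GB/s gives about 2.35%, in line with the first row of the table.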
I have validated these results over multiple runs of the benchmark; they vary only slightly. I am now left with the question of what causes them. The official documentation of the hardware and memory system gives no hints as to why this happens, and for my work I have to find the true reason for this behaviour. So I have the following questions:
What causes this nonlinear relationship between memory bandwidth and vector length? There are probably hardware reasons for it.
And how can it be modeled realistically?
Thank you for your help
Posted by CPTSulu on 17 October 2022 at 10:17. Edited by CPTSulu on 17 October 2022 at 10:21.