Forum - STREAM Benchmark Results for different Vector Lengths


Hello,

I am currently investigating the relationship between vector length and memory bandwidth on the SX-Aurora. For that purpose I changed the for loops of the various kernels so that the inner loop runs over the vector length. The loops then look like this:
for (j = 0; j < STREAM_ARRAY_SIZE; j += VLEN) {
  #pragma _NEC nounroll
  #pragma _NEC vector_threshold(1)
  for (int i = 0; i < VLEN; ++i) {
    /* Copy, Scale, Add or Triad kernel here */
  }
}
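To make the loop structure concrete, here is a minimal sketch of what the strip-mined Triad kernel might look like as a complete function. The function name, the parameter names and the `j + i` indexing are assumptions for illustration and not the exact benchmark code. In the original, VLEN is presumably a compile-time constant, while the sketch takes it as a runtime argument and clamps the last strip.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical strip-mined Triad: a[k] = b[k] + scalar * c[k].
 * n corresponds to STREAM_ARRAY_SIZE and vlen to VLEN in the post. */
void triad_strip_mined(double *a, const double *b, const double *c,
                       double scalar, size_t n, size_t vlen)
{
    assert(vlen > 0); /* a zero strip length would never advance j */

    for (size_t j = 0; j < n; j += vlen) {
        /* Clamp the last strip so n need not be a multiple of vlen. */
        size_t len = (n - j < vlen) ? (n - j) : vlen;

        /* NEC compiler directives taken from the post; other compilers
         * ignore unknown pragmas, possibly with a warning. */
        #pragma _NEC nounroll
        #pragma _NEC vector_threshold(1)
        for (size_t i = 0; i < len; ++i) {
            a[j + i] = b[j + i] + scalar * c[j + i];
        }
    }
}
```

Calling `triad_strip_mined(a, b, c, 3.0, n, 64)` would process the arrays in strips of 64 elements. Whether the compiler then really issues vector instructions of exactly that length has to be checked in the compiler listing.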
What I initially expected was a linear relationship between bandwidth and vector length: with a length of 256 I would reach the official ~1229 GB/s or a value close to it, and with a length of 64, for example, I would reach 1/4 of that peak (or 1/4 of the value measured at 256).
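The linear assumption can be written down as a small sketch. The constant and function names are made up for illustration. The peak of 1229 GB/s and the maximum vector length of 256 are the values quoted above.

```c
#include <assert.h>

/* Values quoted in the post: peak bandwidth and maximum vector length. */
#define PEAK_GBS 1229.0
#define MAX_VLEN 256

/* Bandwidth in GB/s predicted by the linear assumption for a vector length. */
double linear_expected_gbs(int vlen)
{
    assert(vlen >= 1 && vlen <= MAX_VLEN);
    return PEAK_GBS * (double)vlen / MAX_VLEN;
}
```

For a length of 64 this gives 307.25 GB/s. Rounding the per-element share to 4.8 GB/s first gives 307.2 GB/s instead.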
But the results are very different, as the following table shows:

STREAM benchmark results, array size 2.2 GiB, LLC hit rate < 0.01%

Vector  Copy      Scale     Add       Triad     Best       Best, % of  Best, % of     Linear
length  (GiB/s)   (GiB/s)   (GiB/s)   (GiB/s)   (GB/s)     1229 GB/s   VL=256 result  model (GB/s)
1       21.0967   20.1343   24.5327   26.8802   28.8624    2.3484      2.7469         4.8
2       43.9901   45.0213   54.4394   54.7388   58.7753    4.7824      5.5938         9.6
4       94.2324   94.4371   106.0623  107.2386  115.1466   9.3691      10.9588        19.2
8       182.8993  179.9025  182.5055  191.794   204.9372   16.675      19.5997        38.4
16      309.3943  302.7845  324.9520  316.2976  348.9146   28.3901     33.2072        76.8
32      451.3887  448.9818  485.1928  495.0939  531.603    43.2549     50.5942        153.6
64      754.8973  730.8980  896.0194  896.5940  962.7105   78.3328     91.624         307.2
128     984.1308  983.8855  977.8357  980.4262  1056.7024  85.9867     100.5695       614.4
256     958.9995  959.8635  973.6647  978.5581  1050.718   85.4938     100            1229
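For anyone who wants to check the derived columns, here is a minimal sketch of the two conversions involved. The helper names are made up for illustration. It assumes that the GiB figures are binary units (2^30 bytes) and the GB figures decimal units (10^9 bytes).

```c
#include <assert.h>

/* Convert GiB/s (2^30 bytes) to GB/s (10^9 bytes). */
double gib_to_gb(double gib)
{
    return gib * 1073741824.0 / 1e9;
}

/* Express value as a percentage of a positive reference value. */
double percent_of(double value, double reference)
{
    assert(reference > 0.0);
    return 100.0 * value / reference;
}
```

As a check, the Triad result at vector length 256 (978.5581 GiB/s) converts to about 1050.72 GB/s, which is roughly 85.5% of the 1229 GB/s peak.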

I have validated these results over multiple runs of the benchmark; they vary only slightly. I am now left with the question of what causes them. The official documentation of the hardware and memory system gives no hint as to why this happens, and for my work I have to find the true reason for this behaviour. So I have the following questions:

What causes this nonlinear relationship between memory bandwidth and vector length? There are probably hardware reasons for it.
And how can it be modeled realistically?

Thank you for your help

Posted by CPTSulu on 17 October 2022 at 10:17.
Edited by CPTSulu on 17 October 2022 at 10:21.