Forum - Peak Performance and FMA Unit Utilization


Hello,

I tried to write a kernel that reaches the theoretical DP peak performance. The Aurora has three FMA units per pipeline that can execute independent operations, so a kernel with a single FMA operation should theoretically reach 1/3 of peak performance, one with two operations 2/3, and one with three 100%.

But I noticed something peculiar: a kernel with one FMA operation reaches 2/3 of peak performance, a kernel with two FMA operations reaches ~80%, and one with three reaches the peak.
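To make the mismatch explicit, here is a minimal C sketch of the arithmetic. It assumes only the three FMA units mentioned above; the function names are made up for illustration, and the "measured" values are simply the numbers quoted in this post.

```c
#include <stdio.h>

/* From the post: three FMA units per vector pipeline. */
#define FMA_UNITS 3

/* Naive expectation: each independent FMA line keeps exactly one unit busy. */
double naive_peak_fraction(int fma_lines) {
    int busy = fma_lines < FMA_UNITS ? fma_lines : FMA_UNITS;
    return (double)busy / FMA_UNITS;
}

/* Units that would have to be busy on average to explain a measured
 * fraction of peak. */
double implied_busy_units(double measured_fraction) {
    return measured_fraction * FMA_UNITS;
}

/* Print the naive expectation next to the fractions quoted in the post. */
void print_comparison(void) {
    const double measured[3] = {2.0 / 3.0, 0.80, 1.0}; /* ~80% is approximate */
    for (int k = 1; k <= 3; ++k) {
        printf("%d FMA line(s): expected %.2f, measured %.2f (~%.1f units busy)\n",
               k, naive_peak_fraction(k), measured[k - 1],
               implied_busy_units(measured[k - 1]));
    }
}
```

Calling print_comparison() from main shows that the single-FMA case behaves as if about two units were busy rather than one.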

The code I used is the following:
double a[VLEN];
double c[VLEN];
double aa[VLEN];
double cc[VLEN];
double aaa[VLEN];
double ccc[VLEN];

for (int i = 0; i < VLEN; ++i) {
  a[i] = (double)i;
  c[i] = (double)i;
  aa[i] = (double)i;
  cc[i] = (double)i;
  aaa[i] = (double)i;
  ccc[i] = (double)i;
}

for (long i = 0; i < n; ++i) {
  #pragma _NEC nounroll
  #pragma _NEC vector_threshold(1)
  for (int v = 0; v < VLEN; ++v) {
    a[i] += c[v] * i;
    aa[i] += cc[v] * i;
    aaa[v] += ccc[v] * i;
  }
}
For the one-FMA case I comment out the other two lines, and so on.
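For reference, this is roughly how the fraction of peak can be derived from a timing. It is a portable sketch, not the exact harness: VLEN = 256 is an assumption (one full vector register), the NEC pragmas are left out so it compiles anywhere, the helper names are invented for illustration, clock() is only a coarse timer, and the peak rate has to be supplied by the caller.

```c
#include <time.h>

#define VLEN 256 /* assumption: one full vector register */

/* Single-FMA variant of the kernel, indexed with v in the inner loop.
 * Returns a checksum so the compiler cannot drop the work. */
double run_one_fma(long n) {
    double a[VLEN], c[VLEN];
    for (int v = 0; v < VLEN; ++v) {
        a[v] = (double)v;
        c[v] = (double)v;
    }
    for (long i = 0; i < n; ++i) {
        for (int v = 0; v < VLEN; ++v) {
            a[v] += c[v] * (double)i;
        }
    }
    double sum = 0.0;
    for (int v = 0; v < VLEN; ++v) {
        sum += a[v];
    }
    return sum;
}

/* Each FMA counts as two floating-point operations. */
double kernel_flops(long n, int vlen, int fma_lines) {
    return 2.0 * (double)n * (double)vlen * (double)fma_lines;
}

/* Achieved rate divided by a caller-supplied peak rate (flop/s). */
double fraction_of_peak(double flops, double seconds, double peak_flops_per_s) {
    return flops / seconds / peak_flops_per_s;
}

/* Time the single-FMA kernel with clock() and return GFLOP/s (0 if the
 * run was too short for the timer resolution). */
double measure_gflops(long n) {
    clock_t t0 = clock();
    volatile double sink = run_one_fma(n);
    clock_t t1 = clock();
    (void)sink;
    double seconds = (double)(t1 - t0) / CLOCKS_PER_SEC;
    return seconds > 0.0 ? kernel_flops(n, VLEN, 1) / seconds / 1e9 : 0.0;
}
```

The two- and three-FMA variants follow the same pattern with fma_lines set to 2 or 3 in kernel_flops.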

Is there an explanation for this behaviour, for example clever scheduling?

Thank you for your help

Posted by CPTSulu on 17 October 2022 at 10:40.
Edited by CPTSulu on 17 October 2022 at 11:09.

Hello,

The example shows that register renaming and vector instruction chaining work well. (BTW, I suppose the innermost loop should use a[v] and aa[v] instead of a[i] and aa[i].)

If we consider the single-FMA inner loop case, the entire innermost loop consists of only one VFMAD instruction. The arrays a[] and c[] live in vector registers, so they don't need to be loaded or stored inside the loops. Let's suppose that a[] went into %v63 and c[] went into %v62. The outer loop then does something like this:

  • convert i to double and store it into a scalar register (say %s17)
  • VFMAD %v63, %v63, %s17, %v62
  • increment i, check if lower than n, loop if yes.

This loop is executed on the scalar processing unit. When the VFMAD instruction is encountered, it is issued to the vector processing unit (VPU) and the scalar outer loop continues without waiting for the vector instruction to finish. This way we get a bunch of instructions issued to the VPU:
%v63' <-- %v63 + %s17 * %v62 # %v63' actually means a different physical vreg than %v63, this is a renamed vreg. %s17 contains i=0.
%v63'' <-- %v63' + %s17' * %v62 # %v63'' is again a renamed version of %v63'. %s17' is a renamed %s17 and contains i=1.
%v63''' <-- %v63'' + %s17'' * %v62 # %s17'' contains i=2.
etc...

Once the first instruction has produced a slice of 32 results in %v63', the second line can start computing! This is called vector instruction chaining. One VFMAD instruction has some startup latency and then takes 8 cycles to process a 256-element vector register. If both FMA lines access the same %v62 (why should it be copied?), then at some point access to the contents of %v62 from multiple chained instructions might become a limitation, because the number of ports to the registers is limited. Anyway, the fact that you see 2/3 of peak means that at any given time we have 2 of these instructions doing useful work!
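As a rough illustration of the numbers above, here is a toy steady-state model in C. It takes the 32-element slice and the three FMA units as given; the gap between the starts of consecutive chained instructions is a free, hypothetical parameter (not a documented hardware value), and the function names are invented for this sketch.

```c
/* Cycles a vector instruction occupies a unit: one slice of `lanes`
 * elements per cycle, rounded up. With vl = 256 and lanes = 32 this
 * gives the 8 cycles mentioned above. */
int cycles_per_vector_op(int vl, int lanes) {
    return (vl + lanes - 1) / lanes;
}

/* Toy model: if a new chained instruction starts every `gap` cycles and
 * each one occupies a unit for `occupancy` cycles, then on average
 * occupancy / gap instructions are in flight, capped by the number of
 * units. `gap` is a hypothetical parameter, not a measured value. */
double avg_busy_units(int occupancy, int gap, int units) {
    double in_flight = (double)occupancy / (double)gap;
    return in_flight < (double)units ? in_flight : (double)units;
}
```

In this model, two busy units (the observed 2/3 of peak) correspond to a start gap of half the occupancy, e.g. avg_busy_units(8, 4, 3) returns 2.0. What actually limits the gap on the hardware (startup latency, register ports, or something else) is not something this sketch can tell.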

If you now use two or more FMA lines inside the inner loop you get two or three independent streams of instructions that can saturate the three FMA units better. They don't depend on the same %v62.

Thank you for this simple example that achieves peak performance!

Posted by Mr.sp0ck on 24 October 2022 at 07:48.