Hello,
I tried to write a kernel that reaches the theoretical DP peak performance. The Aurora has three vector FMA units per core that can execute independent operations, so a kernel with a single FMA operation should theoretically reach 1/3 of peak performance, one with two operations 2/3, and one with three operations 100%.
But I noticed something peculiar: a kernel with one FMA operation already reaches 2/3 of peak performance, a kernel with two FMA operations reaches ~80%, and one with three reaches the peak.
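For reference, this is the flop accounting behind my 1/3, 2/3, 100% expectation. It is only a sketch with assumed machine numbers (1.4 GHz clock, 32 DP lanes per FMA pipe, three pipes per core, i.e. roughly a VE Type 10B); adjust them for your card:

#include <stdio.h>

int main(void) {
    const double ghz            = 1.4;   /* assumed VE core clock */
    const double lanes_per_pipe = 32.0;  /* assumed DP elements each FMA pipe retires per cycle */
    const double fma_pipes      = 3.0;   /* vector FMA pipes per core */
    /* each FMA counts as 2 flops (multiply + add) */
    const double peak_gflops = 2.0 * lanes_per_pipe * fma_pipes * ghz;

    for (int k = 1; k <= 3; ++k) {
        /* k independent FMA chains keep k of the 3 pipes busy */
        double expected = 2.0 * lanes_per_pipe * k * ghz;
        printf("%d FMA chain(s): %6.1f GFLOPS expected = %3.0f%% of %.1f GFLOPS peak\n",
               k, expected, 100.0 * k / fma_pipes, peak_gflops);
    }
    return 0;
}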
The code I used is the following:
double a[VLEN],   c[VLEN];
double aa[VLEN],  cc[VLEN];
double aaa[VLEN], ccc[VLEN];

/* initialise all operands */
for (int i = 0; i < VLEN; ++i) {
    a[i]   = (double)i;
    c[i]   = (double)i;
    aa[i]  = (double)i;
    cc[i]  = (double)i;
    aaa[i] = (double)i;
    ccc[i] = (double)i;
}

/* measured kernel: each statement is an independent FMA chain */
for (long i = 0; i < n; ++i)
{
#pragma _NEC nounroll
#pragma _NEC vector_threshold(1)
    for (int v = 0; v < VLEN; ++v) {
        a[v]   += c[v]   * i;   /* FMA chain 1 */
        aa[v]  += cc[v]  * i;   /* FMA chain 2 */
        aaa[v] += ccc[v] * i;   /* FMA chain 3 */
    }
}
To measure one FMA operation I commented out the other two statements, and so on.
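For completeness, this is roughly how I derive GFLOP/s from the trip counts. It is a minimal self-contained sketch, not my exact harness; the kernel body here is only the single-chain variant, and num_fma has to match however many FMA statements are active:

#include <stdio.h>
#include <time.h>

#define VLEN 256                 /* assumed vector length; set to whatever you use */

static double wtime(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void) {
    const long n       = 10000000L;  /* outer trip count, chosen large enough to time */
    const int  num_fma = 1;          /* number of active FMA statements in the kernel */
    double a[VLEN], c[VLEN];
    for (int v = 0; v < VLEN; ++v) { a[v] = (double)v; c[v] = (double)v; }

    double t0 = wtime();
    for (long i = 0; i < n; ++i)
        for (int v = 0; v < VLEN; ++v)
            a[v] += c[v] * i;        /* stand-in for the measured kernel */
    double t1 = wtime();

    /* mul + add = 2 flops per element, per active FMA statement */
    double flops = 2.0 * (double)n * VLEN * num_fma;
    printf("%.2f GFLOPS (checksum %g)\n", flops / (t1 - t0) * 1e-9, a[VLEN - 1]);
    return 0;
}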
Is there an explanation for this behaviour, for example clever scheduling?
Thank you for your help
Posted by CPTSulu on 17 October 2022 at 10:40. Edited by CPTSulu on 17 October 2022 at 11:09.