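In my C program I count clock cycles on a 64-bit Intel Core i5-2410M (Sandy Bridge) machine running Windows 10 Home, but something is odd. I compile the program with Code::Blocks (CB) 16.01 in a release build at -O2 and at -O3. With -O2 the cycle count looks fine, but -O3 reports 0 cycles. For now I have not accounted for Turbo Boost and Hyper-Threading, but I will certainly disable them later.
I compile with the following commands: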
mingw32-gcc.exe -Wall -O2 -m32 -IC:\GMP\include -c "E:\abc\main.c" -o obj\Release\main.o
mingw32-gcc.exe -Wall -O3 -m32 -IC:\GMP\include -c "E:\abc\main.c" -o obj\Release\main.o
The function being timed is:
void schoolbook_9(int32_t *X, int32_t *Y, int64_t *Z){
Z[0] = (int64_t)X[0]*Y[0] + (int64_t)X[1]*Y[1] + (int64_t)X[2]*Y[2] + (int64_t)X[3]*Y[3] + (int64_t)X[4]*Y[4] + (int64_t)X[5]*Y[5] + (int64_t)X[6]*Y[6] + (int64_t)X[7]*Y[7] + (int64_t)X[8]*Y[8];
Z[1] = (int64_t)X[9]*Y[0] + (int64_t)X[0]*Y[1] + (int64_t)X[1]*Y[2] + (int64_t)X[2]*Y[3] + (int64_t)X[3]*Y[4] + (int64_t)X[4]*Y[5] + (int64_t)X[5]*Y[6] + (int64_t)X[6]*Y[7] + (int64_t)X[7]*Y[8];
Z[2] = (int64_t)X[10]*Y[0] + (int64_t)X[9]*Y[1] + (int64_t)X[0]*Y[2] + (int64_t)X[1]*Y[3] + (int64_t)X[2]*Y[4] + (int64_t)X[3]*Y[5] + (int64_t)X[4]*Y[6] + (int64_t)X[5]*Y[7] + (int64_t)X[6]*Y[8];
Z[3] = (int64_t)X[11]*Y[0] + (int64_t)X[10]*Y[1] + (int64_t)X[9]*Y[2] + (int64_t)X[0]*Y[3] + (int64_t)X[1]*Y[4] + (int64_t)X[2]*Y[5] + (int64_t)X[3]*Y[6] + (int64_t)X[4]*Y[7] + (int64_t)X[5]*Y[8];
Z[4] = (int64_t)X[12]*Y[0] + (int64_t)X[11]*Y[1] + (int64_t)X[10]*Y[2] + (int64_t)X[9]*Y[3] + (int64_t)X[0]*Y[4] + (int64_t)X[1]*Y[5] + (int64_t)X[2]*Y[6] + (int64_t)X[3]*Y[7] + (int64_t)X[4]*Y[8];
Z[5] = (int64_t)X[13]*Y[0] + (int64_t)X[12]*Y[1] + (int64_t)X[11]*Y[2] + (int64_t)X[10]*Y[3] + (int64_t)X[9]*Y[4] + (int64_t)X[0]*Y[5] + (int64_t)X[1]*Y[6] + (int64_t)X[2]*Y[7] + (int64_t)X[3]*Y[8];
Z[6] = (int64_t)X[14]*Y[0] + (int64_t)X[13]*Y[1] + (int64_t)X[12]*Y[2] + (int64_t)X[11]*Y[3] + (int64_t)X[10]*Y[4] + (int64_t)X[9]*Y[5] + (int64_t)X[0]*Y[6] + (int64_t)X[1]*Y[7] + (int64_t)X[2]*Y[8];
Z[7] = (int64_t)X[15]*Y[0] + (int64_t)X[14]*Y[1] + (int64_t)X[13]*Y[2] + (int64_t)X[12]*Y[3] + (int64_t)X[11]*Y[4] + (int64_t)X[10]*Y[5] + (int64_t)X[9]*Y[6] + (int64_t)X[0]*Y[7] + (int64_t)X[1]*Y[8];
Z[8] = (int64_t)X[16]*Y[0] + (int64_t)X[15]*Y[1] + (int64_t)X[14]*Y[2] + (int64_t)X[13]*Y[3] + (int64_t)X[12]*Y[4] + (int64_t)X[11]*Y[5] + (int64_t)X[10]*Y[6] + (int64_t)X[9]*Y[7] + (int64_t)X[0]*Y[8];}
I count the clock cycles as follows:
int32_t X[17], Y[9];
int64_t Z[9];
utype64 start, end;
uint32_t i;
srand(time(NULL));
for(i=0; i<17; i++)
X[i] = rand()%(uint32_t)pow(2.0, 29);
srand(time(NULL));
for(i=0; i<9; i++)
Y[i] = rand()%(uint32_t)pow(2.0, 29);
start=rdtsc();
end=rdtscp();
start=rdtsc();
for(i=0; i<10000000; i++)
schoolbook_9(X, Y, Z);
end=rdtscp();
printf("\n%s%"PRIu64"\n", "The cycles count using SB of size 9 is :: ", (end-start)/10000000);
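I use the rdtscp instruction because my system supports it, though it may not be available on 32-bit machines, so I have tested my program both with and without rdtscp. The parameters X, Y and Z are arrays: X and Y hold 32-bit elements and Z holds 64-bit elements.
So my question is: how do I get a cycle count at -O3? With the current code I get 0 cycles.
The flag -ftree-loop-vectorize is enabled at -O3, as described at https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html. Does this mean the loop has been vectorized? If so, how can I determine the vector length (4 elements, 6 elements, etc.)?
Answer 0 (score: 1)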
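This is because end - start is lower than 10000000 at -O3, so your integer division produces 0. Printing the remainder as well shows what was actually measured: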
utype64 result = end - start;
utype64 cycle = 10000000;
utype64 total = result / cycle;
utype64 rest = result % cycle;
printf("The cycles count using SB of size 9 is %" PRIu64
" and the rest is %" PRIu64 "\n",
total, rest);
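As an alternative to printing quotient and remainder separately, the average can be reported as a floating-point value so that a total below the iteration count does not truncate to 0. This is only a minimal sketch: the helper name cycles_per_iteration is hypothetical and does not appear in the question.

```c
#include <stdint.h>

/* Hypothetical helper: average cycles per iteration as a double, so a
   cycle total smaller than the iteration count does not truncate to 0. */
double cycles_per_iteration(uint64_t start, uint64_t end, uint64_t iterations)
{
    if (iterations == 0)
        return 0.0;                       /* avoid dividing by zero */
    return (double)(end - start) / (double)iterations;
}
```

It could be used as printf("%.3f cycles per call\n", cycles_per_iteration(start, end, 10000000));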
You should not call srand(time(NULL)); twice. It is useless and can produce odd behavior.
Note: I could not test this myself.
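To illustrate the odd behavior: time(NULL) normally has one-second resolution, so two calls made back to back will almost always return the same seed, and the second srand restarts the same sequence. The sketch below uses a fixed seed in place of time(NULL) to make this reproducible; fill_random and reseed_repeats are hypothetical names, not part of the question's code.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: fill a[0..n-1] with values below 2^29. */
void fill_random(int32_t *a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = (int32_t)((uint32_t)rand() % (UINT32_C(1) << 29));
}

/* Returns 1 if reseeding with the same value makes y repeat the start of x. */
int reseed_repeats(unsigned seed)
{
    int32_t x[17], y[9];

    srand(seed);
    fill_random(x, 17);
    srand(seed);            /* same seed again, as when time(NULL) has not advanced */
    fill_random(y, 9);

    for (size_t i = 0; i < 9; i++)
        if (x[i] != y[i])
            return 0;
    return 1;
}
```

With a single srand call before both loops, Y would continue the sequence instead of copying the first nine values of X.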
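Why end - start becomes so small at -O3 is not established above. One plausible explanation, stated here as an assumption, is that GCC inlines schoolbook_9 and then removes or hoists the repeated work because Z is never read afterwards. A common GCC/Clang idiom for keeping a benchmarked call alive is an empty asm statement with a "memory" clobber. The sketch below uses a hypothetical stand-in kernel dot9 and hypothetical helpers keep_alive and run_kernel; it is not the original program.

```c
#include <stdint.h>

/* Stand-in for the routine being timed (hypothetical, not schoolbook_9). */
void dot9(const int32_t *x, const int32_t *y, int64_t *z)
{
    int64_t acc = 0;
    for (int k = 0; k < 9; k++)
        acc += (int64_t)x[k] * y[k];
    z[0] = acc;
}

/* GCC/Clang-specific empty asm statement. The compiler must assume it may
   read or write the memory behind these pointers, so the stores to z stay
   live and the call cannot simply be hoisted out of the loop. */
static inline void keep_alive(const void *x, const void *y, void *z)
{
    __asm__ volatile("" : : "r"(x), "r"(y), "r"(z) : "memory");
}

/* Calls the kernel `iters` times and returns the last result. */
int64_t run_kernel(const int32_t *x, const int32_t *y, uint32_t iters)
{
    int64_t z[1] = {0};
    for (uint32_t i = 0; i < iters; i++) {
        dot9(x, y, z);
        keep_alive(x, y, z);
    }
    return z[0];
}
```

In the question's loop, the equivalent change would be a call such as keep_alive(X, Y, Z); right after schoolbook_9(X, Y, Z);. Comparing the generated assembly with and without it is the way to confirm whether the loop was really being removed.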