Unfortunately, the Intel compiler fails to vectorize the following i and j loops:
#pragma ivdep
#pragma vector always
for ( i = 1 ; i < N ; i++ ){ // <- not vectorized
    for ( j = 0 ; j < i ; j++ ){ // <- not vectorized
        // Matrix multiplication on C = x(i) * x(j)
        cblas_dgemm(...,&x[i*BLOCKSIZE*D],... , &x[j*BLOCKSIZE*D],..., c);
        int accum=0;
        // following loop gets vectorized well
        #pragma omp parallel for reduction(+:accum) collapse(2)
        for ( int k = 0 ; k < BLOCKSIZE ; k++ ){
            for ( int l =0 ; l < BLOCKSIZE ; l++ ){
                accum += C[k * NRC + l] + p[j*BLOCKSIZE + l] + p[i*BLOCKSIZE+k];
            }
        }
        total += accum;
    }
}
The vectorization report says:
LOOP BEGIN at i-th loop:
remark #15521: loop was not vectorized: loop control variable was not identified. Explicitly compute the iteration count before executing the loop or try using canonical loop form from OpenMP specification
LOOP BEGIN at j-th loop:
remark #15521: loop was not vectorized: loop control variable was not identified. Explicitly compute the iteration count before executing the loop or try using canonical loop form from OpenMP specification
LOOP END
LOOP END
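I'm really confused, because I thought the control variables i and j were obvious, and I believe the loops are already in the canonical loop form from the OpenMP specification. The k and l loops are vectorized without problems. Any guesses?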
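
For reference, here is a minimal sketch of what the remark's two suggestions (canonical loop form and an explicitly computed iteration count) could look like for this triangular loop nest. The names pair_work and sum_over_pairs are hypothetical, and pair_work is only a stand-in for the cblas_dgemm call plus the k/l reduction. This only illustrates the loop shape; it is not a claim that the compiler will vectorize the outer loops, since the real body still contains a library call and an OpenMP parallel region.

```c
#include <assert.h>

/* Hypothetical stand-in for the per-pair work in the question
   (the dgemm call plus the k/l reduction); it just returns i + j. */
static long pair_work(int i, int j)
{
    return (long)i + (long)j;
}

/* Visit every pair (i, j) with 0 <= j < i < n.
   Loop counters are declared in the for statements, bounds are
   loop-invariant locals, and the increment is a plain ++, which
   matches the init / test / increment shape of OpenMP's canonical
   loop form. */
long sum_over_pairs(int n)
{
    assert(n >= 0);
    const int n_outer = n;          /* outer bound, fixed before the loop */
    long total = 0;

    for (int i = 1; i < n_outer; i++) {
        const int j_count = i;      /* inner trip count, computed up front */
        for (int j = 0; j < j_count; j++) {
            total += pair_work(i, j);
        }
    }
    return total;
}
```

With n = 4 the visited pairs are (1,0), (2,0), (2,1), (3,0), (3,1), (3,2), so the stand-in returns 1 + 2 + 3 + 3 + 4 + 5 = 18.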