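I want to vectorize a[i] = a[i-1] + c using AVX2 instructions. Because of the loop-carried dependency, it looks like it cannot be vectorized. I have vectorized it anyway and want to share my answer here, to see whether there is a better answer to this problem or whether my solution is good.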
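For reference, here is a minimal sketch of the loop in question next to its closed form. The function names and the seed value b for a[0] are my own placeholders, not something from the original post. Unrolling the recurrence gives a[i] = a[0] + i*c, so every element can be computed without looking at its neighbour, which is what makes vectorization possible at all.

```c
#include <stddef.h>

/* Scalar reference: each element depends on the one before it. */
void fill_scalar(int *a, size_t n, int b, int c)
{
    if (n == 0)
        return;
    a[0] = b;
    for (size_t i = 1; i < n; i++)
        a[i] = a[i - 1] + c;      /* loop-carried dependency */
}

/* Same values from the closed form a[i] = b + i*c:
   the iterations no longer depend on each other. */
void fill_closed_form(int *a, size_t n, int b, int c)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b + (int)i * c;
}
```

Both functions should fill the array with the same values as long as b + i*c does not overflow int.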
Answer 0 (score: 1)
I implemented the following function to vectorize it, and it seems to work! It is 2.5x faster than gcc -O3.
Here is the solution:
// vectorized
#include <immintrin.h>

// LEN is assumed to be a compile-time constant (>= 8) defined before this point
static inline void vec(int a[LEN], int b, int c)
{
    // b=1 and c=2 in this case
    int i = 0;
    a[i++] = b;           //0 --> a[0] = 1
    //step 1:
    //resolving the dependency by hand for the first 8 elements (vectorization factor is 8)
    a[i++] = a[0] + 1*c;  //1 --> a[1] = 1 + 2 = 3
    a[i++] = a[0] + 2*c;  //2 --> a[2] = 1 + 4 = 5
    a[i++] = a[0] + 3*c;  //3 --> a[3] = 1 + 6 = 7
    a[i++] = a[0] + 4*c;  //4 --> a[4] = 1 + 8 = 9
    a[i++] = a[0] + 5*c;  //5 --> a[5] = 1 + 10 = 11
    a[i++] = a[0] + 6*c;  //6 --> a[6] = 1 + 12 = 13
    a[i++] = a[0] + 7*c;  //7 --> a[7] = 1 + 14 = 15
    // vectorization factor reached
    // from here on a[i] = a[i-8] + 8*c holds for every element
    __m256i dep1, dep2;   // after the first load, dep1 = { 1, 3, 5, 7, 9, 11, 13, 15 }
    __m256i coeff = _mm256_set1_epi32(8*c); //coeff = { 16, 16, 16, 16, 16, 16, 16, 16 }
    //step 2:
    // two vectors (16 elements) per iteration, only while a full block still fits;
    // unaligned load/store so that a[] need not be 32-byte aligned
    for (; i + 16 <= LEN; i += 16) {
        dep1 = _mm256_loadu_si256((const __m256i *) &a[i-8]);
        dep1 = _mm256_add_epi32(dep1, coeff);
        _mm256_storeu_si256((__m256i *) &a[i], dep1);
        dep2 = _mm256_add_epi32(dep1, coeff); // a[i+8..i+15], computed from the register
        _mm256_storeu_si256((__m256i *) &a[i+8], dep2);
    }
    // scalar tail for the elements that do not fill a block of 16
    for (; i < LEN; i++)
        a[i] = a[i-1] + c;
}
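As a possible variation on the same idea (a sketch of mine, not the code measured above), the running vector can stay in a register the whole time, so the loop never reads a[] back. The names fill and fill_avx2 are hypothetical, and the length is passed as a parameter instead of LEN. I am also assuming GCC or Clang on x86 here: `__attribute__((target("avx2")))` and `__builtin_cpu_supports` are compiler extensions, used so the file builds without -mavx2 and falls back to scalar code on CPUs without AVX2.

```c
#include <stddef.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define HAVE_AVX2_PATH 1

/* AVX2 body. The target attribute (GCC/Clang extension) enables AVX2
   for this one function only. */
__attribute__((target("avx2")))
static void fill_avx2(int *a, size_t n, int b, int c)
{
    /* lane k starts at b + k*c, i.e. a[0..7] */
    __m256i v = _mm256_setr_epi32(b,         b + c,     b + 2 * c, b + 3 * c,
                                  b + 4 * c, b + 5 * c, b + 6 * c, b + 7 * c);
    const __m256i step = _mm256_set1_epi32(8 * c);
    size_t i = 0;

    /* store 8 elements, then advance every lane by 8*c */
    for (; i + 8 <= n; i += 8) {
        _mm256_storeu_si256((__m256i *)&a[i], v);
        v = _mm256_add_epi32(v, step);
    }
    /* scalar tail for the last n % 8 elements */
    for (; i < n; i++)
        a[i] = b + (int)i * c;
}
#endif

/* Fills a[i] = b + i*c for i in [0, n), using AVX2 when the CPU has it. */
void fill(int *a, size_t n, int b, int c)
{
#ifdef HAVE_AVX2_PATH
    if (__builtin_cpu_supports("avx2")) {
        fill_avx2(a, n, b, c);
        return;
    }
#endif
    for (size_t i = 0; i < n; i++)
        a[i] = b + (int)i * c;
}
```

Calling fill(a, LEN, 1, 2) should give the same array as vec(a, 1, 2) above. I have not benchmarked this variant, so whether it beats the version with loads is something to measure on your own machine.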