How to auto-vectorize strided writes with GCC?

Time: 2015-10-17 23:36:47

Tags: c performance gcc vectorization c99

When compiled with -std=c99 -O3 -mavx2 using GCC 5.2, the following code sample auto-vectorizes (assembly here):

#include <stdint.h>

void test(uint32_t *restrict a,
          uint32_t *restrict b) {
  uint32_t *a_aligned = __builtin_assume_aligned(a, 32);
  uint32_t *b_aligned = __builtin_assume_aligned(b, 32);

  for (int i = 0; i < (1L << 10); i += 2) {
    a_aligned[i] = 42 * b_aligned[i];
    a_aligned[i+1] = 3 * a_aligned[i+1];
  }
}

But the following code sample does not auto-vectorize (assembly here):

#include <stdint.h>

void test(uint32_t *restrict a,
          uint32_t *restrict b) {
  uint32_t *a_aligned = __builtin_assume_aligned(a, 32);
  uint32_t *b_aligned = __builtin_assume_aligned(b, 32);

  for (int i = 0; i < (1L << 10); i += 2) {
    a_aligned[i] = 42 * b_aligned[i];
    a_aligned[i+1] = a_aligned[i+1];
  }
}

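The only difference between the samples is the scaling factor applied to a_aligned[i+1]: the first sample multiplies it by 3, while the second stores it back unchanged.

The same holds for GCC 4.8, 4.9, and 5.1. Adding volatile to the declaration of a_aligned inhibits auto-vectorization completely. For us, the first sample consistently runs faster than the second, and the speedup is more pronounced for smaller types (e.g. uint8_t instead of uint32_t).

Is there a way to get the second code sample to auto-vectorize with GCC?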
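
For context, the self-assignment in the second sample changes no values, so that loop is functionally a plain stride-2 write. The sketch below spells this out by comparing the two forms on the same input. The names test_strided, test_selfcopy, count_differences, and the macro N are hypothetical and not part of the original post, and the __builtin_assume_aligned calls are left out for brevity; this only illustrates the equivalence and says nothing about what GCC will vectorize.

```c
#include <stdint.h>

#define N 1024  /* same trip count as (1L << 10) in the samples */

/* Hypothetical name: a plain stride-2 write with no odd-index store. */
void test_strided(uint32_t *restrict a, const uint32_t *restrict b) {
  for (int i = 0; i < N; i += 2) {
    a[i] = 42 * b[i];
  }
}

/* The loop body of the second sample, kept for comparison. */
void test_selfcopy(uint32_t *restrict a, const uint32_t *restrict b) {
  for (int i = 0; i < N; i += 2) {
    a[i] = 42 * b[i];
    a[i+1] = a[i+1];  /* stores back the value already there */
  }
}

/* Hypothetical helper: returns how many elements differ between the two forms. */
int count_differences(void) {
  static uint32_t a1[N], a2[N], b[N];
  for (int i = 0; i < N; i++) {
    a1[i] = a2[i] = (uint32_t)(7 * i + 1);
    b[i] = (uint32_t)i;
  }
  test_strided(a1, b);
  test_selfcopy(a2, b);

  int diff = 0;
  for (int i = 0; i < N; i++) {
    diff += (a1[i] != a2[i]);
  }
  return diff;
}
```

Both functions leave the odd-indexed elements untouched, so count_differences() is expected to report no mismatches.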

1 answer:

Answer 0: (score: 1)

The following version vectorizes, but it's ugly if you ask me...

#include <stdint.h>

void test(uint32_t *a, uint32_t *aa,
          uint32_t *restrict b) {
  #pragma omp simd aligned(a,aa,b:32)
  for (int i = 0; i < (1L << 10); i += 2) {
    a[i] = 2 * b[i];
    a[i+1] = aa[i+1];
  }
}

Compile with -fopenmp and call it as test(a, a, b).
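
A minimal sketch of such a call, assuming the buffers are static arrays aligned to 32 bytes with GCC's aligned attribute so that the aligned(...:32) clause holds. The helper run_test_example, the macro N, and the fill values are hypothetical and only serve to check the result; the function test is copied from the answer unchanged.

```c
#include <stdint.h>

#define N 1024  /* matches the (1L << 10) loop bound */

/* The answer's function, unchanged. */
void test(uint32_t *a, uint32_t *aa,
          uint32_t *restrict b) {
  #pragma omp simd aligned(a,aa,b:32)
  for (int i = 0; i < (1L << 10); i += 2) {
    a[i] = 2 * b[i];
    a[i+1] = aa[i+1];
  }
}

/* Hypothetical helper: calls test(a, a, b) and returns the number of
   elements that do not hold the expected value afterwards. */
int run_test_example(void) {
  /* 32-byte alignment is what the aligned(...:32) clause promises. */
  static uint32_t a[N] __attribute__((aligned(32)));
  static uint32_t b[N] __attribute__((aligned(32)));

  for (int i = 0; i < N; i++) {
    a[i] = 100u + (uint32_t)i;
    b[i] = (uint32_t)i;
  }

  test(a, a, b);  /* the same pointer is passed for both a and aa */

  int bad = 0;
  for (int i = 0; i < N; i += 2) {
    bad += (a[i] != 2u * (uint32_t)i);              /* even: 2 * b[i] */
    bad += (a[i+1] != 100u + (uint32_t)(i + 1));    /* odd: unchanged */
  }
  return bad;
}
```

Without -fopenmp the pragma is simply ignored, so the helper should still return 0; the flag only affects whether the loop is treated as a SIMD loop.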