Question

I get a seg fault in a loop only when the loop is fully vectorized on an AVX machine (Intel(R) Core(TM) i5-3570K CPU @ 3.40GHz).

Compiled with gcc -c -march=native MyClass.cpp -O3 -ftree-vectorizer-verbose=6

I am trying to align the array to avoid these messages from -ftree-vectorizer-verbose=6:

MyClass.cpp:352: note: dependence distance modulo vf == 0 between this_7(D)->x[i_101] and this_7(D)->x[i_101]
MyClass.cpp:352: note: vect_model_load_cost: unaligned supported by hardware.
MyClass.cpp:352: note: vect_get_data_access_cost: inside_cost = 2, outside_cost = 0.
MyClass.cpp:352: note: vect_model_store_cost: unaligned supported by hardware.
MyClass.cpp:352: note: vect_get_data_access_cost: inside_cost = 2, outside_cost = 0.
MyClass.cpp:352: note: Alignment of access forced using peeling.
MyClass.cpp:352: note: vect_model_load_cost: aligned.
MyClass.cpp:352: note: vect_model_load_cost: inside_cost = 1, outside_cost = 0 .
MyClass.cpp:352: note: vect_model_simple_cost: inside_cost = 1, outside_cost = 1 .
MyClass.cpp:352: note: vect_model_store_cost: aligned.
MyClass.cpp:352: note: vect_model_store_cost: inside_cost = 1, outside_cost = 0 .
MyClass.cpp:352: note: cost model: prologue peel iters set to vf/2.
MyClass.cpp:352: note: cost model: epilogue peel iters set to vf/2 because peeling for alignment is unknown .

What I want to see (and do see) is:

MyClass.cpp:352: note: dependence distance modulo vf == 0 between this_7(D)->x[i_101] and this_7(D)->x[i_101]
MyClass.cpp:352: note: vect_model_load_cost: aligned.
MyClass.cpp:352: note: vect_get_data_access_cost: inside_cost = 1, outside_cost = 0.
MyClass.cpp:352: note: vect_model_store_cost: aligned.
MyClass.cpp:352: note: vect_get_data_access_cost: inside_cost = 2, outside_cost = 0.
MyClass.cpp:352: note: vect_model_load_cost: aligned.
MyClass.cpp:352: note: vect_model_load_cost: inside_cost = 1, outside_cost = 0 .
MyClass.cpp:352: note: vect_model_simple_cost: inside_cost = 1, outside_cost = 1 .
MyClass.cpp:352: note: vect_model_store_cost: aligned.
MyClass.cpp:352: note: vect_model_store_cost: inside_cost = 1, outside_cost = 0 .

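Now, I am no C/C++/assembler guru, but when I get a seg fault I assume some pointer/array/other mistake in my code is to blame, and that the fully vectorized loop is merely exposing it. After two days of learning assembler, though, I still cannot track it down. So here I am.

The code looks like this (hopefully I have included everything relevant; I cannot share the actual .cpp in full here):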

class MyClass {

private:
    static const long maxElems = 1024;
    static const double otherVar = 0.9;
    double x[maxElems] __attribute__ ((aligned (32)));  // gcc reports fully vectorized
    //double x[maxElems];   // leads to unaligned peeling

public:
    void myFunc() {
        // Always works
        for (int i=0; i<maxElems; ++i) printf("Test: %d %.4e\n", i, x[i]);

        // Seg fault if fully vectorized (no peeling)
        for (int i=0; i<maxElems; ++i) {
            x[i] = x[i] - 42;
        } 

        // Works if no seg fault earlier
        for (int i=0; i<maxElems; ++i) printf("Test: %d %.4e\n", i, x[i]);
    }
};

When it is fully vectorized, I see (using the -Wa,-alh flags to view the assembler):

 989      00
 990 0b56 488B4424      movq    40(%rsp), %rax
 990      28
 991 0b5b C5FD280D      vmovapd .LC8(%rip), %ymm1
 991      00000000 
 992                    .p2align 4,,10
 993 0b63 0F1F4400      .p2align 3
 993      00
 994                .L153:
 995 0b68 C5FD2800      vmovapd (%rax), %ymm0
 996 0b6c C5FD5CC1      vsubpd  %ymm1, %ymm0, %ymm0
 997 0b70 C5FD2900      vmovapd %ymm0, (%rax)
 998 0b74 4883C020      addq    $32, %rax
 999 0b78 4C39E0        cmpq    %r12, %rax
 1000 0b7b 75EB             jne .L153

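Again, the "don't know assembler" caveat applies, but I did spend quite a bit of time printing pointers and examining the assembler to convince myself that this loop starts and ends at the start and end of the array. However, when I get the seg fault, the starting address of x is not divisible by 32. I think that is what causes the trouble.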

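Yes, I know I could allocate x on the heap and choose where it ends up so that it is aligned. But part of my experiment here is to have MyClass be a fixed size with all of its data inside it (think: cache efficiency), so I allocate MyClass instances on the heap, keep pointers to them in a collection, and x lives inside MyClass.

Isn't the aligned attribute supposed to put x on a 32-byte boundary? The compiler assumes it does, and then vmovapd blows up because it isn't, right?

GCC documentation on alignment: https://gcc.gnu.org/onlinedocs/gcc/Variable-Attributes.html

Do I have to align MyClass on the heap somehow? How do I do that? And how do I tell GCC that I have done so, so that it vectorizes the way I want?

EDIT: I have solved the problem (thanks in part to the comments and the answer below). Alignment can be guaranteed when the object is created on the heap by overriding the default operator new. When I do this, I get no seg faults and my code is still fully vectorized the way I wanted. Here is how I did it: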

static void* operator new(size_t size) throw (std::bad_alloc) {
    void *alignedPointer;
    int alignError = 0;

    // Try to allocate the required amount of memory (using POSIX standard aligned allocation)
    alignError = posix_memalign(&alignedPointer, VECTOR_ALIGN_BYTES, size);

    // Throw/Report error if any
    if (alignError) {
        throw std::bad_alloc();
    }

    // Return a pointer to this aligned memory location
    return alignedPointer;
}

static void operator delete(void* alignedPointer) {
    // POSIX aligned memory allocation can be freed normally with free()
    free(alignedPointer);
}

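C++ calls the constructor after operator new and the destructor before operator delete for you, so the alignment is controlled by the class itself. There are other aligned memory allocators if you have different preferences. I used POSIX.

Two caveats: if someone uses placement new with an arbitrary address, you still will not be aligned. And if someone declares your class as a member of their class, and their class is allocated on the heap, it may be unaligned. I put a check for this in my constructor and throw an error when it is detected.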

Answer 1

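__attribute__((aligned(32)))

probably does not do what we think it does (bug? feature?).

It basically tells the compiler that it may assume the thing is aligned, which it may not be. If the object is on the heap, you need to allocate it with posix_memalign or something similar.

If __attribute__((aligned(...))) is set but the allocation is not aligned, GCC will actually get pointer arithmetic wrong: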

s2->aligned_var = 0x199c030
&s2->aligned_var % 0x40  = 0x0

https://gcc.gnu.org/ml/gcc/2014-06/msg00308.html

Seg fault in AVX vectorized code with GCC __attribute__ aligned 32 bytes