Question

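I'm trying out auto-vectorization with g++ 5.4 (-ftree-vectorize). I noticed that the array version in the code below causes the compiler to miss the vectorization opportunity in the inner loop, which results in a significant performance difference compared to the pointer version. Is there any way to help the compiler in this case?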

void floydwarshall(float* mat, size_t n) {
#if USE_POINTER
    for (int k = 0; k < n; ++k) {
        for (int i = 0; i < n; ++i) {
            auto v = mat[i*n + k];
            for (int j = 0; j < n; ++j) {
                auto val = v + mat[k*n+j];
                if (mat[i*n + j] > val) {
                    mat[i*n + j] = val;
                }
            }
        }
    }
#else // USE_ARRAY
    typedef float (*array)[n];
    array m = reinterpret_cast<array>(mat);
    for (int k = 0; k < n; ++k) {
        for (int i = 0; i < n; ++i) {
            auto v = m[i][k];
            for (int j = 0; j < n; ++j) {
                auto val = v + m[k][j];
                if (m[i][j] > val) {
                    m[i][j] = val;
                }
            }
        }
    }
#endif
}

Answer 1

Both versions vectorize with g++ 5.4 -O3 -march=haswell, using vcmpltps / vmaskmovps in the inner loop, as Marc pointed out.

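If you don't let the compiler use AVX instructions, it has a harder time. But with only -O3 (so only SSE2 is available, because that is the baseline for x86-64), I don't see either version vectorize at all. So your original question is based on a result I can't reproduce.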

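Changing the if() to a ternary operator (so the code always stores to the array) lets the compiler do load / MINPS / unconditional store. That is a lot of memory traffic if your matrix doesn't fit in cache; maybe you can arrange your loops differently? Or maybe not, since m[i][k] is needed, and I assume the order in which things happen matters.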
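To make the difference concrete, here is a minimal sketch of the two inner-loop forms side by side. The helper names and the row-pointer signature are hypothetical (they are not from the question); both functions compute the same result, and only the second one writes every element back.

```cpp
#include <cstddef>

// Hypothetical helpers: each relaxes row i against row k, with v = m[i][k].

// Conditional store: an element is written only when the new value is smaller.
void relax_row_conditional(float* row_i, const float* row_k, float v, std::size_t n) {
    for (std::size_t j = 0; j < n; ++j) {
        float val = v + row_k[j];
        if (row_i[j] > val) {
            row_i[j] = val;
        }
    }
}

// Unconditional store: every element is written back, so the loop body is a
// plain load / min / store with no control flow around the store.
void relax_row_unconditional(float* row_i, const float* row_k, float v, std::size_t n) {
    for (std::size_t j = 0; j < n; ++j) {
        float val = v + row_k[j];
        row_i[j] = (row_i[j] > val) ? val : row_i[j];
    }
}
```

Whether a given compiler version turns the second form into MINPS still depends on the target flags, so it is worth checking the generated assembly.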

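If updates are very infrequent and write-back of dirty data is contributing to a memory bottleneck, it might even be worth branching to avoid the store when none of the vector elements were modified.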
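The branch-to-skip-the-store idea can be sketched in portable C++ as follows. This is only an illustration under assumptions: the function name is hypothetical, n is assumed to be a multiple of 8, and a real implementation would more likely use intrinsics (compare, then test the mask) rather than a temporary buffer. The return value exists only so the behavior can be observed.

```cpp
#include <cstddef>

// Hypothetical helper: relaxes row i against row k in chunks of 8 floats
// (one AVX vector's worth) and writes a chunk back only if at least one
// element in it improved. Assumes n is a multiple of 8.
// Returns the number of chunks that were actually stored.
std::size_t relax_row_skip_clean(float* row_i, const float* row_k, float v, std::size_t n) {
    constexpr std::size_t kLanes = 8;
    std::size_t stored_chunks = 0;
    for (std::size_t j = 0; j < n; j += kLanes) {
        float tmp[kLanes];
        bool changed = false;
        for (std::size_t l = 0; l < kLanes; ++l) {
            float val = v + row_k[j + l];
            float cur = row_i[j + l];
            changed |= (cur > val);
            tmp[l] = (cur > val) ? val : cur;
        }
        if (changed) {  // one branch per chunk, not per element
            for (std::size_t l = 0; l < kLanes; ++l) {
                row_i[j + l] = tmp[l];
            }
            ++stored_chunks;
        }
    }
    return stored_chunks;
}
```

This only pays off if the branch predicts well, that is, if most chunks really are left untouched; otherwise the unconditional store is likely the better choice.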

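Here's an array version that vectorizes well, even with just SSE2. I added code to tell the compiler that the input is aligned and that the size is a multiple of 8 (the number of floats per AVX vector). If your real code can't make those assumptions, take that part out. It makes the vectorized part easier to find, because it isn't buried in scalar intro/cleanup code. (Using -O2 -ftree-vectorize doesn't fully unroll the cleanup code this way, but -O3 does.)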

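I notice that without AVX, gcc still uses unaligned loads but aligned stores. Maybe it doesn't realize that, with the size being a multiple of 8, m[i][j] should be aligned whenever m[k][j] is aligned? This might be the difference between the pointer version and the array version.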
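The alignment argument is just address arithmetic: row i starts i*n*sizeof(float) bytes past the base, and when n is a multiple of 8 that offset is a multiple of 32. A small sketch that checks this at run time (the helper name is hypothetical, and it assumes the usual 4-byte float):

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical check: does row i of an n-column float matrix start on a
// 32-byte boundary? With a 32-byte-aligned base and n % 8 == 0 this holds
// for every row, because each row is a whole number of 32-byte vectors.
bool row_is_aligned32(const float* mat, std::size_t n, std::size_t i) {
    auto addr = reinterpret_cast<std::uintptr_t>(mat + i * n);
    return addr % 32 == 0;
}
```

With n = 4, for example, row 1 starts only 16 bytes past the base, so the guarantee no longer holds.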

code on the Godbolt compiler explorer

void floydwarshall_array_unconditional(float* mat, size_t n) {
    // try to tell gcc that it doesn't need scalar intro/outro code
    // The vectorized inner loop isn't particularly different without these, but it means less wading through scalar cleanup code (and less bloat if you can use this in your real code).
    // works with gcc6, doesn't work with gcc5.4
    mat = (float*)__builtin_assume_aligned(mat, 32);
    n /= 8;
    n *= 8;         // code is simpler if matrix size is always a multiple of 8 (floats per AVX vector)
    typedef float (*array)[n];
    array m = reinterpret_cast<array>(mat);
    for (size_t k = 0; k < n; ++k) {
        for (size_t i = 0; i < n; ++i) {
            auto v = m[i][k];
            for (size_t j = 0; j < n; ++j) {
                auto val = v + m[k][j];
                m[i][j] = (m[i][j]>val) ? val : m[i][j];   // Marc's suggested change: enables vectorization with unconditional stores.
            }
        }
    }
}

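gcc5.4 fails to avoid the scalar intro/cleanup code around the vectorized part, but gcc6.2 manages it. The vectorized part is essentially the same for both compiler versions.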

## The inner-most loop (with gcc6.2 -march=haswell -O3)
.L5:
    vaddps  ymm0, ymm1, YMMWORD PTR [rsi+rax]
    vminps  ymm0, ymm0, YMMWORD PTR [rdx+rax]     #### Note use of minps and unconditional store, enabled by using the ternary operator instead of if().
    add     r14, 1
    vmovaps YMMWORD PTR [rdx+rax], ymm0
    add     rax, 32
    cmp     r14, r13
    jb      .L5

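The next loop out from that does some integer counter checks (using some setcc stuff), and does vmovss xmm1, DWORD PTR [rax+r10*4] followed by a separate vbroadcastss ymm1, xmm1. Presumably the scalar cleanup it jumps to doesn't need the broadcast, and gcc doesn't know that it would be cheaper overall to just use VBROADCASTSS as a load even when the broadcast part isn't needed.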

Array vs pointer auto-vectorization in gcc
