Question

Why do I get such a huge speedup (x16) from using the __m256 datatype? It processes 8 floats at a time, so I would expect to see only an x8 speedup.

My CPU is a 4-core Devil Canyon i7 (with hyperthreading). The code is compiled with Visual Studio 2017 in release mode, with -O2 optimization turned on.

The fast version takes 0.000151 seconds on a 400x400 matrix:

//make this matrix only keep the signs of its entries
inline void to_signs() {

    __m256 *i = reinterpret_cast<__m256*>(_arrays);
    __m256 *end = reinterpret_cast<__m256*>(_arrays + arraysSize());

    __m256 maskPlus = _mm256_set1_ps(1.f);
    __m256 maskMin =  _mm256_set1_ps(-1.f);

    //process the main portion of the array.  NOTICE: size might not be divisible by 8:
    while(true){
        ++i;
        if(i > end){  break; }

        __m256 *prev_i = i-1;
        *prev_i = _mm256_min_ps(*prev_i, maskPlus);
        *prev_i = _mm256_max_ps(*prev_i, maskMin);
    }

    //process the few remaining numbers, at the end of the array:
    i--;
    for(float *j=(float*)i; j<_arrays+arraysSize(); ++j){
        //taken from here:http://www.musicdsp.org/showone.php?id=249
        // mask sign bit in f, set it in r if necessary:
        float r = 1.0f;
        (int&)r |= ((int&)(*j) & 0x80000000);//according to author, can end up either -1 or 1 if zero.
        *j = r;
    }
}

The older version, which takes 0.002416 seconds:

inline void to_signs_slow() {
    size_t size = arraysSize();

    for (size_t i = 0; i<size; ++i) {
        //taken from here:http://www.musicdsp.org/showone.php?id=249
        // mask sign bit in f, set it in r if necessary:

        float r = 1.0f;
        (int&)r |= ((int&)_arrays[i] & 0x80000000);//according to author, can end up either -1 or 1 if zero.
        _arrays[i] = r;
    }
}

Is it secretly using 2 cores, so that this benefit will disappear once I start using multithreading?

Edit:

On a larger matrix, of size (10e6)x(4e4), I get 3 seconds and 14 seconds on average. So only an x4 speedup, not even x8. This is probably due to memory bandwidth and the data not fitting in cache.

Still, my question is about the pleasant surprise of the x16 speedup :)

Answer 1

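Your scalar version looks horrible (with reference-casting for type-punning), and probably compiles to really inefficient asm that is a lot slower than copying each 32-bit element into the bit pattern for 1.0f. Doing that in scalar code should take only an integer AND and an OR (if MSVC failed to auto-vectorize it for you), but I wouldn't be surprised if the compiler is copying the value to an XMM register or something.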
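As a rough illustration of the "one AND and one OR" scalar form, here is a minimal sketch that does the type-punning with std::memcpy instead of reference casts. It assumes IEEE-754 binary32 floats; the names sign_to_unit and to_signs_scalar are hypothetical and do not appear in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Returns +1.0f or -1.0f carrying the sign bit of x.
// Assumes float is IEEE-754 binary32 (sign bit 0x80000000, 1.0f == 0x3F800000).
inline float sign_to_unit(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);        // well-defined type pun
    bits = (bits & 0x80000000u) | 0x3F800000u;  // keep the sign bit, OR in 1.0f
    float r;
    std::memcpy(&r, &bits, sizeof r);
    return r;
}

// Hypothetical free-function form of the scalar loop from the question.
inline void to_signs_scalar(float* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    for (std::size_t i = 0; i < n; ++i) {
        data[i] = sign_to_unit(data[i]);
    }
}
```

Compilers typically turn the two memcpy calls into plain register moves, so the loop body reduces to the integer AND and OR. Note that -0.0f maps to -1.0f and +0.0f maps to +1.0f, because only the sign bit is inspected.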

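Your first manually vectorized version doesn't even do the same work, though: it just masks off all the non-sign bits, leaving -0.0f or +0.0f. So it will compile to one vandps ymm0, ymm7, [rdi] and one SIMD store with vmovups [rdi], ymm0, plus some loop overhead.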

Not that adding an _mm256_or_ps with set1(1.0f) would have slowed it down: you would still bottleneck on cache bandwidth or on 1-per-clock store throughput.

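Then you edited it to a version that clamps into the -1.0f .. +1.0f range, leaving inputs with magnitude less than 1.0 unmodified. That won't be slower than two bitwise ops, except that Haswell (Devil's Canyon) runs FP booleans only on port 5, whereas actual FP operations run on port 0 or port 1.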
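To make the difference between these variants concrete, the following scalar sketch mirrors what each one does to a single element. The function names are hypothetical, std::copysign stands in for the "sign bit OR 1.0f" result of the scalar loop, and NaN handling (where _mm256_min_ps and std::min can differ) is ignored.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// (1) First hand-vectorized version: AND with the sign mask only.
//     All exponent and mantissa bits are cleared, so the result is +0.0f or -0.0f.
inline float mask_sign_only(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits &= 0x80000000u;
    std::memcpy(&x, &bits, sizeof x);
    return x;
}

// (2) Edited version: min against +1.0f, then max against -1.0f (a clamp).
//     Inputs with magnitude below 1.0 pass through unchanged.
inline float clamp_unit(float x) {
    return std::max(std::min(x, 1.0f), -1.0f);
}

// (3) What the scalar loop computes: +1.0f or -1.0f with the sign of x.
inline float sign_reference(float x) {
    return std::copysign(1.0f, x);
}

// Small self-check showing where the three variants disagree.
inline void demo_differences() {
    assert(mask_sign_only(-3.0f) == 0.0f);        // -0.0f compares equal to 0.0f
    assert(std::signbit(mask_sign_only(-3.0f)));  // but the sign bit is still set
    assert(clamp_unit(0.25f) == 0.25f);           // small input left unchanged
    assert(sign_reference(0.25f) == 1.0f);        // scalar loop would give 1.0f
}
```

Only for inputs with magnitude of at least 1.0 do the clamp and the sign extraction agree, so timing one against the other is not a like-for-like comparison.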

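Especially if you aren't doing anything else with your floats, you will actually want to use the _si256 intrinsics, so that only AVX2 integer instructions operate on them, for more speed on Haswell. (But then your code can't run without AVX2.)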

On Skylake and newer, FP booleans can use all 3 vector ALU ports. (See https://agner.org/optimize/ for instruction tables and the microarchitecture guide.)

Your code should look something like:

// outside the loop if you want
const __m256i ones = _mm256_castps_si256(_mm256_set1_ps(1.0f));

for (something ; p += whatever) {
    __m256i floats = _mm256_load_si256( (const __m256i*)p );
    __m256i signs = _mm256_and_si256(floats,  _mm256_set1_epi32(0x80000000));
    __m256i applied = _mm256_or_si256(signs, ones);
    _mm256_store_si256((__m256i*)p, applied);

}
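The fragment above leaves the loop bounds and the leftover elements open. A fuller sketch might look like the following, under several assumptions: the array is passed as a pointer and a length (the name to_signs_avx2 is hypothetical), its alignment is unknown, so unaligned loads and stores are used instead of _mm256_load_si256, and the sign mask is built from the bit pattern of -0.0f. The target attribute is only an assumption for GCC and Clang builds without -mavx2; MSVC does not need it.

```cpp
#include <immintrin.h>
#include <cassert>
#include <cmath>
#include <cstddef>

#if defined(__GNUC__) || defined(__clang__)
#define TO_SIGNS_AVX2_TARGET __attribute__((target("avx2")))
#else
#define TO_SIGNS_AVX2_TARGET  // MSVC: no per-function attribute required
#endif

// Replaces every element with +1.0f or -1.0f according to its sign bit.
// Requires a CPU with AVX2; the caller is responsible for checking that.
TO_SIGNS_AVX2_TARGET
inline void to_signs_avx2(float* data, std::size_t n) {
    assert(data != nullptr || n == 0);

    const __m256i ones     = _mm256_castps_si256(_mm256_set1_ps(1.0f));
    const __m256i signmask = _mm256_castps_si256(_mm256_set1_ps(-0.0f)); // 0x80000000 per lane

    std::size_t i = 0;
    // Main part: 8 floats per iteration, integer AND + OR only.
    for (; i + 8 <= n; i += 8) {
        __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(data + i));
        v = _mm256_or_si256(_mm256_and_si256(v, signmask), ones);
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(data + i), v);
    }
    // Tail: up to 7 remaining elements, same result computed in scalar code.
    for (; i < n; ++i) {
        data[i] = std::copysign(1.0f, data[i]);
    }
}
```

If the buffer is known to be 32-byte aligned, the aligned load and store from the fragment above can be used instead. Both paths here compute the same result, unlike the mix of clamp and sign extraction in the question's fast version.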

Why does using __m256 instead of 'float' give more than an x8 performance gain?
