Question

The problem can be described as follows.

Input

__m256d a, b, c, d

Output

__m256d s = {a[0]+a[1]+a[2]+a[3], b[0]+b[1]+b[2]+b[3], 
             c[0]+c[1]+c[2]+c[3], d[0]+d[1]+d[2]+d[3]}
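
For reference, the desired output can be written as a plain scalar loop. This is only a sketch to pin down the expected result: the function name `hsum4_reference` and the use of `double[4]` arrays in place of `__m256d` are assumptions made for illustration, not part of the original question.

```c
#include <stddef.h>

/* Scalar reference (hypothetical helper): out[i] receives the sum of the
 * four elements of the i-th input, in the order a, b, c, d. */
void hsum4_reference(const double a[4], const double b[4],
                     const double c[4], const double d[4],
                     double out[4])
{
    const double *rows[4] = { a, b, c, d };
    for (size_t i = 0; i < 4; ++i) {
        /* Left-to-right summation, exactly as written in the problem statement. */
        out[i] = rows[i][0] + rows[i][1] + rows[i][2] + rows[i][3];
    }
}
```

Note that a SIMD version typically computes `(v[0]+v[1]) + (v[2]+v[3])` instead, so for general floating-point inputs the last bits may differ from this left-to-right reference.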

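What I have done so far

It seemed easy enough: two VHADDs with some shuffling in between. In fact, no combination of the permutations offered by AVX seems to produce the one permutation needed to reach that goal. Let me explain: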

VHADD x, a, b => x = {a[0]+a[1], b[0]+b[1], a[2]+a[3], b[2]+b[3]}
VHADD y, c, d => y = {c[0]+c[1], d[0]+d[1], c[2]+c[3], d[2]+d[3]}

If I were able to permute x and y in the same way to obtain

x1 = {a[0]+a[1], a[2]+a[3], c[0]+c[1], c[2]+c[3]}
y1 = {b[0]+b[1], b[2]+b[3], d[0]+d[1], d[2]+d[3]}

then

VHADD s, x1, y1 => s = {a[0]+a[1]+a[2]+a[3], b[0]+b[1]+b[2]+b[3], 
                        c[0]+c[1]+c[2]+c[3], d[0]+d[1]+d[2]+d[3]}

which is the result I want.

So I only need to find out how to perform

x,y => {x[0], x[2], y[0], y[2]}, {x[1], x[3], y[1], y[3]}

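Unfortunately, I came to the conclusion that this is impossible with any combination of VSHUFPD, VBLENDPD, VPERMILPD, VPERM2F128, VUNPCKHPD and VUNPCKLPD. The crux of the matter is that it is impossible to swap u[1] and u[2] in an instance u of __m256d.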

Question

Is this really a dead end? Or have I missed a permutation instruction?

Answer 1

VHADD instructions are meant to be followed by a regular VADD. The following code should give you what you want:

// {a[0]+a[1], b[0]+b[1], a[2]+a[3], b[2]+b[3]}
__m256d sumab = _mm256_hadd_pd(a, b);
// {c[0]+c[1], d[0]+d[1], c[2]+c[3], d[2]+d[3]}
__m256d sumcd = _mm256_hadd_pd(c, d);

// {a[0]+a[1], b[0]+b[1], c[2]+c[3], d[2]+d[3]}
__m256d blend = _mm256_blend_pd(sumab, sumcd, 0b1100);
// {a[2]+a[3], b[2]+b[3], c[0]+c[1], d[0]+d[1]}
__m256d perm = _mm256_permute2f128_pd(sumab, sumcd, 0x21);

__m256d sum =  _mm256_add_pd(perm, blend);

This gives the result in 5 instructions. I hope I got the constants right.
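
To check the constants, the five instructions can be wrapped in a small testable function. The following is a minimal sketch under a few assumptions: the wrapper name `hsum4_avx`, the pointer-based interface with unaligned loads and stores, and the `HSUM_AVX` macro are all illustrative and not part of the answer. The macro uses the GCC/Clang `target("avx")` attribute so the function builds even without `-mavx`; on other compilers, enable AVX through the compiler's own options. The blend mask is written as `0xC` rather than `0b1100` because binary literals are a compiler extension before C23.

```c
#include <immintrin.h>

#if defined(__GNUC__) || defined(__clang__)
#define HSUM_AVX __attribute__((target("avx")))
#else
#define HSUM_AVX
#endif

/* Hypothetical wrapper: each input points to 4 doubles,
 * out receives {sum(a), sum(b), sum(c), sum(d)}. */
HSUM_AVX void hsum4_avx(const double *a, const double *b,
                        const double *c, const double *d, double *out)
{
    __m256d va = _mm256_loadu_pd(a);
    __m256d vb = _mm256_loadu_pd(b);
    __m256d vc = _mm256_loadu_pd(c);
    __m256d vd = _mm256_loadu_pd(d);

    /* {a[0]+a[1], b[0]+b[1], a[2]+a[3], b[2]+b[3]} */
    __m256d sumab = _mm256_hadd_pd(va, vb);
    /* {c[0]+c[1], d[0]+d[1], c[2]+c[3], d[2]+d[3]} */
    __m256d sumcd = _mm256_hadd_pd(vc, vd);

    /* Low half from sumab, high half from sumcd:
     * {a[0]+a[1], b[0]+b[1], c[2]+c[3], d[2]+d[3]} */
    __m256d blend = _mm256_blend_pd(sumab, sumcd, 0xC);
    /* High half of sumab, then low half of sumcd:
     * {a[2]+a[3], b[2]+b[3], c[0]+c[1], d[0]+d[1]} */
    __m256d perm = _mm256_permute2f128_pd(sumab, sumcd, 0x21);

    _mm256_storeu_pd(out, _mm256_add_pd(perm, blend));
}
```

With inputs whose partial sums are exactly representable (small integers, for example), the output should match a scalar sum of each input element for element.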

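The permutation you propose is certainly possible, but it takes several instructions. Sorry that I am not answering that part of your question.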

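Edit: I couldn't resist, so here is the complete permutation. (Again, I did my best to get the constants right.) You can see that swapping u[1] and u[2] is possible; it just takes a bit of work. Crossing the 128-bit boundary is difficult in first-generation AVX. I would also say that VADD is preferable to VHADD, because VADD has twice the throughput even though it performs the same number of additions.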

// {x[0],x[1],x[2],x[3]}
__m256d x;

// {x[1],x[0],x[3],x[2]}
__m256d xswap = _mm256_permute_pd(x, 0b0101);

// {x[3],x[2],x[1],x[0]}
__m256d xflip128 = _mm256_permute2f128_pd(xswap, xswap, 0x01);

// {x[0],x[2],x[1],x[3]} -- not impossible to swap x[1] and x[2]
__m256d xblend = _mm256_blend_pd(x, xflip128, 0b0110);

// repeat the same for y
// {y[0],y[2],y[1],y[3]}
__m256d yblend;

// {x[0],x[2],y[0],y[2]}
__m256d x02y02 = _mm256_permute2f128_pd(xblend, yblend, 0x20);

// {x[1],x[3],y[1],y[3]}
__m256d x13y13 = _mm256_permute2f128_pd(xblend, yblend, 0x31);
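
Putting the permutation together with the two initial VHADDs and the final VHADD gives the scheme from the question. The sketch below is illustrative only: the names `swap_middle` and `hsum4_avx_permute`, the pointer-based interface, and the `HSUM_AVX` macro (GCC/Clang `target("avx")` attribute, so the code builds without `-mavx`) are assumptions, and the masks are written in hexadecimal instead of binary literals.

```c
#include <immintrin.h>

#if defined(__GNUC__) || defined(__clang__)
#define HSUM_AVX __attribute__((target("avx")))
#else
#define HSUM_AVX
#endif

/* Hypothetical helper: {u0,u1,u2,u3} -> {u0,u2,u1,u3}. */
HSUM_AVX static inline __m256d swap_middle(__m256d u)
{
    /* {u1,u0,u3,u2}: swap within each 128-bit half */
    __m256d swapped = _mm256_permute_pd(u, 0x5);
    /* {u3,u2,u1,u0}: exchange the two 128-bit halves */
    __m256d flipped = _mm256_permute2f128_pd(swapped, swapped, 0x01);
    /* take elements 1 and 2 from flipped, elements 0 and 3 from u */
    return _mm256_blend_pd(u, flipped, 0x6);
}

/* Hypothetical wrapper: out receives {sum(a), sum(b), sum(c), sum(d)}. */
HSUM_AVX void hsum4_avx_permute(const double *a, const double *b,
                                const double *c, const double *d,
                                double *out)
{
    /* x = {a01, b01, a23, b23}, y = {c01, d01, c23, d23} */
    __m256d x = _mm256_hadd_pd(_mm256_loadu_pd(a), _mm256_loadu_pd(b));
    __m256d y = _mm256_hadd_pd(_mm256_loadu_pd(c), _mm256_loadu_pd(d));

    __m256d xblend = swap_middle(x);   /* {x0, x2, x1, x3} */
    __m256d yblend = swap_middle(y);   /* {y0, y2, y1, y3} */

    /* x1 = {x0, x2, y0, y2}, y1 = {x1, x3, y1, y3} */
    __m256d x1 = _mm256_permute2f128_pd(xblend, yblend, 0x20);
    __m256d y1 = _mm256_permute2f128_pd(xblend, yblend, 0x31);

    /* {x1[0]+x1[1], y1[0]+y1[1], x1[2]+x1[3], y1[2]+y1[3]} */
    _mm256_storeu_pd(out, _mm256_hadd_pd(x1, y1));
}
```

This route uses three VHADDs plus eight shuffle-type instructions, so the blend, permute and VADD version is the shorter of the two; the sketch mainly shows that the swap itself is expressible in AVX.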

Answer 2

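I'm not aware of any instruction that lets you do that kind of permutation. AVX instructions typically operate such that the upper and lower 128 bits of the register are somewhat independent; there is not much capability for mixing values between the two halves. The best implementation I can think of would be based on the answer to this question: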

__m128d horizontal_add_pd(__m256d x1, __m256d x2)
{
    // calculate 4 two-element horizontal sums:
    // lower 64 bits contain x1[0] + x1[1]
    // next 64 bits contain x2[0] + x2[1]
    // next 64 bits contain x1[2] + x1[3]
    // next 64 bits contain x2[2] + x2[3]
    __m256d sum = _mm256_hadd_pd(x1, x2);
    // extract upper 128 bits of result
    __m128d sum_high = _mm256_extractf128_pd(sum, 1);
    // add upper 128 bits of sum to its lower 128 bits
    // (_mm256_castpd256_pd128 only reinterprets the low half of sum)
    __m128d result = _mm_add_pd(sum_high, _mm256_castpd256_pd128(sum));
    // lower 64 bits of result contain the sum of x1[0], x1[1], x1[2], x1[3]
    // upper 64 bits of result contain the sum of x2[0], x2[1], x2[2], x2[3]
    return result;
}

__m256d a, b, c, d;
__m128d res1 = horizontal_add_pd(a, b);
__m128d res2 = horizontal_add_pd(c, d);
// At this point:
//     res1 contains a's horizontal sum in bits 0-63
//     res1 contains b's horizontal sum in bits 64-127
//     res2 contains c's horizontal sum in bits 0-63
//     res2 contains d's horizontal sum in bits 64-127
// cast res1 to a __m256d, then insert res2 into the upper 128 bits of the result
__m256d sum = _mm256_insertf128_pd(_mm256_castpd128_pd256(res1), res2, 1);
// At this point:
//     sum contains a's horizontal sum in bits 0-63
//     sum contains b's horizontal sum in bits 64-127
//     sum contains c's horizontal sum in bits 128-191
//     sum contains d's horizontal sum in bits 192-255

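Which should be what you want. The above should be doable in 7 instructions in total (the cast shouldn't really do anything; it's just a note to the compiler to change the way it treats the value in res1), assuming that the short horizontal_add_pd() function can be inlined by your compiler and that you have enough registers available.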

4 horizontal double-precision sums with AVX
