Fastest way to unpack 8-bit values from 32-bit values (__m256i) into __m256 using AVX2

Time: 2017-08-10 16:09:43

Tags: c++ performance simd avx2

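I have an array named A that contains 32 unsigned char values.

I want to unpack these values into 4 __m256 variables according to the following rule. Assuming the values of A are indexed from 0 to 31, the 4 unpacked variables would hold these values: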

B_0 = A[0], A[4],  A[8], A[12], A[16], A[20], A[24], A[28]
B_1 = A[1], A[5],  A[9], A[13], A[17], A[21], A[25], A[29]
B_2 = A[2], A[6], A[10], A[14], A[18], A[22], A[26], A[30]
B_3 = A[3], A[7], A[11], A[15], A[19], A[23], A[27], A[31]
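
The rule above can be written as a scalar reference, which is handy for checking any SIMD version against. This is only a minimal sketch: the function name `unpack_reference` and the use of `std::array` are assumptions made for illustration, not part of the original code.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Scalar reference for the mapping B_k[j] = A[4*j + k], converted to float.
// Hypothetical helper, intended only as a baseline for testing.
std::array<std::array<float, 8>, 4>
unpack_reference(const std::array<std::uint8_t, 32>& a) {
    std::array<std::array<float, 8>, 4> b{};
    for (std::size_t j = 0; j < 8; ++j) {        // position inside each B_k
        for (std::size_t k = 0; k < 4; ++k) {    // which output variable
            b[k][j] = static_cast<float>(a[4 * j + k]);
        }
    }
    return b;
}
```

Here `b[1][2]` corresponds to A[9], matching the third entry of B_1 in the table.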

For that, I have this code:

const auto mask = _mm256_set1_epi32( 0x000000FF );
...
const auto A_values = _mm256_i32gather_epi32(reinterpret_cast<const int*>(A.data()), A_positions.values_, 4);

// The code below is equivalent to B_0 = static_cast<float>((A_values >> 24) & 0x000000FF)
const auto B_0 = _mm256_cvtepi32_ps(_mm256_and_si256(_mm256_srai_epi32(A_values, 24), mask));
const auto B_1 = _mm256_cvtepi32_ps(_mm256_and_si256(_mm256_srai_epi32(A_values, 16), mask));
const auto B_2 = _mm256_cvtepi32_ps(_mm256_and_si256(_mm256_srai_epi32(A_values, 8), mask));
const auto B_3 = _mm256_cvtepi32_ps(_mm256_and_si256(_mm256_srai_epi32(A_values, 0), mask));

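This works fine, but I would like to know whether there is a faster way to do it, especially regarding the right shift and the AND operation I use to extract the values.

Also, to clarify: I said that array A has a size of 32, but that is not actually the case. The array contains many more values, and I need to access its elements from different positions (but always in blocks of 4 uint8_t), which is why I use _mm256_i32gather_epi32 to retrieve these values. For simplicity, I just restricted the array size in this example.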
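
One direction that might be worth measuring is replacing each shift-plus-AND pair with a single `_mm256_shuffle_epi8`, which can move one byte of every 32-bit element into the low position and zero the other three in one instruction. The sketch below is an illustration under several assumptions: the names `unpack_block`, `byte_selector` and `AVX2_TARGET` are hypothetical, the 32 bytes are loaded contiguously with `_mm256_loadu_si256` instead of a gather, and the byte order follows the table (B_k takes byte k of each 4-byte block in memory order). On little-endian x86 byte k occupies bits 8k to 8k+7 of the 32-bit value, so A[4j] is the lowest byte. Whether this is actually faster depends on the CPU and the surrounding code, so it would need benchmarking.

```cpp
#include <immintrin.h>
#include <cstdint>

// GCC/Clang only: lets these functions use AVX2 even if the file is not
// built with -mavx2. With MSVC the macro is empty; build with /arch:AVX2.
#if defined(__GNUC__)
#define AVX2_TARGET __attribute__((target("avx2")))
#else
#define AVX2_TARGET
#endif

// Hypothetical helper: shuffle control that copies byte k (0..3) of every
// 32-bit element into its low byte and zeroes the upper three bytes.
// A control byte with the top bit set (0x80) makes the shuffle write zero.
AVX2_TARGET static inline __m256i byte_selector(int k) {
    // 0x80808000 as a 32-bit pattern; the cast wraps on mainstream compilers.
    const int base = static_cast<int>(0x80808000u) + k;
    // The shuffle works per 128-bit lane, so both lanes use indices 0..15.
    return _mm256_setr_epi32(base, base + 4, base + 8, base + 12,
                             base, base + 4, base + 8, base + 12);
}

// Unpacks 32 contiguous bytes into four groups of 8 floats:
// bk[j] = src[4*j + k].
AVX2_TARGET void unpack_block(const std::uint8_t* src,
                              float* b0, float* b1, float* b2, float* b3) {
    const __m256i v =
        _mm256_loadu_si256(reinterpret_cast<const __m256i*>(src));
    _mm256_storeu_ps(b0, _mm256_cvtepi32_ps(_mm256_shuffle_epi8(v, byte_selector(0))));
    _mm256_storeu_ps(b1, _mm256_cvtepi32_ps(_mm256_shuffle_epi8(v, byte_selector(1))));
    _mm256_storeu_ps(b2, _mm256_cvtepi32_ps(_mm256_shuffle_epi8(v, byte_selector(2))));
    _mm256_storeu_ps(b3, _mm256_cvtepi32_ps(_mm256_shuffle_epi8(v, byte_selector(3))));
}
```

Independently of the shuffle idea, two smaller tweaks apply to the shift-and-mask version: a logical shift (`_mm256_srli_epi32`) by 24 already leaves only the top byte, so that case needs no AND, and a shift by 0 is a no-op, so the lowest byte needs only the AND.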

0 answers:

No answers