Question

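What I want to do is:

Multiply the input floating-point numbers by a fixed factor.
Convert them to 8-bit signed chars.

Note that most inputs have a small absolute range, e.g. [-6, 6], so that the fixed factor can map them to [-127, 127].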
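As a baseline to compare any SIMD version against, the two steps can be written in plain scalar C. This is only a sketch: `quantize_scalar` is a hypothetical name, it clamps to [-128, 127] the way signed saturation does, and it rounds halves away from zero, whereas `_mm256_cvtps_epi32` rounds ties to even under the default rounding mode, so exact ties may differ by one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference: scale, clamp to the int8_t range, round. */
static void quantize_scalar(const float *in, int8_t *out, size_t n, float quant_mult) {
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i) {
        float scaled = in[i] * quant_mult;
        if (scaled > 127.0f) scaled = 127.0f;     /* clamp like signed saturation */
        if (scaled < -128.0f) scaled = -128.0f;
        /* Round half away from zero (assumption: ties are rare enough not to matter). */
        out[i] = (int8_t)(scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f);
    }
}
```

With the range from the question, the factor would be `127.0f / 6.0f`, so 6.0 maps to 127 and -6.0 maps to -127.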

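I am working with the AVX2 instruction set only, so intrinsics like _mm256_cvtepi32_epi8 can't be used. I would like to use _mm256_packs_epi16, but it mixes the two inputs together. :(

I have also written some code that converts 32-bit floats to 16-bit ints, and it works exactly as I want.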

#include <assert.h>
#include <immintrin.h>

void Quantize(const float* input, __m256i* output, float quant_mult, int num_rows, int width) {
  // input is actually a matrix; num_rows and width are its number of rows and columns
  assert(width % 16 == 0);

  int num_input_chunks = width / 16;

  // Broadcast the factor to all 8 float lanes.
  __m256 avx2_quant_mult = _mm256_set1_ps(quant_mult);

  for (int i = 0; i < num_rows; ++i) {
    const float* input_row = input + i * width;
    __m256i* output_row = output + i * num_input_chunks;
    for (int j = 0; j < num_input_chunks; ++j) {
      const float* x = input_row + j * 16;
      // Process 16 floats at once, since each __m256i can contain 16 16-bit integers.

      __m256 f_0 = _mm256_loadu_ps(x);
      __m256 f_1 = _mm256_loadu_ps(x + 8);

      __m256 m_0 = _mm256_mul_ps(f_0, avx2_quant_mult);
      __m256 m_1 = _mm256_mul_ps(f_1, avx2_quant_mult);

      __m256i i_0 = _mm256_cvtps_epi32(m_0);
      __m256i i_1 = _mm256_cvtps_epi32(m_1);

      *(output_row + j) = _mm256_packs_epi32(i_0, i_1);
    }
  }
}

Any help is welcome, thank you very much!

Answer 1

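For good throughput with multiple source vectors, it's a good thing that _mm256_packs_epi16 takes 2 input vectors instead of producing a narrower output. (AVX512 _mm256_cvtepi32_epi8 isn't necessarily the most efficient way to do things, because the version with a memory destination decodes to multiple uops, and the regular version gives you multiple small outputs that need to be stored separately.)

Or are you complaining about how it operates in-lane? Yes, that's annoying, but _mm256_packs_epi32 does the same thing. If it's OK for your output to have interleaved groups of data, do the same thing here too.

Your best bet is to combine 4 vectors down to 1, in 2 steps of in-lane packing (because there is no lane-crossing pack), then use one lane-crossing shuffle to fix up the order.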

#include <immintrin.h>
// loads 128 bytes = 32 floats
// converts and packs with signed saturation to 32 int8_t
__m256i pack_float_int8(const float*p) {
    __m256i a = _mm256_cvtps_epi32(_mm256_loadu_ps(p));
    __m256i b = _mm256_cvtps_epi32(_mm256_loadu_ps(p+8));
    __m256i c = _mm256_cvtps_epi32(_mm256_loadu_ps(p+16));
    __m256i d = _mm256_cvtps_epi32(_mm256_loadu_ps(p+24));
    __m256i ab = _mm256_packs_epi32(a,b);        // 16x int16_t
    __m256i cd = _mm256_packs_epi32(c,d);
    __m256i abcd = _mm256_packs_epi16(ab, cd);   // 32x int8_t
    // packed to one vector, but in [ a_lo, b_lo, c_lo, d_lo | a_hi, b_hi, c_hi, d_hi ] order
    // if you can deal with that in-memory format (e.g. for later in-lane unpack), great, you're done

    // but if you need sequential order, then vpermd:
    __m256i lanefix = _mm256_permutevar8x32_epi32(abcd, _mm256_setr_epi32(0,4, 1,5, 2,6, 3,7));
    return lanefix;
}

(Compiles nicely on the Godbolt compiler explorer.)
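To see why the vpermd indices are (0,4, 1,5, 2,6, 3,7), it can help to track only where each source element ends up. The following scalar model is an illustration, not part of the answer's code: `pack_in_lane` and `packed_order` are hypothetical helpers, and they move element tags 0..31 around instead of doing any real conversion or saturation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Models the lane behaviour of a 256-bit pack: each 128-bit lane of the result
   holds that lane's elements of x, followed by that lane's elements of y. */
static void pack_in_lane(const int8_t *x, const int8_t *y, int8_t *out, int elems_per_lane) {
    assert(elems_per_lane == 4 || elems_per_lane == 8);
    for (int lane = 0; lane < 2; ++lane) {
        int8_t *dst = out + lane * 2 * elems_per_lane;
        memcpy(dst, x + lane * elems_per_lane, (size_t)elems_per_lane);
        memcpy(dst + elems_per_lane, y + lane * elems_per_lane, (size_t)elems_per_lane);
    }
}

/* Fills order[32] with the source index that lands at each output byte. */
static void packed_order(int8_t order[32], int apply_vpermd) {
    static const int idx[8] = {0, 4, 1, 5, 2, 6, 3, 7};
    int8_t src[32], ab[16], cd[16], abcd[32];
    for (int i = 0; i < 32; ++i) src[i] = (int8_t)i;
    pack_in_lane(src, src + 8, ab, 4);        /* models _mm256_packs_epi32(a, b) */
    pack_in_lane(src + 16, src + 24, cd, 4);  /* models _mm256_packs_epi32(c, d) */
    pack_in_lane(ab, cd, abcd, 8);            /* models _mm256_packs_epi16(ab, cd) */
    for (int dw = 0; dw < 8; ++dw) {          /* models the 32-bit element permute */
        int from = apply_vpermd ? idx[dw] : dw;
        memcpy(order + 4 * dw, abcd + 4 * from, 4);
    }
}
```

Without the permute, bytes 4..7 of the result hold source elements 8..11 (the low half of b); with it, the 32 outputs come out in sequential order.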

Call this function in a loop and _mm256_store_si256 the resulting vector.
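Such a loop might look like the sketch below, which also folds in the multiplication by the fixed factor from the question. The name `quantize_to_int8` and the requirement that `n` be a multiple of 32 are assumptions made for illustration. It uses `_mm256_storeu_si256` so the output need not be 32-byte aligned; with an aligned buffer, `_mm256_store_si256` works as well. The `target` attribute is GCC/Clang-specific and is unnecessary when compiling with `-mavx2`.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wrapper: scale n floats and write n int8_t with signed saturation. */
__attribute__((target("avx2")))
static void quantize_to_int8(const float *in, int8_t *out, size_t n, float quant_mult) {
    assert(n % 32 == 0);  /* assumption: no tail handling in this sketch */
    const __m256 mult = _mm256_set1_ps(quant_mult);
    const __m256i fix = _mm256_setr_epi32(0, 4, 1, 5, 2, 6, 3, 7);
    for (size_t i = 0; i < n; i += 32) {
        __m256i a = _mm256_cvtps_epi32(_mm256_mul_ps(_mm256_loadu_ps(in + i), mult));
        __m256i b = _mm256_cvtps_epi32(_mm256_mul_ps(_mm256_loadu_ps(in + i + 8), mult));
        __m256i c = _mm256_cvtps_epi32(_mm256_mul_ps(_mm256_loadu_ps(in + i + 16), mult));
        __m256i d = _mm256_cvtps_epi32(_mm256_mul_ps(_mm256_loadu_ps(in + i + 24), mult));
        __m256i ab = _mm256_packs_epi32(a, b);            /* 16x int16_t, in-lane */
        __m256i cd = _mm256_packs_epi32(c, d);
        __m256i abcd = _mm256_packs_epi16(ab, cd);        /* 32x int8_t, in-lane */
        abcd = _mm256_permutevar8x32_epi32(abcd, fix);    /* restore sequential order */
        _mm256_storeu_si256((__m256i *)(out + i), abcd);
    }
}
```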

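(For a uint8_t unsigned destination, use _mm256_packus_epi16 for the 16->8 step and keep everything else the same. We still use signed 32->16 packing, because the 16->u8 vpackuswb pack still treats its epi16 input as signed. You need -1 to be treated as -1, not +0xFFFF, for unsigned saturation to clamp it to 0.)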
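The saturation chain for the unsigned case can be checked one element at a time with a scalar model. This is an illustrative sketch with hypothetical helper names; it mirrors the per-element behaviour of vpackssdw followed by vpackuswb.

```c
#include <assert.h>
#include <stdint.h>

/* One element of vpackssdw: signed 32-bit -> signed 16-bit with saturation. */
static int16_t sat_i32_to_i16(int32_t v) {
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* One element of vpackuswb: the input is read as SIGNED 16-bit,
   the output is unsigned 8-bit with unsigned saturation. */
static uint8_t sat_i16_to_u8(int16_t v) {
    if (v < 0) return 0;
    if (v > 255) return 255;
    return (uint8_t)v;
}

/* Hypothetical helper for the whole 32 -> 16 -> u8 chain. */
static uint8_t quantized_to_u8(int32_t v) {
    return sat_i16_to_u8(sat_i32_to_i16(v));
}

/* Self-check: negative values clamp to 0 because the sign survives the first pack. */
static void check_unsigned_chain(void) {
    assert(quantized_to_u8(-1) == 0);
    assert(quantized_to_u8(300) == 255);
}
```

Because the first pack keeps -1 as a signed 16-bit -1, the second step clamps it to 0 rather than to 255.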

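With 4 shuffles in total per 256-bit store (3 packs plus vpermd), the 1-shuffle-per-clock throughput will be the bottleneck on Intel CPUs. You should get a throughput of one float vector per clock, bottlenecked on port 5. (https://agner.org/optimize/). Or it may bottleneck on memory bandwidth if the data isn't hot in L2.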

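If you only have a single vector to do, you could consider using _mm256_shuffle_epi8 to put the low byte of each epi32 element into the low 32 bits of each lane, then _mm256_permutevar8x32_epi32 for the lane crossing.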
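One way the single-vector idea could look is sketched below, assuming a byte shuffle (`_mm256_shuffle_epi8`) for the in-lane step and `_mm256_permutevar8x32_epi32` for the lane crossing. The name `pack8_truncate` is hypothetical. Note that taking the low byte truncates instead of saturating, so this assumes the converted values already fit in [-128, 127]. The `target` attribute is GCC/Clang-specific.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert 8 floats and keep the low byte of each int32. */
__attribute__((target("avx2")))
static void pack8_truncate(const float *p, int8_t *out) {
    assert(p != NULL && out != NULL);
    __m256i v = _mm256_cvtps_epi32(_mm256_loadu_ps(p));
    /* Per 128-bit lane: gather byte 0 of each dword into the lane's low 4 bytes.
       A control byte of -1 zeroes the corresponding output byte. */
    const __m256i bytes = _mm256_setr_epi8(
        0, 4, 8, 12, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        0, 4, 8, 12, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1);
    __m256i lanes = _mm256_shuffle_epi8(v, bytes);
    /* Move dword 0 and dword 4 side by side into the low 64 bits. */
    __m256i packed = _mm256_permutevar8x32_epi32(
        lanes, _mm256_setr_epi32(0, 4, 0, 0, 0, 0, 0, 0));
    _mm_storel_epi64((__m128i *)out, _mm256_castsi256_si128(packed));  /* 8-byte store */
}
```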

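Another single-vector alternative (good on Ryzen) is extracti128 + 128-bit packssdw + packsswb. But that's still only good if you are doing just a single vector. (Still, on Ryzen you will want to work in 128-bit vectors to avoid extra lane-crossing shuffles, because Ryzen splits every 256-bit instruction into (at least) 2 128-bit uops.)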
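The extracti128 route could be sketched as follows. The name `pack8_saturate` is hypothetical, and the `target` attribute is GCC/Clang-specific. Unlike a byte shuffle, both pack steps saturate, so out-of-range values clamp to [-128, 127].

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert 8 floats to 8 int8_t with signed saturation. */
__attribute__((target("avx2")))
static void pack8_saturate(const float *p, int8_t *out) {
    assert(p != NULL && out != NULL);
    __m256i v = _mm256_cvtps_epi32(_mm256_loadu_ps(p));
    __m128i lo = _mm256_castsi256_si128(v);        /* low 4 dwords, no instruction needed */
    __m128i hi = _mm256_extracti128_si256(v, 1);   /* vextracti128: high 4 dwords */
    __m128i w = _mm_packs_epi32(lo, hi);           /* packssdw: 8x int16_t, already in order */
    __m128i b = _mm_packs_epi16(w, w);             /* packsswb: low 8 bytes hold the result */
    _mm_storel_epi64((__m128i *)out, b);           /* 8-byte store */
}
```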

How can I convert 32-bit floats to 8-bit signed chars?

2 answers: