Question

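Using SSE intrinsics, I have a vector of four 32-bit floats that has been clamped to the range 0-255 and rounded to the nearest integer. I'd now like to write those four out as bytes.

There is an intrinsic, _mm_cvtps_pi8, that converts 32-bit floats to 8-bit signed ints, but the problem is that any value above 127 gets clamped to 127. I can't find any instruction that clamps to unsigned 8-bit values.

I have a hunch that what I want is some combination of _mm_cvtps_pi16 and _mm_shuffle_pi8, followed by a move instruction to get the four bytes I care about into memory. Is that the best way to do it? I'm going to see if I can figure out how to encode the shuffle control mask.

UPDATE: The following appears to do exactly what I want. Is there a better way?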

#include <tmmintrin.h>
#include <stdio.h>
#include <string.h>

unsigned char out[8];
unsigned char shuf[8] = { 0, 2, 4, 6, 128, 128, 128, 128 };
float ins[4] = {500, 0, 120, 240};

int main()
{
    __m128 x = _mm_loadu_ps(ins);        // Load the floats (unaligned: ins has no guaranteed 16-byte alignment)
    __m64 y = _mm_cvtps_pi16(x);         // Convert them to 16-bit ints
    __m64 sh;
    memcpy(&sh, shuf, sizeof sh);        // Get the shuffle mask into a register
    y = _mm_shuffle_pi8(y, sh);          // Shuffle the lower byte of each into the first four bytes
    int packed = _mm_cvtsi64_si32(y);    // Take the lower 32 bits
    memcpy(out, &packed, sizeof packed); // Store them without a pointer-aliasing cast
    _mm_empty();                         // Clear MMX state before calling other code

    printf("%d\n", out[0]);
    printf("%d\n", out[1]);
    printf("%d\n", out[2]);
    printf("%d\n", out[3]);
    return 0;
}

UPDATE 2: Based on harold's answer, here is a better solution:

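#include <emmintrin.h>
#include <stdio.h>
#include <string.h>

unsigned char out[8];
float ins[4] = {10.4, 10.6, 120, 100000};

int main()
{
    __m128 x = _mm_loadu_ps(ins);        // Load the floats (unaligned: ins has no guaranteed 16-byte alignment)
    __m128i y = _mm_cvtps_epi32(x);      // Convert them to 32-bit ints
    y = _mm_packs_epi32(y, y);           // Pack down to 16 bits with signed saturation (packus_epi32 would turn 100000 into 0 in the next step)
    y = _mm_packus_epi16(y, y);          // Pack down to 8 bits with unsigned saturation
    int packed = _mm_cvtsi128_si32(y);   // Take the lower 32 bits
    memcpy(out, &packed, sizeof packed); // Store them without a pointer-aliasing cast

    printf("%d\n", out[0]);
    printf("%d\n", out[1]);
    printf("%d\n", out[2]);
    printf("%d\n", out[3]);
    return 0;
}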

Answer 1

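There is no direct conversion from float to byte; _mm_cvtps_pi8 is a composite. _mm_cvtps_pi16 is also a composite, and in this case it just does some pointless work that you then undo with the shuffle. They also return annoying __m64s.

Anyway, we can convert to dwords (signed, but that doesn't matter), and then pack (unsigned) or shuffle them into bytes. _mm_shuffle_(e)pi8 generates a pshufb, which Core2 45nm and AMD processors aren't too fond of, and you have to get a mask from somewhere.

Either way, you don't have to round to the nearest integer first; the conversion does that. At least, it does if you haven't messed with the rounding mode.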
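To illustrate the rounding point, here is a minimal sketch that only needs SSE2. The helper name floats_to_int32 is made up for this example, and it assumes the MXCSR rounding mode is still the default (round to nearest, ties to even):

```c
#include <emmintrin.h>  /* SSE2: _mm_cvtps_epi32 */
#include <stdint.h>

/* Hypothetical helper: convert 4 floats to 4 int32_t.
   The conversion rounds according to the current MXCSR rounding mode,
   which is round-to-nearest-even unless something changed it. */
void floats_to_int32(const float in[4], int32_t out[4])
{
    __m128  x = _mm_loadu_ps(in);         /* unaligned load of 4 floats */
    __m128i y = _mm_cvtps_epi32(x);       /* cvtps2dq: float -> int32 with rounding */
    _mm_storeu_si128((__m128i *)out, y);  /* unaligned store of 4 int32 */
}
```

With the default mode, 10.4 becomes 10 and 10.6 becomes 11, so a separate rounding step before the conversion should not be needed.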

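Using packs 1: (not tested). Probably not useful: packusdw already outputs unsigned words, but packuswb then wants signed words again. Kept as is because it is referred to elsewhere.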

cvtps2dq xmm0, xmm0  
packusdw xmm0, xmm0     ; unsafe: saturates to a different range than packuswb accepts
packuswb xmm0, xmm0
movd somewhere, xmm0

Using a different first pack:

cvtps2dq xmm0, xmm0  
packssdw xmm0, xmm0     ; correct: signed saturation on first step to feed packuswb
packuswb xmm0, xmm0
movd somewhere, xmm0

Using a shuffle: (not tested)

cvtps2dq xmm0, xmm0
pshufb xmm0, [shufmask]
movd somewhere, xmm0

shufmask: db 0, 4, 8, 12, 80h, 80h, 80h, 80h, 80h, 80h, 80h, 80h, 80h, 80h, 80h, 80h

Answer 2

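We can solve the unsigned clamping problem by doing the first stage of packing with signed saturation. [0-255] fits in a signed 16-bit int, so values in that range stay unclamped. Values outside that range stay on the same side of it. The signed 16 -> unsigned 8 step therefore clamps them correctly.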

;; SSE2: good for arrays of inputs
cvtps2dq xmm0, [rsi]      ; 4 floats
cvtps2dq xmm1, [rsi+16]   ; 4 more floats
packssdw xmm0, xmm1       ; 8 int16_t

cvtps2dq xmm1, [rsi+32]
cvtps2dq xmm2, [rsi+48]
packssdw xmm1, xmm2       ; 8 more int16_t
                          ; signed because that's how packuswb treats its input
packuswb xmm0, xmm1       ; 16 uint8_t
movdqa   [rdi], xmm0

This only requires SSE2, not SSE4.1 for packusdw.
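For anyone working with intrinsics rather than asm, the same sequence might look roughly like this. It is an untuned sketch: the function name floats16_to_u8 is made up, and it uses unaligned loads and stores so it makes no assumption about how the buffers are aligned (the asm above uses movdqa, which does):

```c
#include <emmintrin.h>  /* SSE2 only */
#include <stdint.h>

/* Hypothetical helper: convert 16 floats to 16 uint8_t, saturating to [0, 255]. */
void floats16_to_u8(const float *in, uint8_t *out)
{
    /* cvtps2dq: four vectors of 4 x int32 */
    __m128i a = _mm_cvtps_epi32(_mm_loadu_ps(in));
    __m128i b = _mm_cvtps_epi32(_mm_loadu_ps(in + 4));
    __m128i c = _mm_cvtps_epi32(_mm_loadu_ps(in + 8));
    __m128i d = _mm_cvtps_epi32(_mm_loadu_ps(in + 12));

    /* packssdw: signed saturation, because packuswb reads its input as signed */
    __m128i ab = _mm_packs_epi32(a, b);    /* 8 x int16_t */
    __m128i cd = _mm_packs_epi32(c, d);    /* 8 more int16_t */

    /* packuswb: unsigned saturation down to 16 x uint8_t */
    __m128i r = _mm_packus_epi16(ab, cd);
    _mm_storeu_si128((__m128i *)out, r);
}
```

Element order is preserved, so out[i] corresponds to in[i].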

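I assume this is why SSE2 only included a signed pack from dword to word, but both signed and unsigned packs from word to byte. packusdw is only useful if your final goal is uint16_t, rather than further packing. (Otherwise you'd need to mask off the sign bit before feeding the result to a further pack.)

If you did use packusdw -> packuswb, you'd get bogus results whenever the first step saturated to a uint16_t > 0x7FFF. packuswb would interpret that as a negative int16_t and saturate it to 0. packssdw would saturate such inputs to 0x7FFF, the maximum int16_t.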
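A small way to see this without needing SSE4.1: the sketch below applies only the final packuswb step to a hand-picked 16-bit pattern. The helper name pack_word_to_u8 is made up; 0xFFFF stands in for what packusdw would produce for an input like 100000, and 0x7FFF for what packssdw would produce for the same input.

```c
#include <emmintrin.h>  /* SSE2 only */
#include <stdint.h>

/* Hypothetical helper: run the word -> byte step (packuswb) on one
   16-bit pattern and return the resulting byte. */
uint8_t pack_word_to_u8(uint16_t word)
{
    __m128i w = _mm_set1_epi16((short)word);  /* same bit pattern in all 8 words */
    __m128i b = _mm_packus_epi16(w, w);       /* packuswb reads each word as int16_t */
    return (uint8_t)_mm_cvtsi128_si32(b);     /* keep the lowest byte */
}
```

pack_word_to_u8(0xFFFF) gives 0, because packuswb sees -1, while pack_word_to_u8(0x7FFF) gives 255, which is the clamped value we actually wanted.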

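(If your 32-bit inputs are always <= 0x7FFF, you can use either one, but SSE4.1 packusdw takes more instruction bytes than SSE2 packssdw, and never runs faster.)

If your source values can't be negative, and you only have one vector of 4 floats rather than many, you can use harold's pshufb idea. If they can be negative, you need to clamp negative values to zero instead of truncating them by shuffling the low bytes into place.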
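To make the difference concrete, here is a scalar model of what happens to a single 32-bit lane. It is not SIMD code, and the names low_byte and clamp_u8 are made up for illustration:

```c
#include <stdint.h>

/* Models a low-byte shuffle: keeps bits 0..7 and drops the rest. */
uint8_t low_byte(int32_t v)
{
    return (uint8_t)((uint32_t)v & 0xFFu);
}

/* Models the packssdw + packuswb sequence: saturates to [0, 255]. */
uint8_t clamp_u8(int32_t v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}
```

For -3, low_byte gives 253 while clamp_u8 gives 0. For 300, low_byte gives 44 while clamp_u8 gives 255. The two only agree for values that are already in 0-255.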

Using

;; SSE4.1, good for a single vector.  Use the PACK version above for arrays
cvtps2dq   xmm0, xmm0
pmaxsd     xmm0, zeroed-register
pshufb     xmm0, [mask]
movd       [somewhere], xmm0

may be slightly more efficient than using two pack instructions, because pmax can run on port 1 or 5 (Intel Haswell). cvtps2dq is port 1 only; pshufb and pack* are port 5 only.

SSE intrinsics: convert 32-bit floats to UNSIGNED 8-bit integers
