Question

This question is related to: Optimal uint8_t bitmap into a 8 x 32bit SIMD "bool" vector

I would like to create an optimal function with this signature:

__m256i PackLeft(__m256i inputVector, __m256i boolVector);

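The desired behaviour, with the input treated as 64-bit ints, is as follows:

inputVector = {42, 17, 13, 3}

boolVector = {true, false, true, false}

It masks out every value whose entry in boolVector is false, then repacks the remaining values to the left. For the input above, the return value should be:

{42, 13, X, X}

...where X is "I don't care".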
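To pin down the intended semantics, here is a minimal scalar sketch in C. The name pack_left_ref, the array-based interface, and the choice to treat any non-zero mask element as "true" are assumptions made for illustration; they are not part of the question.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference for the behaviour described above.
 * A lane counts as "true" when its mask element is non-zero (an assumption).
 * Returns the number of lanes kept. Lanes out[kept..3] are zeroed here,
 * although the question treats them as "don't care". */
size_t pack_left_ref(const int64_t in[4], const int64_t mask[4], int64_t out[4])
{
    size_t kept = 0;

    /* Copy the selected lanes to the front, preserving their order. */
    for (size_t i = 0; i < 4; ++i) {
        if (mask[i] != 0) {
            out[kept++] = in[i];
        }
    }

    /* Fill the remaining "don't care" lanes with a fixed value. */
    for (size_t i = kept; i < 4; ++i) {
        out[i] = 0;
    }
    return kept;
}
```

Calling it with in = {42, 17, 13, 3} and mask = {-1, 0, -1, 0} keeps two lanes and leaves 42 and 13 in the first two output slots. A SIMD version only has to match this on the kept lanes.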

An obvious way to do this is to use _mm_movemask_epi8 to get an integer bitmask out of the bool vector, look up a shuffle mask in a table, and then shuffle with that mask.

However, I would like to avoid a lookup table if possible. Are there faster solutions?
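For reference, the table-based approach mentioned in the question can be modelled in portable C roughly as follows. This is a scalar stand-in, not an intrinsics implementation: the names kPackLeftLanes, sign_mask4 and pack_left_lut are hypothetical, and it assumes a "true" lane has its sign bit set. On AVX2 hardware the mask extraction would presumably map to something like _mm256_movemask_pd (on the vector cast to double) and the final loop to a permute such as _mm256_permutevar8x32_epi32, but that mapping is an assumption here rather than something stated in the question.

```c
#include <stdint.h>

/* Row m lists, in order, the source lanes to keep for the 4-bit mask m
 * (bit i of m corresponds to lane i). Unspecified slots default to 0;
 * they only feed output lanes the question calls "don't care". */
static const uint8_t kPackLeftLanes[16][4] = {
    {0},       {0},       {1},       {0, 1},
    {2},       {0, 2},    {1, 2},    {0, 1, 2},
    {3},       {0, 3},    {1, 3},    {0, 1, 3},
    {2, 3},    {0, 2, 3}, {1, 2, 3}, {0, 1, 2, 3}
};

/* Scalar stand-in for a movemask: bit i is the sign bit of mask[i]. */
static unsigned sign_mask4(const int64_t mask[4])
{
    unsigned m = 0;
    for (unsigned i = 0; i < 4; ++i) {
        m |= (unsigned)((uint64_t)mask[i] >> 63) << i;
    }
    return m;
}

/* Scalar stand-in for the shuffle: out[i] = in[lanes[i]]. */
void pack_left_lut(const int64_t in[4], const int64_t mask[4], int64_t out[4])
{
    const uint8_t *lanes = kPackLeftLanes[sign_mask4(mask)];
    int64_t tmp[4];

    /* Gather through a temporary so that out may alias in. */
    for (unsigned i = 0; i < 4; ++i) {
        tmp[i] = in[lanes[i]];
    }
    for (unsigned i = 0; i < 4; ++i) {
        out[i] = tmp[i];
    }
}
```

With four 64-bit lanes the table has only 16 rows, so its footprint is small. Whether the lookup costs more than a table-free alternative would need measuring on the target CPU.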

Answer 1

Andreas Fredriksson covers this well in his 2015 GDC talk: https://deplinenoise.files.wordpress.com/2015/03/gdc2015_afredriksson_simd.pdf

Starting at slide 104, he shows how to do this using only SSSE3, and then using only SSE2.

Answer 2

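I just saw this question. Perhaps you have already solved it, but I am writing up the logic anyway for other programmers who may need to handle this situation.

The solution (in Intel ASM syntax) is given below. It consists of three steps:

Step 0: Convert the 8-bit mask into a 64-bit mask, so that each set bit in the original mask becomes 8 set bits in the expanded mask.

Step 1: Use this expanded mask to extract the relevant bits from the source data.

Step 2: Since you need the data packed to the left, shift the output by the appropriate number of bits.

The code is as follows: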

; Step 0 : convert the 8 bit mask into a 64 bit mask
    xor     r8,r8
    movzx   rax,byte ptr mask_pattern
    mov     r9,rax  ; save a copy of the mask - avoids a memory read in Step 2
    mov     rcx,8   ; number of bits in the mask
outer_loop :
    shr     al,1    ; get the least significant bit of the mask into CF
    setnc   dl      ; set DL to 0 if CF=1, else 1
    dec     dl      ; if the mask bit was 1, DL becomes 0FFh, else 00h
    shrd    r8,rdx,8 ; shift that byte into the top of R8
    loop    outer_loop
; R8 now holds the mask expanded to one byte (0FFh or 00h) per mask bit
; Step 1 : extract the selected bytes, compressed toward the low-order end
    mov     rax,qword ptr data_pattern
    pext    rax,rax,r8
; Step 2 : shift left so that the packed bytes end up at the high-order end
    popcnt  r9,r9   ; get the count of bits set in the mask
    mov     rcx,8
    sub     cl,r9b  ; compute 8-(count of bits set to 1 in the mask) = unused bytes
    shl     cl,3    ; convert the count of bytes to a count of bits
    shl     rax,cl
; The required data is in RAX
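For readers who prefer C, the same three steps can be sketched portably as below. The function names (expand_mask8, pext64_soft, pack_bytes_high) are hypothetical, and pext64_soft is a slow bit-by-bit stand-in for the BMI2 PEXT instruction; on hardware with BMI2, the _pext_u64 intrinsic from <immintrin.h> could take its place. Like the assembly, this packs eight 8-bit elements held in one 64-bit word, not four 64-bit lanes of a __m256i.

```c
#include <stdint.h>

/* Step 0: expand each bit of an 8-bit mask into a full byte (0xFF or 0x00). */
uint64_t expand_mask8(uint8_t mask)
{
    uint64_t wide = 0;
    for (int i = 0; i < 8; ++i) {
        if (mask & (1u << i)) {
            wide |= 0xFFull << (8 * i);
        }
    }
    return wide;
}

/* Step 1: portable stand-in for PEXT. Collects the bits of src selected
 * by mask and packs them contiguously starting at bit 0. */
uint64_t pext64_soft(uint64_t src, uint64_t mask)
{
    uint64_t result = 0;
    int out_bit = 0;
    for (int bit = 0; bit < 64; ++bit) {
        if ((mask >> bit) & 1) {
            result |= ((src >> bit) & 1) << out_bit;
            ++out_bit;
        }
    }
    return result;
}

/* Steps 0 to 2 combined, mirroring the assembly: the kept bytes end up
 * at the high-order end of the returned word. */
uint64_t pack_bytes_high(uint64_t data, uint8_t mask)
{
    uint64_t packed = pext64_soft(data, expand_mask8(mask));

    int kept = 0;
    for (int i = 0; i < 8; ++i) {
        kept += (mask >> i) & 1;
    }
    if (kept == 0) {
        return 0; /* a shift by 64 would be undefined behaviour in C */
    }
    /* Step 2: shift left by 8 bits for every byte that was dropped. */
    return packed << (8 * (8 - kept));
}
```

If "left" is meant in memory order on a little-endian machine, the PEXT result already has the kept bytes first and the final shift can be skipped. Which convention applies depends on how the caller lays out the data.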

Hope this helps.

Shift elements to the left of a SIMD register based on a boolean mask
