Question

带有-O3 -mavx -mtune=haswell的gcc 5.3 for x86-64使surprisingly bulky code能够处理代码的潜在错位输入，如：

// convenient simple example of compiler input
// I'm not actually interested in this for any real program
void floatmul(float *a) {
  for (int i=0; i<1024 ; i++)
    a[i] *= 2;
}

clang使用未对齐的加载/存储指令，但gcc执行标量intro / outro和对齐的向量循环：它剥离了第一个最多7个未对齐的迭代，将其完全展开为

序列

    vmovss  xmm0, DWORD PTR [rdi]
    vaddss  xmm0, xmm0, xmm0      ; multiply by two
    vmovss  DWORD PTR [rdi], xmm0
    cmp     eax, 1
    je      .L13
    vmovss  xmm0, DWORD PTR [rdi+4]
    vaddss  xmm0, xmm0, xmm0
    vmovss  DWORD PTR [rdi+4], xmm0
    cmp     eax, 2
    je      .L14
    ...

这似乎非常可怕，尤其是对于具有uop缓存的CPU。我报告了一个gcc bug关于此问题，并建议使用gcc在剥离未对齐迭代时可以使用的更小/更好的代码。但它可能仍然不是最佳的。

这个问题是关于AVX 的最佳选择。我问的是gcc和其他编译器可以/应该使用的一般情况解决方案。（我没有找到任何gcc邮件列表点击讨论这个，但没有花很长时间看。）

可能会有多个答案，因为-mtune=haswell的最佳答案可能与-mtune=bdver3（压路机）的最佳答案不同。然后是允许指令集扩展时最佳的问题（例如，AVX2用于256b整数，BMI1用于在较少的指令中将计数转换为位掩码）。

我知道Agner Fog的优化装配指南，第13.5节访问未对齐数据和部分向量。他建议使用未对齐的访问，在开始和/或结束时进行重叠写入，或者从对齐的访问中对数据进行混洗（但PALIGNR仅需要imm8计数，因此2x pshufb / {{1} }）。他认为por没有用，可能是因为它对AMD的表现有多糟糕。我怀疑如果调整英特尔，那值得考虑。如何生成正确的掩码并不明显，因此问题标题。

可能会发现，最好只使用未对齐的访问，就像clang那样。对于短缓冲区，对齐的开销可能会从避免主循环的cacheline拆分中获得任何好处。对于大缓冲区，主内存或L3作为瓶颈可能会隐藏高速缓存行拆分的惩罚。如果有人有实验数据支持他们调整的任何真实代码，那也是有用的信息。

VMASKMOVPS确实可用于英特尔目标。（SSE版本很糟糕，有一个隐含的非时间提示，但AVX版本没有。甚至还有一个新的内在函数，以确保你没有获得128b操作数的SSE版本：VMASKMOVPS ）AVX版本为only a little bit slow on Haswell：

3 uops / 4c延迟/ 1-per-2c吞吐量作为负载。
作为256b存储的4 uops / 14c延迟/ 1-per-2c吞吐量。
作为128b商店的4 uops / 13c延迟/ 1-per-1c吞吐量。

AMD CPU上的商店形式仍然非常缓慢，包括Jaguar（每22c输出1个）和Bulldozer系列：Steamroller上的每16c 1个（Bulldozer上类似），或Piledriver上每1~180c吞吐量1个。

但是如果我们确实想要使用_mm128_maskstore_ps，我们需要一个在每个元素中设置高位的向量，它应该实际加载/存储。 PALIGNR和PSRLDQ（用于全向量的向量）只需要编译时常量计数。

请注意，其他位无关紧要：它不必是全1，因此将一些设置位分散到元素的高位是有可能的。

Answer 1

将VMOVMASKPS的掩码从窗口加载到表中。 AVX2，或带有一些额外指令或更大表的AVX1。

掩码还可以用于寄存器中的ANDPS缩减，需要对每个元素进行一次精确计数。正如Stephen Canon在对OP的评论中指出的那样，流水线负载可以允许重叠的未对齐商店即使是像我选择的示例那样的就地重写功能，因此VMASKMOVPS 不这里是最好的选择。

这应该适用于Intel CPU，尤其是Haswell以及后来的AVX2。

Agner Fog获取pshufb掩码的方法实际上提供了一个非常有效的想法：从表中获取未对齐的负载来获取数据窗口。而不是使用巨大的掩码表，而是使用索引作为对内存中数据进行字节移位的方法。

屏蔽LSB-第一个字节顺序（因为它们存储在内存中），而不是矢量中{X3,X2,X1,X0}个元素的常用表示法。如上所述，它们与对齐的窗口对齐，包括内存中输入数组的开始/结束。

start misalign count = 0：mask = all-ones（Aligned case）
start misalign count = 1：mask = {0,-1,-1,-1,-1,-1,-1,-1}（在第一个32B中跳过一个）
...
开始错位计数= 7：mask = {0, 0, 0, 0, 0, 0, 0,-1}（跳过除前一个32B中的一个）
结束错位计数= 0：没有尾随元素。 mask = all-ones（Aligned case）。
这是奇怪的情况，与count = 1 不相似。这个特殊情况的一些额外指令值得避免额外的循环迭代和使用全零掩码的清理。
end misalign count = 1：一个尾随元素。 mask = {-1, 0, 0, 0, 0, 0, 0, 0}
...
结束错位数= 7：七个尾随元素。 mask = {-1,-1,-1,-1,-1,-1,-1, 0}

未经测试的代码，假设存在错误

section .data
align 32  ; preferably no cache-line boundaries inside the table

; byte elements, to be loaded with pmovsx. all-ones sign-extends
    DB  0,  0,  0,  0,  0,  0,  0,  0
masktable_intro:                      ; index with 0..-7
    DB -1, -1, -1, -1, -1, -1, -1, -1
masktable_outro:                      ; index with -8(aligned), or -1..-7
    DB  0,  0,  0,  0,  0,  0,  0,  0

; the very first and last 0 bytes are not needed, since we avoid an all-zero mask.


section .text
global floatmul   ; (float *rdi)
floatmul:
    mov    eax, edi
    and    eax, 0x1c            ; 0x1c = 7 << 2 = 0b11100

    lea    rdx, [rdi + 4096 - 32]  ; one full vector less than the end address (calculated *before* masking for alignment).
                ;;  replace 4096 with rsi*4 if rsi has the count (in floats, not bytes)

    and    rdi, ~0x1c           ; Leave the low 2 bits alone, so this still works on misaligned floats.

    shr    eax, 2               ; misalignment-count, in the range [0..7]

    neg        rax
    vpmovsxbd  ymm0, [masktable_intro + rax]  ; Won't link on OS X: Need a separate LEA for RIP-relative

    vmaskmovps ymm1, ymm0, [rdi]
    vaddps     ymm1, ymm1, ymm1   ; *= 2.0
    vmaskmovps [rdi], ymm0, ymm1

    ;;; also prepare the cleanup mask while the table is still hot in L1 cache

    ; if the loop count known to be a multiple of the vector width,
    ; the alignment of the end will be the same as the alignment of the start
    ; so we could just invert the mask
    ; vpxor    xmm1, xmm1, xmm1     ; doesn't need an execution unit
    ; vpcmpeqd ymm0, ymm1, ymm0

    ; In the more general case: just re-generate the mask from the one-past-the-end addr
    mov    eax, edx
    xor    ecx, ecx      ; prep for setcc
    and    eax, 0x1c     ; sets ZF when aligned
    setz   cl            ; rcx=1 in the aligned special-case, else 0
    shr    eax, 2
    lea    eax, [rax + rcx*8]   ; 1..7, or 8 in the aligned case
    neg    rax
    vpmovsxbd  ymm0, [masktable_outro + rax]


.loop:
    add      rdi, 32
    vmovups  ymm1, [rdi]    ; Or vmovaps if you want to fault if the address isn't 4B-aligned
    vaddps   ymm1, ymm1, ymm1   ; *= 2.0
    vmovups  [rdi], ymm1
    cmp      rdi, rdx           ; while( (p+=8) < (start+1024-8) )
    jb    .loop        ; 5 fused-domain uops, yuck.

    ; use the outro mask that we generated before the loop for insn scheduling / cache locality reasons.

    vmaskmov ymm1, ymm0, [rdi]
    vaddps     ymm1, ymm1, ymm1   ; *= 2.0
    vmaskmovps [rdi], ymm0, ymm1
    ret


    ; vpcmpeqd ymm1, ymm1, ymm1   ; worse way to invert the mask: dep-chain breaker but still needs an execution unit to make all-ones instead of all-zeros.
    ; vpxor    ymm0, ymm0, ymm1

这确实需要来自表的加载，该表可以在L1高速缓存中丢失，并且需要15B的表数据。（如果循环计数也是可变的，则为24B，我们必须单独生成结束掩码。）

无论哪种方式，在生成错位计数和对齐的起始地址的4条指令之后，获取掩码只需要单 vpmosvsxbd指令。（ymm，mem形式可以微融合，因此它是2 uops）。这需要AVX2。

没有AVX2：

2x vpmovsxbd分为两个128b寄存器（[masktable_intro + rax]和[masktable_intro + rax + 4]）
vinsertf128

或者:(更多的insn，更多的shuffle-port压力，但负载端口压力更小）

vpmovsxbw成128b寄存器
vpunpcklwd / vpunpckhwd分为两个xmm regs（两个src1 = src2）
vinsertf128

或者：

vmovdqu来自60B的DWORD表（DD）而不是字节（DB）。这实际上保存相对于AVX2的insn：address & 0x1c是索引，不需要右移2。整个表仍然适合缓存行，但是没有其他常量的空间可以使用。

开销：

整数运算：在开始时使用5个uop来获取索引并对齐开始指针。 7 uops获取结束掩码的索引。如果循环元素计数是矢量宽度的倍数，那么除了简单地使用未对齐之外，总共有12个GP寄存器uop。
AVX2：两个2-fused-domain-uop vector insn从GP reg中的[0..7]索引转到YMM reg中的掩码。（一个用于开始掩码，一个用于结束掩码）。使用24B表，在具有字节粒度的8B窗口中访问。
AVX：六个1-fused-domain-uop vector insns（三个开始，三个结束）。对于表的RIP相对寻址，这些指令中的四个将是[base+index]并且不会使用微熔丝，因此额外的两个整数insn可能会更好。

循环内的代码被复制3次。

TODO：写一个动态生成掩码的另一个答案，可能是64b reg中的字节，然后将其解压缩到256b。也许是一个位移，或BMI2的BZHI（-1，计数）？

Answer 2

仅限AVX：在开始/结束时进行未对齐访问，管道加载以避免在重写时出现问题。

感谢@StephenCanon指出，对于VMASKMOVPS可以帮助循环未对齐缓冲区的任何事情，这比VMASKMOVPS要好。

这可能有点期待编译器做循环转换，尤其是。因为明显的方式可以使Valgrind不高兴（见下文）。

section .text
global floatmul   ; (float *rdi)
floatmul:

    lea    rdx, [rdi + 4096 - 32]  ; one full vector less than the end address (calculated *before* masking for alignment).
                ;;  replace 4096 with rsi*4 if rsi has the count (in floats, not bytes)

    vmovups  ymm0, [rdi]
    vaddps   ymm0, ymm0, ymm0   ; *= 2.0
    ; don't store yet

    lea    rax, [rdi+32]
    and    rax, ~0x1c           ; 0x1c = 7 << 2 = 0b11100
    vmovups  ymm1, [rax]        ; first aligned vector, for use by first loop iteration

    vmovups  [rdi], ymm0        ; store the first unaligned vector
    vmovups  ymm0, [rdx]        ; load the *last* unaligned vector

.loop:
      ;; on entry: [rax] is already loaded into ymm1
    vaddps   ymm1, ymm1, ymm1   ; *= 2.0
    vmovups  [rax]              ; vmovaps would fault if p%4 != 0
    add      rax, 32
    vmovups  ymm1, [rax]
    cmp      rax, rdx           ; while( (p+=8) < (endp-8) );
    jb  .loop

    ; discard ymm1.  It includes data from beyond the end of the array (aligned case: same as ymm0)

    vaddss   ymm0, ymm0, ymm0   ; the last 32B, which we loaded before the loop
    vmovups  [rdx], ymm0
    ret

    ;   End alignment:
    ; a[] = XXXX XXXX ABCD E___    _ = garbage past the end
    ;                  ^rdx
    ;       ^rax ^rax ^rax ^rax(loop exit)

    ;   ymm0 = BCDE
    ;   ymm1 loops over ..., XXXX, ABCD, E___
    ;   The last load off the end of the array includes garbage
    ;     because we pipeline the load for the next iteration

在循环开始时从数组末尾进行加载似乎有点奇怪，但希望它不会混淆硬件预取程序，或者减慢从内存中获取数组流的开始。

开销：

总共2个额外的整数微量（用于设置aligned-start）。我们已经将结束指针用于正常的循环结构，因此它是免费的。
循环体的2个额外副本（加载/计算/存储）。（第一次和最后一次迭代去皮）。

自动向量化时，编译器可能不会对发出这样的代码感到高兴。 Valgrind will report accesses outside of array bounds，通过单步执行和解码指令来查看他们正在访问的内容。因此，仅仅停留在与数组中最后一个元素相同的页面（和缓存行）中是不够的。另请注意，如果输入指针不是4B对齐的，我们可能会读入另一个页面和段错误。

为了让Valgrind满意，我们可以提前停止循环两个向量宽度，以完成数组未对齐的最后一个向量宽度的特殊情况加载。这需要在一个额外的时间内复制循环体（在这个例子中是微不足道的，但是故意这是微不足道的。）或者可以通过让介绍代码跳到循环的中间来避免流水线操作。（这对于uop-cache来说可能不是最佳的，但是：循环体的（部分）可能会在uop缓存中结束两次。）

TODO：编写一个在中途跳入循环的版本。

使用未对齐缓冲区进行矢量化：使用VMASKMOVPS：根据未对齐计数生成掩码？或者根本不使用那个insn

2 个答案:

将VMOVMASKPS的掩码从窗口加载到表中。 AVX2，或带有一些额外指令或更大表的AVX1。

没有AVX2：

开销：

仅限AVX：在开始/结束时进行未对齐访问，管道加载以避免在重写时出现问题。

开销：