Arithmetic optimization in C for loops

Posted: 2016-10-20 04:43:32

Tags: for-loop optimization arithmetic-expressions

I have two functions with for loops that look very similar. The amount of data to process is very large, so I'm trying to optimize the loops as much as possible. The second function executes in 320 seconds, but the first one takes 460 seconds. Can anyone suggest what causes the difference and how to optimize the computation?

    /* jj, c1 and c2 are initialized earlier in the full function (not shown) */
    int ii, jj;
    double c1, c2;

    for (ii = 0; ii < n; ++ii) {
        a[jj] += b[ii] * c1;
        a[++jj] += b[ii] * c2;
    }

The second:

    /* again, initialized earlier (not shown) */
    int ii, jj;
    double c1, c2;

    for (ii = 0; ii < n; ++ii) {
        b[ii] += a[jj] * c1;
        b[ii] += a[++jj] * c2;
    }

Here is the assembly output of the first loop:

    movl    -104(%rbp), %eax
    movq    -64(%rbp), %rcx
    cmpl    (%rcx), %eax
    jge LBB0_12
## BB#10:                               ##   in Loop: Header=BB0_9 Depth=5
    movslq  -88(%rbp), %rax
    movq    -48(%rbp), %rcx
    movsd   (%rcx,%rax,8), %xmm0    ## xmm0 = mem[0],zero
    mulsd   -184(%rbp), %xmm0
    movslq  -108(%rbp), %rax
    movq    -224(%rbp), %rcx        ## 8-byte Reload
    addsd   (%rcx,%rax,8), %xmm0
    movsd   %xmm0, (%rcx,%rax,8)
    movslq  -88(%rbp), %rax
    movq    -48(%rbp), %rdx
    movsd   (%rdx,%rax,8), %xmm0    ## xmm0 = mem[0],zero
    mulsd   -192(%rbp), %xmm0
    movl    -108(%rbp), %esi
    addl    $1, %esi
    movl    %esi, -108(%rbp)
    movslq  %esi, %rax
    addsd   (%rcx,%rax,8), %xmm0
    movsd   %xmm0, (%rcx,%rax,8)
    movl    -88(%rbp), %esi
    addl    $1, %esi
    movl    %esi, -88(%rbp)

And of the second:

    movl    -104(%rbp), %eax
    movq    -64(%rbp), %rcx
    cmpl    (%rcx), %eax
    jge LBB0_12
## BB#10:                               ##   in Loop: Header=BB0_9 Depth=5
    movslq  -108(%rbp), %rax
    movq    -224(%rbp), %rcx        ## 8-byte Reload
    movsd   (%rcx,%rax,8), %xmm0    ## xmm0 = mem[0],zero
    mulsd   -184(%rbp), %xmm0
    movslq  -88(%rbp), %rax
    movq    -48(%rbp), %rdx
    addsd   (%rdx,%rax,8), %xmm0
    movsd   %xmm0, (%rdx,%rax,8)
    movl    -108(%rbp), %esi
    addl    $1, %esi
    movl    %esi, -108(%rbp)
    movslq  %esi, %rax
    movsd   (%rcx,%rax,8), %xmm0    ## xmm0 = mem[0],zero
    mulsd   -192(%rbp), %xmm0
    movslq  -88(%rbp), %rax
    movq    -48(%rbp), %rdx
    addsd   (%rdx,%rax,8), %xmm0
    movsd   %xmm0, (%rdx,%rax,8)
    movl    -88(%rbp), %esi
    addl    $1, %esi
    movl    %esi, -88(%rbp)

The original functions are much larger, so here I'm only providing the parts responsible for those for loops. The rest of the C code and its assembly output are completely identical for both functions.

1 answer:

Answer 0 (score: 1)

The structure of that calculation is pretty weird, but it can be optimized significantly. Some problems with that code are

  • reloading data from a pointer after writing through another pointer that isn't known not to alias it. I assume they won't alias, because this algorithm would be even weirder if that were allowed, but if they really are supposed to alias, ignore this. In general, structure your loop body as: first load everything, do the calculations, then store back. Don't mix loading and storing; it makes the compiler more conservative (see the restrict sketch after this list).
  • reloading data that was stored in the previous iteration. The compiler can see through this to some extent, but it complicates matters. Don't do it.
  • implicitly treating the first and last items differently. It looks like a nice homogeneous loop at first, but due to its weird structure it's actually special-casing the first and last elements.
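
On that first point: if you can guarantee the arrays never overlap, say so with C99's restrict, so the compiler doesn't have to assume that a store through one pointer clobbers data it already loaded through the other. A minimal sketch of the first loop with that annotation (the function name and signature are mine, not from the question):

    /* restrict promises the compiler that a and b never overlap, so the
       b[ii] load does not have to be repeated after the store to a[jj]. */
    void scaled_add(double *restrict a, const double *restrict b,
                    double c1, double c2, int jj, int n)
    {
        int ii;
        for (ii = 0; ii < n; ++ii) {
            a[jj] += b[ii] * c1;      /* same body as the original, with   */
            a[jj + 1] += b[ii] * c2;  /* a[++jj] split into index + bump   */
            jj++;
        }
    }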

So let's first fix the second loop, which is simpler. The only problem here is the first store to b[ii], which has to Really Happen(tm) because it might alias with a[jj + 1]. But it can trivially be rewritten so that the problem goes away:

    for (ii = 0; ii < n; ++ii) {
        b[ii] += a[jj] * c1 + a[jj + 1] * c2;
        jj++;
    }

You can tell by the assembly output that the compiler is happier now, and of course benchmarking confirms it's faster.
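
If you want to reproduce the comparison, a minimal timing harness along these lines works (a sketch: the array size, repeat count and initial values are placeholders of mine, not values from the question, and clock_gettime assumes a POSIX system):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N    (1 << 20)   /* placeholder problem size */
    #define REPS 100         /* placeholder repeat count */

    int main(void)
    {
        double *a = malloc((N + 1) * sizeof *a);  /* a[ii + 1] needs one extra slot */
        double *b = malloc(N * sizeof *b);
        double c1 = 1.5, c2 = 0.5;
        struct timespec t0, t1;
        int i, ii, r;

        for (i = 0; i <= N; ++i) a[i] = i;
        for (i = 0; i < N; ++i)  b[i] = i;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (r = 0; r < REPS; ++r)
            for (ii = 0; ii < N; ++ii)
                b[ii] += a[ii] * c1 + a[ii + 1] * c2;  /* rewritten loop, jj == ii */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("%.3f s\n", (t1.tv_sec - t0.tv_sec)
                         + (t1.tv_nsec - t0.tv_nsec) / 1e9);
        printf("checksum: %f\n", b[N / 2]);  /* keeps the work observable */
        free(a);
        free(b);
        return 0;
    }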

Old asm (only main loop, not the extra cruft):

    .LBB0_14:                               # =>This Inner Loop Header: Depth=1
        vmulpd  ymm4, ymm2, ymmword ptr [r8 - 8]
        vaddpd  ymm4, ymm4, ymmword ptr [rax]
        vmovupd ymmword ptr [rax], ymm4
        vmulpd  ymm5, ymm3, ymmword ptr [r8]
        vaddpd  ymm4, ymm4, ymm5
        vmovupd ymmword ptr [rax], ymm4
        add     r8, 32
        add     rax, 32
        add     r11, -4
        jne     .LBB0_14

New asm (only main loop):

    .LBB1_20:                               # =>This Inner Loop Header: Depth=1
        vmulpd  ymm4, ymm2, ymmword ptr [rax - 104]
        vmulpd  ymm5, ymm2, ymmword ptr [rax - 72]
        vmulpd  ymm6, ymm2, ymmword ptr [rax - 40]
        vmulpd  ymm7, ymm2, ymmword ptr [rax - 8]
        vmulpd  ymm8, ymm3, ymmword ptr [rax - 96]
        vmulpd  ymm9, ymm3, ymmword ptr [rax - 64]
        vmulpd  ymm10, ymm3, ymmword ptr [rax - 32]
        vmulpd  ymm11, ymm3, ymmword ptr [rax]
        vaddpd  ymm4, ymm4, ymm8
        vaddpd  ymm5, ymm5, ymm9
        vaddpd  ymm6, ymm6, ymm10
        vaddpd  ymm7, ymm7, ymm11
        vaddpd  ymm4, ymm4, ymmword ptr [rcx - 96]
        vaddpd  ymm5, ymm5, ymmword ptr [rcx - 64]
        vaddpd  ymm6, ymm6, ymmword ptr [rcx - 32]
        vaddpd  ymm7, ymm7, ymmword ptr [rcx]
        vmovupd ymmword ptr [rcx - 96], ymm4
        vmovupd ymmword ptr [rcx - 64], ymm5
        vmovupd ymmword ptr [rcx - 32], ymm6
        vmovupd ymmword ptr [rcx], ymm7
        sub     rax, -128
        sub     rcx, -128
        add     rbx, -16
        jne     .LBB1_20

That also got unrolled more (automatically). Not that unrolling is useless, but reducing loop overhead usually isn't such a big deal; it can mostly be handled by the ports that aren't busy with vector instructions. The more significant difference is the reduction in stores, which takes the store-to-load ratio from 2/3 (potentially bottlenecked by store throughput, where half of the stores are useless) to 4/12 (bottlenecked by something that really has to happen).
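
For reference, the unrolling the compiler applied corresponds roughly to writing the rewritten loop out by hand like this (a sketch only, assuming jj starts at 0 so that jj == ii; as the output above shows, the compiler does this automatically, so there's normally no reason to do it yourself):

    for (ii = 0; ii + 4 <= n; ii += 4) {  /* four results per trip */
        b[ii]     += a[ii]     * c1 + a[ii + 1] * c2;
        b[ii + 1] += a[ii + 1] * c1 + a[ii + 2] * c2;
        b[ii + 2] += a[ii + 2] * c1 + a[ii + 3] * c2;
        b[ii + 3] += a[ii + 3] * c1 + a[ii + 4] * c2;
    }
    for (; ii < n; ++ii)  /* scalar remainder */
        b[ii] += a[ii] * c1 + a[ii + 1] * c2;

In the assembly above, each of those four lines is additionally a 4-wide ymm vector operation, so one trip actually handles 16 elements.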

Now for the first loop: once you take out the first and last iterations, it's just adding two scaled b's to every a. Then we put the first and last iterations back in separately:

    a[0] += b[0] * c1;
    for (ii = 1; ii < n; ++ii) {
        a[ii] += b[ii - 1] * c2 + b[ii] * c1;
    }
    a[n] += b[n - 1] * c2;
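
(This peeled version assumes jj starts at 0 in the original loop; if it starts elsewhere, shift the a indices accordingly. It also touches a[0] through a[n], so a needs at least n + 1 elements.)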

That takes it from this (note that this isn't even vectorized):

    .LBB0_3:                                # =>This Inner Loop Header: Depth=1
        vmulsd  xmm3, xmm0, qword ptr [rsi + 8*rax]
        vaddsd  xmm2, xmm2, xmm3
        vmovsd  qword ptr [rdi + 8*rax], xmm2
        vmulsd  xmm2, xmm1, qword ptr [rsi + 8*rax]
        vaddsd  xmm2, xmm2, qword ptr [rdi + 8*rax + 8]
        vmovsd  qword ptr [rdi + 8*rax + 8], xmm2
        vmulsd  xmm3, xmm0, qword ptr [rsi + 8*rax + 8]
        vaddsd  xmm2, xmm2, xmm3
        vmovsd  qword ptr [rdi + 8*rax + 8], xmm2
        vmulsd  xmm2, xmm1, qword ptr [rsi + 8*rax + 8]
        vaddsd  xmm2, xmm2, qword ptr [rdi + 8*rax + 16]
        vmovsd  qword ptr [rdi + 8*rax + 16], xmm2
        lea     rax, [rax + 2]
        cmp     ecx, eax
        jne     .LBB0_3

To this:

    .LBB1_6:                                # =>This Inner Loop Header: Depth=1
        vmulpd  ymm4, ymm2, ymmword ptr [rbx - 104]
        vmulpd  ymm5, ymm2, ymmword ptr [rbx - 72]
        vmulpd  ymm6, ymm2, ymmword ptr [rbx - 40]
        vmulpd  ymm7, ymm2, ymmword ptr [rbx - 8]
        vmulpd  ymm8, ymm3, ymmword ptr [rbx - 96]
        vmulpd  ymm9, ymm3, ymmword ptr [rbx - 64]
        vmulpd  ymm10, ymm3, ymmword ptr [rbx - 32]
        vmulpd  ymm11, ymm3, ymmword ptr [rbx]
        vaddpd  ymm4, ymm4, ymm8
        vaddpd  ymm5, ymm5, ymm9
        vaddpd  ymm6, ymm6, ymm10
        vaddpd  ymm7, ymm7, ymm11
        vaddpd  ymm4, ymm4, ymmword ptr [rcx - 96]
        vaddpd  ymm5, ymm5, ymmword ptr [rcx - 64]
        vaddpd  ymm6, ymm6, ymmword ptr [rcx - 32]
        vaddpd  ymm7, ymm7, ymmword ptr [rcx]
        vmovupd ymmword ptr [rcx - 96], ymm4
        vmovupd ymmword ptr [rcx - 64], ymm5
        vmovupd ymmword ptr [rcx - 32], ymm6
        vmovupd ymmword ptr [rcx], ymm7
        sub     rbx, -128
        sub     rcx, -128
        add     r11, -16
        jne     .LBB1_6

Nice and vectorized this time, and much less storing and loading going on.

Both changes combined made it about twice as fast on my PC, but of course YMMV.

I still think this code is weird, though. Note how we modify a[n] in the last iteration of the first loop, then use it in the first iteration of the second loop, while the other a's just sort of stand to the side and watch. It's odd. Maybe it really has to be that way, but frankly it looks like a bug to me.