Question

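This is the typical kind of question that gets downvoted, so I hesitated a lot before posting it...

I know this question was marked as a duplicate, but my tests (if they are good: are they good? That is part of the question) tend to show that this is not the case.

At the beginning, I ran some tests comparing a for loop to a while loop.

They showed that the for loop was better.

But going further, for versus while was not the point: the difference is related to: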

for (int l = 0; l < loops;l++) {

or

for (int l = 0; l != loops;l++) {

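If you run it (under Windows 10, Visual Studio 2017, Release), you will see that the first one is more than twice as fast as the second one.

It is hard (for a novice like me) to tell whether the compiler, for some reason, is able to optimize one better than the other. But...

Short question

Why?

Longer question

The complete code is the following:

For the '<' loop: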

int forloop_inf(int loops, int iterations)
{
    int n = 0;
    int x = n;

    for (int l = 0; l < loops;l++) {
        for (int i = 0; i < iterations;i++) {
            n++;
            x += n;
        }
    }

    return x;
}

For the '!=' loop:

int forloop_diff(int loops, int iterations)
{
    int n = 0;
    int x = n;

    for (int l = 0; l != loops;l++) {
        for (int i = 0; i != iterations;i++) {
            n++;
            x += n;
        }
    }

    return x;
}

In both cases, the inner computation is only there to keep the compiler from skipping the loops entirely.

They are called, respectively, by:

printf("for loop inf %f\n", monitor_int(loops, iterations, forloop_inf, &result));
printf("%d\n", result);

and

printf("for loop diff %f\n", monitor_int(loops, iterations, forloop_diff, &result));
printf("%d\n", result);

where loops = 10 * 1000 and iterations = 1000 * 1000.

And where monitor_int is:

double monitor_int(int loops, int iterations, int(*func)(int, int), int *result)
{
    clock_t start = clock();

    *result = func(loops, iterations);

    clock_t stop = clock();

    return (double)(stop - start) / CLOCKS_PER_SEC;
}

The results, in seconds, are:

for loop inf 2.227 seconds
for loop diff 4.558 seconds

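So, even if the interest of all this depends on the weight of what is done inside the loop relative to the loop itself, why is there such a difference?

Edit

You can find the full source code here, revised so that the functions are called several times in random order.

The corresponding disassembly is here (obtained with dumpbin /DISASM CPerf2.exe).

Running it, I now obtain:

'!=' 0.045231 (average over 493 runs)
'<' 0.031010 (average over 507 runs)

I do not know how to set O3 in Visual Studio; the compile command line is the following: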

/permissive- /Yu"stdafx.h" /GS /GL /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /sdl /Fd"x64\Release\vc141.pdb" /Zc:inline /fp:precise /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /Gd /Oi /MD /FC /Fa"x64\Release\" /EHsc /nologo /Fo"x64\Release\" /Ot /Fp"x64\Release\CPerf2.pch" /diagnostics:classic

The code for the loops is above; here is the randomized way of running it:

typedef int(loop_signature)(int, int);

void loops_compare()
{
    int loops = 1 * 100;
    int iterations = 1000 * 1000;
    int result;

    loop_signature *functions[2] = {
        forloop_diff,
        forloop_inf
    };

    int n_rand = 1000;

    int n[2] = { 0, 0 };
    double cum[2] = { 0.0, 0.0 };

    for (int i = 0; i < n_rand;i++) {
        int pick = rand() % 2;
        loop_signature *fun = functions[pick];

        double time = monitor_int(loops, iterations, fun, &result);
        n[pick]++;
        cum[pick] += time;
    }

    printf("'!=' %f (%d) / '<' %f (%d)\n", cum[0] / (double)n[0], n[0], cum[1] / (double)n[1], n[1]);
}

And the disassembly (loop functions only, though I am not sure it is a good extract of the link above):

?forloop_inf@@YAHHH@Z:
  0000000140001000: 48 83 EC 08        sub         rsp,8
  0000000140001004: 45 33 C0           xor         r8d,r8d
  0000000140001007: 45 33 D2           xor         r10d,r10d
  000000014000100A: 44 8B DA           mov         r11d,edx
  000000014000100D: 85 C9              test        ecx,ecx
  000000014000100F: 7E 6F              jle         0000000140001080
  0000000140001011: 48 89 1C 24        mov         qword ptr [rsp],rbx
  0000000140001015: 8B D9              mov         ebx,ecx
  0000000140001017: 66 0F 1F 84 00 00  nop         word ptr [rax+rax]
                    00 00 00
  0000000140001020: 45 33 C9           xor         r9d,r9d
  0000000140001023: 33 D2              xor         edx,edx
  0000000140001025: 33 C0              xor         eax,eax
  0000000140001027: 41 83 FB 02        cmp         r11d,2
  000000014000102B: 7C 29              jl          0000000140001056
  000000014000102D: 41 8D 43 FE        lea         eax,[r11-2]
  0000000140001031: D1 E8              shr         eax,1
  0000000140001033: FF C0              inc         eax
  0000000140001035: 8B C8              mov         ecx,eax
  0000000140001037: 03 C0              add         eax,eax
  0000000140001039: 0F 1F 80 00 00 00  nop         dword ptr [rax]
                    00
  0000000140001040: 41 FF C1           inc         r9d
  0000000140001043: 83 C2 02           add         edx,2
  0000000140001046: 45 03 C8           add         r9d,r8d
  0000000140001049: 41 03 D0           add         edx,r8d
  000000014000104C: 41 83 C0 02        add         r8d,2
  0000000140001050: 48 83 E9 01        sub         rcx,1
  0000000140001054: 75 EA              jne         0000000140001040
  0000000140001056: 41 3B C3           cmp         eax,r11d
  0000000140001059: 7D 06              jge         0000000140001061
  000000014000105B: 41 FF C2           inc         r10d
  000000014000105E: 45 03 D0           add         r10d,r8d
  0000000140001061: 42 8D 0C 0A        lea         ecx,[rdx+r9]
  0000000140001065: 44 03 D1           add         r10d,ecx
  0000000140001068: 41 8D 48 01        lea         ecx,[r8+1]
  000000014000106C: 41 3B C3           cmp         eax,r11d
  000000014000106F: 41 0F 4D C8        cmovge      ecx,r8d
  0000000140001073: 44 8B C1           mov         r8d,ecx
  0000000140001076: 48 83 EB 01        sub         rbx,1
  000000014000107A: 75 A4              jne         0000000140001020
  000000014000107C: 48 8B 1C 24        mov         rbx,qword ptr [rsp]
  0000000140001080: 41 8B C2           mov         eax,r10d
  0000000140001083: 48 83 C4 08        add         rsp,8
  0000000140001087: C3                 ret
  0000000140001088: CC CC CC CC CC CC CC CC                          ÌÌÌÌÌÌÌÌ
?forloop_diff@@YAHHH@Z:
  0000000140001090: 45 33 C0           xor         r8d,r8d
  0000000140001093: 41 8B C0           mov         eax,r8d
  0000000140001096: 85 C9              test        ecx,ecx
  0000000140001098: 74 28              je          00000001400010C2
  000000014000109A: 44 8B C9           mov         r9d,ecx
  000000014000109D: 0F 1F 00           nop         dword ptr [rax]
  00000001400010A0: 85 D2              test        edx,edx
  00000001400010A2: 74 18              je          00000001400010BC
  00000001400010A4: 8B CA              mov         ecx,edx
  00000001400010A6: 66 66 0F 1F 84 00  nop         word ptr [rax+rax]
                    00 00 00 00
  00000001400010B0: 41 FF C0           inc         r8d
  00000001400010B3: 41 03 C0           add         eax,r8d
  00000001400010B6: 48 83 E9 01        sub         rcx,1
  00000001400010BA: 75 F4              jne         00000001400010B0
  00000001400010BC: 49 83 E9 01        sub         r9,1
  00000001400010C0: 75 DE              jne         00000001400010A0
  00000001400010C2: C3                 ret
00000001400010C3: CC CC CC CC CC CC CC CC CC CC CC CC CC ÌÌÌÌÌÌÌÌÌÌÌÌÌ

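Edit again:

What I also find surprising is the following:

In Debug, the performance is the same (and so is the assembly code).
So how can you be confident about what you are coding if such differences appear afterwards? (Assuming I have not made a mistake somewhere.)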

Answer 1

For proper benchmarking, it is important to run the functions multiple times and in random order.

typedef int(signature)(int, int);

...

int main() {
    int loops, iterations, runs;

    fprintf(stderr, "Loops: ");
    scanf("%d", &loops);
    fprintf(stderr, "Iterations: ");
    scanf("%d", &iterations);
    fprintf(stderr, "Runs: ");
    scanf("%d", &runs);

    fprintf(stderr, "Running for %d loops and %d iterations %d times.\n", loops, iterations, runs);

    signature *functions[2] = {
        forloop_inf,
        forloop_diff
    };

    int result = functions[0](loops, iterations);
    for( int i = 0; i < runs; i++ ) {
        int pick = rand() % 2;
        signature *function = functions[pick];

        int new_result;
        printf("%d %f\n", pick, monitor_int(loops, iterations, function, &new_result));
        if( result != new_result ) {
            fprintf(stderr, "got %d expected %d\n", new_result, result);
        }
    }
}

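With this we can do 1000 runs in random order and find the average times.

It is also important to benchmark with optimizations turned on. There is little point in asking how fast unoptimized code runs. I'll try both -O2 and -O3.

My finding is that with Apple LLVM version 8.0.0 (clang-800.0.42.1), doing 10000 loops and 1000000 iterations at -O2, forloop_inf is indeed about 50% faster than forloop_diff.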

forloop_inf: 0.000009
forloop_diff: 0.000014

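Looking at the generated assembly code for -O2, produced with clang -O2 -S -mllvm --x86-asm-syntax=intel test.c, I can see many differences between the two implementations. Maybe somebody who knows assembly can tell us why.

But at -O3, the performance difference is no longer discernible.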

forloop_inf: 0.000002
forloop_diff: 0.000002

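This is because at -O3 they are almost exactly the same. One uses je and one uses jle. That's it.

In conclusion, when benchmarking...

Do many runs.
Randomize the order.
Compile and run as close as possible to how you would in production.
- In this case, that means turning on compiler optimizations.
Look at the assembly code.

And most of all...

Pick the safest code, not the fastest.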

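i < max is safer than i != max because it will still terminate if i somehow jumps over max.

As demonstrated, with optimizations turned on they are both so fast that, even when not fully optimized, they get through 10,000,000,000 iterations in 0.000009 seconds. i < max or i != max is unlikely to be the performance bottleneck; whatever you are doing 10 billion times is far more likely to be.

But i != max could lead to a bug.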

Answer 2

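'<' is not faster than '!='. What is happening is something completely different.

A loop "for (i = 0; i < n; ++i)" is a pattern that the compiler recognises. If the loop body contains no instructions that modify i or n, the compiler knows that this loop executes exactly max(n - i, 0) times, where i is the starting value (so max(n, 0) here), and it can generate optimal code for that.

A loop "for (i = 0; i != n; ++i)" is used much less in practice, so compiler writers do not bother with it as much. The number of iterations is also much harder to determine. If i > n, then we have undefined behaviour for signed integers, unless there is a statement that exits the loop. For unsigned numbers, the number of iterations is tricky because it depends on the type of i. You will simply get less optimised code.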

Answer 3

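Always look at the generated code.

Many years ago, when some microprocessors (μPs) lacked certain conditional branch instructions or had very few flags, this used to be true. Some conditions then had to be compiled into a sequence of comparisons and jumps.

But today's processors have a very rich set of conditional branch instructions (some of them, such as ARM, also have many "regular" conditional instructions) and many flags, so this is no longer true.

You can play with the different conditions yourself here: https://godbolt.org/g/9DsqJm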

Is '<' (really) faster than '!=' in C?
