Question

I'm trying to understand why some simple loops run at the speeds they do.

First case:

L1:
    add rax, rcx  # (1)
    add rcx, 1    # (2)
    cmp rcx, 4096 # (3)
    jl L1

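According to IACA, the throughput is 1 cycle and the bottleneck is ports 0, 1, and 5. I don't understand why it is 1 cycle. After all, we have two loop-carried dependencies:

(1) -> (1) (Latency is 1)
(2) -> (2), (2) -> (1), (2) -> (3) (Latency is 1 + 1 + 1).

This latency is loop-carried, so it should make each iteration slower.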

Throughput Analysis Report
--------------------------
Block Throughput: 1.00 Cycles       Throughput Bottleneck: Port0, Port1, Port5

Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |
-------------------------------------------------------------------------
| Cycles | 1.0    0.0  | 1.0  | 0.0    0.0  | 0.0    0.0  | 0.0  | 1.0  |
-------------------------------------------------------------------------


| Num Of |              Ports pressure in cycles               |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |    |
---------------------------------------------------------------------
|   1    | 1.0       |     |           |           |     |     | CP | add rax, rcx
|   1    |           | 1.0 |           |           |     |     | CP | add rcx, 0x1
|   1    |           |     |           |           |     | 1.0 | CP | cmp rcx, 0x1000
|   0F   |           |     |           |           |     |     |    | jl 0xfffffffffffffff2
Total Num Of Uops: 3

Second case:

L1:    
    add rax, rcx
    add rcx, 1
    add rbx, rcx
    cmp rcx, 4096
    jl L1

Block Throughput: 1.65 Cycles       Throughput Bottleneck: InterIteration

Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |
-------------------------------------------------------------------------
| Cycles | 1.4    0.0  | 1.4  | 0.0    0.0  | 0.0    0.0  | 0.0  | 1.3  |


| Num Of |              Ports pressure in cycles               |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |    |
---------------------------------------------------------------------
|   1    | 0.6       | 0.3 |           |           |     |     |    | add rax, rcx
|   1    | 0.3       | 0.6 |           |           |     |     | CP | add rcx, 0x1
|   1    | 0.3       | 0.3 |           |           |     | 0.3 | CP | add rbx, rcx
|   1    |           |     |           |           |     | 1.0 | CP | cmp rcx, 0x1000
|   0F   |           |     |           |           |     |     |    | jl 0xffffffffffffffef

Here I understand even less why the throughput is 1.65.

Answer 1

In the first loop, there are two dep chains, one for rax and one for rcx.

add rax, rcx  # depends on rax and rcx from the previous iteration, produces rax for the next iteration

add rcx, 1    # latency = 1

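The 2-cycle latency dep chain of add rcx,1 -> add rax, rcx spans 2 iterations (so it already has time to happen), and it isn't even loop-carried (because rax doesn't feed back into add rcx,1).

In any given iteration, only the previous iteration's results are needed to produce this iteration's results. There are no loop-carried dependencies within an iteration, only between iterations.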
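The dep-chain argument above can be checked with a toy model. This is only a sketch under stated assumptions: perfect register renaming, unlimited execution ports, a perfectly predicted branch, and 1-cycle latency for add and cmp. The function name `cycles_per_iteration` and the `slow_counter` loop are hypothetical, made up for illustration.

```python
def cycles_per_iteration(instrs, iterations=1000):
    """Steady-state cycles per iteration when only data dependencies limit speed.

    instrs: list of (dest, sources, latency) tuples in program order.
    Assumes perfect renaming, unlimited ports and perfect branch prediction.
    """
    ready = {}    # register name -> cycle when its newest value is available
    finish = []   # cycle at which each iteration's last result is ready
    for _ in range(iterations):
        last = 0
        for dest, sources, latency in instrs:
            # An instruction starts once all of its inputs are ready.
            start = max((ready.get(reg, 0) for reg in sources), default=0)
            done = start + latency
            ready[dest] = done
            last = max(last, done)
        finish.append(last)
    return (finish[-1] - finish[0]) / (iterations - 1)


first_loop = [
    ("rax", ("rax", "rcx"), 1),    # add rax, rcx
    ("rcx", ("rcx",), 1),          # add rcx, 1
    ("flags", ("rcx",), 1),        # cmp rcx, 4096 (jl is predicted, so it is left out)
]
second_loop = [
    ("rax", ("rax", "rcx"), 1),    # add rax, rcx
    ("rcx", ("rcx",), 1),          # add rcx, 1
    ("rbx", ("rbx", "rcx"), 1),    # add rbx, rcx
    ("flags", ("rcx",), 1),        # cmp rcx, 4096
]
# Hypothetical contrast: a counter update with 3-cycle latency.
slow_counter = [("rcx", ("rcx",), 3)]

print(cycles_per_iteration(first_loop))    # 1.0
print(cycles_per_iteration(second_loop))   # 1.0
print(cycles_per_iteration(slow_counter))  # 3.0
```

In this model both loops from the question are limited to one iteration per cycle by latency alone, because the longest chain that feeds back into itself is a single 1-cycle add. Anything slower than that has to come from some other limit, such as execution ports.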

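Like I explained in an answer to your question a couple days ago, the cmp/jcc is not part of the loop-carried dep chain.

cmp is only part of a dep chain if a cmov or setcc reads the flag output it generates. Control dependencies are predicted, not waited for like data dependencies.

In practice, on my E6600 (first-gen Core2; I don't have a SnB available at the moment):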

; Linux initializes most registers to zero on process startup, and I'm lazy so I depended on this for this one-off test.  In real code, I'd xor-zero ecx
    global _start
_start:
L1:
    add eax, ecx        ; (1)
    add ecx, 1          ; (2)
    cmp ecx, 0x80000000 ; (3)
    jb L1            ; can fuse with cmp on Core2 (in 32bit mode)

    mov eax, 1
    int 0x80

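I ported it to 32-bit, since Core2 can only macro-fuse in 32-bit mode, and used jb, since Core2 can only macro-fuse unsigned branch conditions. I used a large loop counter so I didn't need another loop outside this one. (IDK why you picked a tiny loop count like 4096. Are you sure you weren't measuring extra overhead from something else outside your short loop?)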

$ yasm -Worphan-labels -gdwarf2 -felf tinyloop.asm && ld -m elf_i386 -o tinyloop tinyloop.o
$ perf stat -e task-clock,cycles,instructions,branches ./tinyloop

Performance counter stats for './tinyloop':

    897.994122      task-clock (msec)         #    0.993 CPUs utilized          
 2,152,571,449      cycles                    #    2.397 GHz                    
 8,591,925,034      instructions              #    3.99  insns per cycle        
 2,147,844,593      branches                  # 2391.825 M/sec                  

   0.904020721 seconds time elapsed

So it runs at 3.99 insns per cycle, which means ~one iteration per cycle.
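As a quick check of that arithmetic, here is a small sketch that turns the cycle and instruction counts from the perf output above into IPC and cycles per iteration. The helper name `ipc_and_cycles_per_iter` is hypothetical. The count of 4 instructions per iteration comes from the loop body (add, add, cmp, jb).

```python
def ipc_and_cycles_per_iter(cycles, instructions, insns_per_iter):
    """Convert raw perf counts into IPC and cycles per loop iteration."""
    ipc = instructions / cycles
    return ipc, insns_per_iter / ipc


# Counts copied from the perf stat output for ./tinyloop.
ipc, cyc = ipc_and_cycles_per_iter(2_152_571_449, 8_591_925_034, insns_per_iter=4)
print(f"{ipc:.2f} IPC, {cyc:.3f} cycles per iteration")
```

The same conversion applies to the other numbers quoted in this answer: divide the instructions per iteration by the measured IPC.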

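I'd be surprised if your Ivybridge ran that exact code at only about half the speed. Update: per discussion in chat, yes, it appears IVB really does only get 2.14 IPC (one iteration per 1.87c). Changing add rax, rcx to add rax, rbx or something similar, to remove the dependency on the loop counter from the previous iteration, brings the throughput up to 3.8 IPC (one iteration per 1.05c). I don't understand why this happens.

With a similar loop that doesn't depend on macro-fusion (add / inc ecx / jnz), I also get one iteration per 1c (2.99 insns per cycle).

However, having a 4th insn in the loop that also reads ecx makes it massively slower. Core2 can issue 4 uops per clock, even though (like SnB/IvB) it only has three ALU ports. (A lot of code includes memory uops, so this does make sense.)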

add eax, ecx       ; changing this to add eax,ebx  helps when there are 4 non-fusing insns in the loop
; add edx, ecx     ; slows us down to 1.34 IPC, or one iter per 3c
; add edx, ebx     ; only slows us to 2.28 IPC, or one iter per 1.75c
                   ; with neither:    3    IPC, or one iter per 1c
inc ecx
jnz L1             ; loops 2^32 times, doesn't macro-fuse on Core2

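I expected this to still run at 3 IPC, or one iteration per 4/3 = 1.333c. However, pre-SnB CPUs have many more bottlenecks, like ROB-read and register-read bottlenecks. SnB's switch to a physical register file eliminated those slowdowns.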
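The 1.333c expectation is just a port-count bound, which can be sketched as follows. This assumes every uop in the loop needs one of three ALU ports and that the front end issues up to 4 uops per clock, as described above. The function name `throughput_bound` is hypothetical, and the model ignores the ROB-read and register-read limits mentioned above.

```python
def throughput_bound(alu_uops, alu_ports=3, issue_width=4):
    """Lower bound on cycles per iteration from ALU ports and issue width alone."""
    return max(alu_uops / alu_ports, alu_uops / issue_width)


three_uop_loop = throughput_bound(3)   # add / inc / jnz
four_uop_loop = throughput_bound(4)    # add / add / inc / jnz

print(f"3 uops: {three_uop_loop:.3f} cycles per iteration")
print(f"4 uops: {four_uop_loop:.3f} cycles per iteration, {4 / four_uop_loop:.2f} IPC")
```

Measured results slower than this bound, like the 3c and 1.75c cases in the comments above, point to a bottleneck that the simple port count does not capture.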

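In your 2nd loop, IDK why it doesn't run at one iteration per 1.333c. The insn updating rbx can't run until after the other instructions from that iteration, but that's what out-of-order execution is for. Are you sure it's as slow as one iteration per 1.85 cycles? Did you measure with perf over a high enough count to get meaningful data? (rdtsc cycle counts aren't accurate unless you disable turbo and frequency scaling, but perf counters still count actual core cycles.)

I wouldn't expect it to be much different from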

L1:    
    add rax, rcx
    add rbx, rcx      # before/after inc rcx shouldn't matter because of out-of-order execution
    add rcx, 1
    cmp rcx, 4096
    jl L1

Latency of short loop