Question

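Good day!

I'm reading this wonderful article: What every programmer should know about memory. Right now I'm trying to figure out how CPU caches work and to reproduce the cache-miss experiment. The goal is to reproduce the performance degradation that appears as the amount of accessed data grows (figure 3.4). I wrote a small program that should reproduce the degradation, but it doesn't. The performance only drops after I allocate more than 4 GB of memory, and I don't understand why. I think it should appear at around 12 MB, or maybe 100 MB, of allocated memory. Maybe the program is wrong and I'm missing something? I'm using: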

Intel Core i7-2630QM
L1: 256Kb
L2: 1Mb
L3: 6Mb

Here is the Go listing.

main.go

package main

import (
    "fmt"
    "math/rand"
)

const (
    n0 = 1000
    n1 = 100000
)

func readInt64Time(slice []int64, idx int) int64

func main() {
    ss := make([][]int64, n0)
    for i := range ss {
        ss[i] = make([]int64, n1)
        for j := range ss[i] {
            ss[i][j] = int64(i + j)
        }
    }
    var t int64
    for i := 0; i < n0; i++ {
        for j := 0; j < n1; j++ {
            t0 := readInt64Time(ss[i], rand.Intn(n1))
            if t0 <= 0 {
                panic(t0)
            }
            t += t0
        }
    }
    fmt.Println("Avg time:", t/int64(n0*n1))
}

main.s

// func readInt64Time(slice []int64, idx int) int64
TEXT ·readInt64Time(SB),$0-40
    MOVQ    slice+0(FP), R8
    MOVQ    idx+24(FP), R9
    RDTSC
    SHLQ    $32, DX
    ORQ     DX, AX
    MOVQ    AX, R10
    MOVQ    (R8)(R9*8), R8 // Here I'm reading the memory
    RDTSC
    SHLQ    $32, DX
    ORQ     DX, AX
    SUBQ    R10, AX
    MOVQ    AX, ret+32(FP)
    RET

Answer 1

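For those who are interested: I reproduced the 'cache-miss' behavior, but the performance degradation is not as dramatic as the article describes. Here is the final benchmark listing:

main.go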

package main

import (
    "fmt"
    "math/rand"
    "runtime"
    "runtime/debug"
)

func readInt64Time(slice []int64, idx int) int64

const count = 2 << 25

func measure(np uint) {
    n := 2 << np
    s := make([]int64, n)
    for i := range s {
        s[i] = int64(i)
    }
    t := int64(0)
    n8 := n >> 3
    for i := 0; i < count; i++ {
        // Index is 64-byte aligned, since a cache line is 64 bytes
        t0 := readInt64Time(s, rand.Intn(n8)<<3)
        t += t0
    }
    fmt.Printf("Allocated %d Kb. Avg time: %v\n",
        n/128, t/count)
}

func main() {
    debug.SetGCPercent(-1) // To eliminate GC influence
    for i := uint(10); i < 27; i++ {
        measure(i)
        runtime.GC()
    }
}

main_amd64.s

// func readInt64Time(slice []int64, idx int) int64
TEXT ·readInt64Time(SB),$0-40
    MOVQ    slice+0(FP), R8
    MOVQ    idx+24(FP), R9
    RDTSC
    SHLQ    $32, DX
    ORQ     DX, AX
    MOVQ    AX, R10
    MOVQ    (R8)(R9*8), R11 // Read memory
    MOVQ    $0, (R8)(R9*8) // Write memory
    RDTSC
    SHLQ    $32, DX
    ORQ     DX, AX
    SUBQ    R10, AX
    MOVQ    AX, ret+32(FP)
    RET

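I disabled the garbage collector to eliminate its influence, and I made the index 64-byte aligned because my processor has 64-byte cache lines.

The benchmark results are: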
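The index alignment used in the benchmark can be checked in isolation. The following is a minimal sketch, assuming 8-byte int64 elements and a 64-byte cache line; `alignedIndex` is a hypothetical helper name that only mirrors the expression `rand.Intn(n8)<<3` from the listing above. Note that it aligns the offset within the slice; it does not by itself guarantee that the slice's base address is 64-byte aligned.

```go
package main

import (
	"fmt"
	"math/rand"
)

// alignedIndex returns a random index into a slice of n int64 values.
// The index is always a multiple of 8 elements, i.e. a multiple of
// 64 bytes from the start of the slice (hypothetical helper).
func alignedIndex(n int) int {
	return rand.Intn(n>>3) << 3
}

func main() {
	n := 2 << 10 // 2048 int64 values = 16 KB
	for i := 0; i < 5; i++ {
		idx := alignedIndex(n)
		fmt.Printf("index %4d -> byte offset %5d (offset %% 64 = %d)\n",
			idx, idx*8, (idx*8)%64)
	}
}
```

Every printed byte offset should leave remainder 0 when divided by 64, so each access touches the first word of a distinct cache line.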

Allocated 16 Kb. Avg time: 22
Allocated 32 Kb. Avg time: 22
Allocated 64 Kb. Avg time: 22
Allocated 128 Kb. Avg time: 22
Allocated 256 Kb. Avg time: 22
Allocated 512 Kb. Avg time: 23
Allocated 1024 Kb. Avg time: 23
Allocated 2048 Kb. Avg time: 24
Allocated 4096 Kb. Avg time: 25
Allocated 8192 Kb. Avg time: 29
Allocated 16384 Kb. Avg time: 31
Allocated 32768 Kb. Avg time: 33
Allocated 65536 Kb. Avg time: 34
Allocated 131072 Kb. Avg time: 34
Allocated 262144 Kb. Avg time: 35
Allocated 524288 Kb. Avg time: 35
Allocated 1048576 Kb. Avg time: 39

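I ran this benchmark several times, and every run gave similar results. If I remove the read and write operations from the asm code, I get 22 cycles for every allocation size, so the difference above 22 is the memory access time. As you can see, the first transition happens at the 512 Kb allocation size. It is only one CPU cycle, but it is very stable. The next change is at 2 Mb. The most significant change is at 8 Mb, but it is still only 4 cycles, and at that point we are completely out of cache.

After all these tests I see no dramatic cost for a cache miss. It is still significant, since the time difference is 10-15 times, but it is not the 50-500 times we saw in the article. Maybe today's memory is significantly faster than it was 7 years ago? That looks promising =) Maybe in another 7 years there will be architectures with no CPU caches at all. We'll see.

EDIT: As @Leeor mentioned, the RDTSC instruction is not serializing, so out-of-order execution can distort the measurement. I tried the RDTSCP instruction instead: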

main_amd64.s

// func readInt64Time(slice []int64, idx int) int64
TEXT ·readInt64Time(SB),$0-40
    MOVQ    slice+0(FP), R8
    MOVQ    idx+24(FP), R9
    BYTE $0x0F; BYTE $0x01; BYTE $0xF9; // RDTSCP
    SHLQ    $32, DX
    ORQ     DX, AX
    MOVQ    AX, R10
    MOVQ    (R8)(R9*8), R11 // Read memory
    MOVQ    $0, (R8)(R9*8) // Write memory
    BYTE $0x0F; BYTE $0x01; BYTE $0xF9; // RDTSCP
    SHLQ    $32, DX
    ORQ     DX, AX
    SUBQ    R10, AX
    MOVQ    AX, ret+32(FP)
    RET

Here is what I got with this change:

Allocated 16 Kb. Avg time: 27
Allocated 32 Kb. Avg time: 27
Allocated 64 Kb. Avg time: 28
Allocated 128 Kb. Avg time: 29
Allocated 256 Kb. Avg time: 30
Allocated 512 Kb. Avg time: 34
Allocated 1024 Kb. Avg time: 42
Allocated 2048 Kb. Avg time: 55
Allocated 4096 Kb. Avg time: 120
Allocated 8192 Kb. Avg time: 167
Allocated 16384 Kb. Avg time: 173
Allocated 32768 Kb. Avg time: 189
Allocated 65536 Kb. Avg time: 201
Allocated 131072 Kb. Avg time: 215
Allocated 262144 Kb. Avg time: 224
Allocated 524288 Kb. Avg time: 242
Allocated 1048576 Kb. Avg time: 281

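Now I see a big difference between cache and RAM access. The time is actually 2 times lower than in the article, but that is to be expected, since the memory frequency is twice as high.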

Answer 2

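This indeed doesn't produce the observation you were after, and it isn't clear what your benchmark really does - are you accessing memory randomly within the range? Are you measuring the latency of each individual access?

It seems that your benchmark incurs a constant overhead per measurement that is never amortized, so you are basically measuring the function call time (which is constant). Only when the memory latency becomes large enough to exceed that overhead (when you are accessing DRAM at 4 GB) do you actually start to get meaningful measurements.

You should switch to measuring the time of the whole loop (over count iterations) and divide by the number of iterations.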
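The whole-loop measurement suggested above can be sketched in pure Go. This is a minimal sketch under stated assumptions, not the original benchmark: it swaps in pointer chasing (each load's address depends on the previous load) so that accesses cannot overlap, it uses `time.Now` instead of RDTSC and therefore reports nanoseconds rather than cycles, and `buildChain` and `chase` are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// stride is the number of int64 values per 64-byte cache line (assumed).
const stride = 8

// buildChain returns n int64 values in which the first element of every
// cache line stores the index of another line's first element. The links
// form one random cycle through all lines (Sattolo's algorithm).
func buildChain(n int) []int64 {
	lines := n / stride
	perm := make([]int, lines)
	for i := range perm {
		perm[i] = i
	}
	for i := lines - 1; i > 0; i-- {
		j := rand.Intn(i) // j < i keeps the permutation a single cycle
		perm[i], perm[j] = perm[j], perm[i]
	}
	s := make([]int64, n)
	for i, p := range perm {
		s[i*stride] = int64(p * stride)
	}
	return s
}

// chase follows the chain for count steps, starting at index 0, and
// returns the final index so the loads cannot be optimized away.
func chase(s []int64, count int) int64 {
	var p int64
	for i := 0; i < count; i++ {
		p = s[p]
	}
	return p
}

func main() {
	const count = 1 << 22
	for np := uint(10); np < 23; np++ {
		n := 2 << np
		s := buildChain(n)
		start := time.Now() // one timer read for the whole loop
		end := chase(s, count)
		elapsed := time.Since(start)
		fmt.Printf("Allocated %d Kb. Avg time: %.2f ns (end index %d)\n",
			n/128, float64(elapsed.Nanoseconds())/count, end)
	}
}
```

Because the timer is read only twice per allocation size, its overhead is spread over all `count` accesses instead of being added to each one.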

Unable to reproduce CPU cache-miss
