Watching cache misses - a simple C++ cache benchmark

Time: 2012-02-23 11:56:17

Tags: c++ performance caching

I want to run a simple test to see the performance difference caused by cache misses.

I expect that running over an array X (where X fits in the cache) is faster than running over an array Y (where Y does not fit in the cache). In other words, I want to find the critical array size at which cache misses start to show up.

I made a simple function that accesses an array in a loop. I expected one level of performance for an arr_size that fits in the cache, and a different one for an arr_size that does not:

// compiled without optimizations -O0
float benchmark_cache(const size_t arr_size)
{
    unsigned char* arr_a = (unsigned char*) malloc(sizeof(char) * arr_size);
    unsigned char* arr_b = (unsigned char*) malloc(sizeof(char) * arr_size);
    assert( arr_a );
    assert( arr_b );

    long time0 = get_nsec();
    for( size_t i = 0; i < arr_size; ++i )
    {
        // index k will jump forth and back, to generate cache misses
        size_t k = (i / 2) + (i % 2) * arr_size / 2;
        arr_b[k] = arr_a[k] + 1;
    }
    long time_d = get_nsec() - time0;

    float performance = float(time_d) / arr_size;
    printf("perf %.1f [kB]: %d\n", performance, arr_size / 1024 );

    free(arr_a);
    free(arr_b);
    return performance;
}

long get_nsec()
{
    timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    return long(ts.tv_sec)*1000*1000 + ts.tv_nsec;
}

However, the measured time per element is essentially independent of arr_size, even for big sizes such as 20 MB. Why is that?


4 answers:

Answer 0 (score: 1)

It's hard to say exactly, but my guess is that the CPU's predictive, linear loading is helping you. That is, since you access the data in order, by the time you reach an uncached value the CPU has already started loading the next block of data. This loading essentially happens in parallel, so you may never really be waiting for a load.

I know you are trying to jump around, but the read/write order is still very close to linear: you just iterate over two blocks instead of one. Try using a cheap random number generator to jump around much more, for example as sketched below.
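
A minimal sketch of that idea (my illustration, not code from the answer): the hypothetical benchmark_cache_random below reuses the question's setup but scrambles the index with a small linear congruential generator, which the hardware prefetcher cannot follow. It assumes arr_size is a power of two so the mask works as a cheap modulo.

#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <ctime>

// Variant of the question's loop with a pseudo-random access pattern.
// With a multiplier of the form 4n+1 and an odd increment, the LCG visits
// every index of a power-of-two range exactly once, so the amount of work
// stays comparable to the original linear walk.
float benchmark_cache_random(const size_t arr_size)   // arr_size must be a power of two
{
    unsigned char* arr_a = (unsigned char*) malloc(arr_size);
    unsigned char* arr_b = (unsigned char*) malloc(arr_size);
    assert( arr_a && arr_b );

    timespec t0, t1;
    clock_gettime(CLOCK_REALTIME, &t0);

    size_t k = 1;
    for( size_t i = 0; i < arr_size; ++i )
    {
        k = (k * 1103515245u + 12345u) & (arr_size - 1);   // cheap LCG step + mask
        arr_b[k] = arr_a[k] + 1;
    }

    clock_gettime(CLOCK_REALTIME, &t1);
    long ns = long(t1.tv_sec - t0.tv_sec) * 1000000000L + (t1.tv_nsec - t0.tv_nsec);

    float performance = float(ns) / arr_size;
    printf("random perf %.1f ns/elem at %zu kB\n", performance, arr_size / 1024);

    free(arr_a);
    free(arr_b);
    return performance;
}

With such a pattern, the per-element cost should start climbing noticeably once arr_size exceeds the last-level cache.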

Also note that the % operation is relatively slow, so you may inadvertently be measuring its cost instead. Compiling without optimizations means the compiler probably really uses a mod instruction here rather than a mask. Try running the test with optimizations fully enabled.
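
For illustration (my reading of the suggestion, not code from the answer), the question's index computation can also be strength-reduced by hand, so that no division or modulo runs inside the timed loop even in an -O0 build:

#include <cstddef>

// Hypothetical rewrite of the inner loop from the question: a shift and a
// mask replace "/ 2" and "% 2"; the resulting index pattern is identical.
void touch_arrays(unsigned char* arr_a, unsigned char* arr_b, std::size_t arr_size)
{
    const std::size_t half = arr_size >> 1;             // arr_size / 2
    for (std::size_t i = 0; i < arr_size; ++i)
    {
        std::size_t k = (i >> 1) + (i & 1) * half;      // == (i / 2) + (i % 2) * arr_size / 2
        arr_b[k] = arr_a[k] + 1;
    }
}

An optimizing build (-O2) performs this transformation automatically, which is one more reason to benchmark with optimizations enabled.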

Also, make sure to pin the thread to a fixed CPU affinity and give it realtime priority (how to do this depends on your OS). That should limit any context-switching overhead.
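
On Linux, for example, pinning and boosting the current thread could look roughly like the sketch below (my example, not from the answer; the calls are Linux-specific, and SCHED_FIFO normally requires root or the CAP_SYS_NICE capability):

#include <pthread.h>
#include <sched.h>
#include <cstdio>

// Pin the calling thread to core 0 and request realtime FIFO scheduling,
// so the scheduler neither migrates the benchmark between cores nor
// preempts it for ordinary tasks.
void pin_and_boost_current_thread()
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                        // allow core 0 only
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        std::perror("sched_setaffinity");

    sched_param sp{};
    sp.sched_priority = sched_get_priority_max(SCHED_FIFO);
    int err = pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
    if (err != 0)
        std::fprintf(stderr, "pthread_setschedparam failed: %d\n", err);
}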

Answer 1 (score: 1)

Cache performance gains show up when you access the same locations multiple times without touching too many other locations in between. Here you access the allocated memory only once, so you won't see much of a cache effect.

Even if you change the code to access the whole array several times, the cache-handling logic tries to predict your accesses, and if the pattern is simple enough it usually succeeds. Linear forward access (even split into two halves) is very simple for it. See the sketch below.
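
A minimal sketch of "access the whole array several times" (my addition, not part of the answer): after the first pass, an array that fits in the cache is served from cache, while one that does not fit keeps being evicted and re-fetched.

#include <cstddef>
#include <cstdlib>

// Walk the same buffer `passes` times and accumulate it, so the loop has an
// observable result and is not optimized away.
unsigned long walk_repeatedly(std::size_t arr_size, int passes)
{
    unsigned char* arr = (unsigned char*) calloc(arr_size, 1);
    if (!arr)
        return 0;

    unsigned long sum = 0;
    for (int p = 0; p < passes; ++p)
        for (std::size_t i = 0; i < arr_size; ++i)
            sum += arr[i];

    free(arr);
    return sum;
}

As the answer points out, linear passes are still easy for the prefetcher, so combining repeated passes with a scrambled index (like the LCG sketch above) separates cache-size effects more clearly.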

Answer 2 (score: 0)

You should probably use a tool like cachegrind in cache-simulation mode to get reasonable results. Otherwise, cache behaviour is significantly affected by the context switches caused by the scheduler.

Answer 3 (score: 0)

I have just read what should I know about memory and played around with the benchmark example. Hopefully this helps someone:

#include <algorithm>
#include <cstdint>
#include <ctime>
#include <iostream>
#include <random>
#include <vector>

struct TimeLogger
{
    const char*   m_blockName;
    const clock_t m_start;

    TimeLogger(const char* blockName) : m_blockName(blockName), m_start(clock()) {}
    ~TimeLogger()                     
    {
        clock_t finish = clock();
        std::cout << "Done: " << m_blockName << " in " << (finish - m_start) * 1000.0 / CLOCKS_PER_SEC << " ms" << std::endl;
    }
};

const size_t k_ITERATIONS = 16;
const size_t k_SIZE = 1024 * 1024 * 16;

uint64_t test(const char* name, const std::vector<int64_t>& data, const std::vector<size_t>& indexes)
{
    TimeLogger log = name;

    uint64_t sum = 0;
    for (size_t i = 0; i < k_ITERATIONS; ++i)
        for (size_t index : indexes)
            sum += data[index];

    return sum;
}

// return shuffled sequences of consecutive numbers like [0,1,2, 6,7,8, 3,4,5, ...]
std::vector<size_t> fillSequences(size_t size, size_t seriesSize, std::mt19937 g)
{
    std::vector<size_t> semiRandIdx; 
    semiRandIdx.reserve(size);

    size_t i = 0;
    auto semiRandSequences = std::vector<size_t>(size / seriesSize, 0);
    std::generate(semiRandSequences.begin(), semiRandSequences.end(), [&i]() { return i++; });
    std::shuffle(semiRandSequences.begin(), semiRandSequences.end(), g);

    for (size_t seqNumber : semiRandSequences)
        for (size_t i = seqNumber * seriesSize; i < (seqNumber + 1) * seriesSize; ++i)
            semiRandIdx.push_back(i);

    return semiRandIdx;
}

int main()
{
    std::random_device rd;
    std::mt19937 g(rd());

    auto intData = std::vector<int64_t>(k_SIZE, 0);
    std::generate(intData.begin(), intData.end(), g);

    // [0, 1, 2, ... N]
    auto idx = std::vector<size_t>(k_SIZE, 0);
    std::generate(idx.begin(), idx.end(), []() {static size_t i = 0; return i++; });

    // [N, N-1, ... 0]
    auto reverseIdx = std::vector<size_t>(idx.rbegin(), idx.rend());

    // random permutation of [0, 1, ... N]
    auto randIdx = idx;
    std::shuffle(randIdx.begin(), randIdx.end(), g);

    // random permutations of 32, 64, 128-byte sequences
    auto seq32Idx  = fillSequences(idx.size(), 32  / sizeof(int64_t), g);
    auto seq64Idx  = fillSequences(idx.size(), 64  / sizeof(int64_t), g);
    auto seq128Idx = fillSequences(idx.size(), 128 / sizeof(int64_t), g);

    size_t dataSize  = intData.size() * sizeof(int64_t);
    size_t indexSize = idx.size() * sizeof(int64_t);
    std::cout << "vectors filled, data (MB): " << dataSize / 1024 / 1024.0 << "; index (MB): " << indexSize / 1024 / 1024.0
        << "; total (MB): " << (dataSize + indexSize) / 1024 / 1024.0 << std::endl << "Loops: " << k_ITERATIONS << std::endl;

    uint64_t sum1 = test("regular access", intData, idx);
    uint64_t sum2 = test("reverse access", intData, reverseIdx);
    uint64_t sum3 = test("random access", intData, randIdx);
    uint64_t sum4 = test("random 32-byte sequences", intData, seq32Idx);
    uint64_t sum5 = test("random 64-byte sequences", intData, seq64Idx);
    uint64_t sum6 = test("random 128-byte sequences", intData, seq128Idx);

    std::cout << sum1 << ", " << sum2 << ", " << sum3 << ", " << sum4 << ", " << sum5 << ", " << sum6 << std::endl;
    return 0;
} 

The curious thing is that the CPU's prefetcher handles reverse array access just as well. I found this out when comparing forward access times with reverse ones: on my PC the performance is the same.

Here are some results from a laptop with 2x32KB L1, 2x256KB L2 and 3MB L3 caches:

vectors filled, data (MB): 512; index (MB): 512; total (MB): 1024
Loops: 1
Done: regular access in 147 ms
Done: reverse access in 119 ms
Done: random access in 2943 ms
Done: random 32-byte sequences in 938 ms
Done: random 64-byte sequences in 618 ms
Done: random 128-byte sequences in 495 ms

...

vectors filled, data (MB): 4; index (MB): 4; total (MB): 8
Loops: 512
Done: regular access in 331 ms
Done: reverse access in 334 ms
Done: random access in 1961 ms
Done: random 32-byte sequences in 1099 ms
Done: random 64-byte sequences in 930 ms
Done: random 128-byte sequences in 824 ms

...

vectors filled, data (MB): 1; index (MB): 1; total (MB): 2
Loops: 2048
Done: regular access in 174 ms
Done: reverse access in 162 ms
Done: random access in 490 ms
Done: random 32-byte sequences in 318 ms
Done: random 64-byte sequences in 295 ms
Done: random 128-byte sequences in 257 ms

... 

vectors filled, data (MB): 0.125; index (MB): 0.125; total (MB): 0.25
Loops: 16384
Done: regular access in 148 ms
Done: reverse access in 139 ms
Done: random access in 210 ms
Done: random 32-byte sequences in 179 ms
Done: random 64-byte sequences in 166 ms
Done: random 128-byte sequences in 163 ms