Question

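My goal is to measure the effect of the (different) caches using simple code. I am following this article, specifically pages 20 and 21: https://people.freebsd.org/~lstewart/articles/cpumemory.pdf

I am working on 64-bit Linux. The L1d cache is 32K, L2 is 256K, and L3 is 25M.

This is my code (I compile this code with g++ with no flags):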

#include <cstddef>   // NULL
#include <iostream>

// ***********************************
// This is for measuring CPU clocks
#if defined(__i386__)
static __inline__ unsigned long long rdtsc(void)
{
    unsigned long long int x;
    __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
    return x;
}
#elif defined(__x86_64__)
static __inline__ unsigned long long rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}
#endif
// ***********************************


static const int ARRAY_SIZE = 100;

struct MyStruct {
    struct MyStruct *n;
};

int main() {
    MyStruct myS[ARRAY_SIZE];
    unsigned long long cpu_start, cpu_finish;

    // Initializing the array of structs, each element pointing to the next.
    // MyStruct holds only the link pointer, so there is no padding to fill.
    for (int i = 0; i < ARRAY_SIZE - 1; i++)
        myS[i].n = &myS[i + 1];
    myS[ARRAY_SIZE - 1].n = NULL;   // the last one

    // Filling the cache
    MyStruct *current = &myS[0];
    while ((current = current->n) != NULL)
        ;

    // Sequential access
    current = &myS[0];

    // For CPU usage in terms of clocks (ticks)
    cpu_start = rdtsc();

    while ((current = current->n) != NULL)
        ;

    cpu_finish = rdtsc();

    unsigned long long avg_cpu_clocks = (cpu_finish - cpu_start) / ARRAY_SIZE;

    std::cout << "Avg CPU Clocks:   " << avg_cpu_clocks << std::endl;
    return 0;
}

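I have two questions:

1- I vary ARRAY_SIZE from 1 to 1,000,000 (so the size of my array ranges from 2B to 2MB), but the average CPU clocks is always 10.

According to that PDF (Figure 3-10 on page 21), I would expect to get 3-5 clocks when the array fits completely in L1, and higher numbers (9 cycles) when it exceeds L1's size.

2- If I increase ARRAY_SIZE beyond 1,000,000, I get a segmentation fault (core dumped), which is due to stack overflow. My question is whether using dynamic allocation (MyStruct *myS = new MyStruct[ARRAY_SIZE]) incurs any performance penalty.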

Answer 1

"This is my code (I compile this code with g++ with no flags)"

If you don't pass -O3, then while ((current = current->n) != NULL) will be compiled into multiple memory accesses, not a single load instruction. By passing -O3, the loop will be compiled into:

.L3:
    mov rax, QWORD PTR [rax]
    test rax, rax
    jne .L3

This will run at 4 cycles per iteration, as you are expecting.

Note that you can use the __rdtsc compiler intrinsic instead of inline assembly. See: Get CPU cycle count?.
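
A minimal sketch of reading the counter through the compiler intrinsic rather than inline assembly, with the array held in a std::vector so that large sizes do not overflow the stack. The names Node, link_ring, chase, and ticks_per_hop are illustrative and not from the original post. The ring layout (last node pointing back to the first), the volatile sink, and the steady_clock fallback on non-x86 targets are assumptions made for this example. It is meant to be built with optimization (for example g++ -O3), so that each hop reduces to a single load.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>  // declares the __rdtsc() intrinsic on GCC and Clang
#endif

struct Node {
    Node *next;  // only the link, as in the question's MyStruct
};

// Read a tick counter: the TSC on x86, steady_clock ticks elsewhere.
// The non-x86 fallback is an assumption added so the sketch builds anywhere.
static inline std::uint64_t read_ticks() {
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();
#else
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}

// Link the nodes in index order; the last node points back to the first,
// so the chain can be followed for any number of hops.
void link_ring(std::vector<Node> &nodes) {
    const std::size_t count = nodes.size();
    for (std::size_t i = 0; i < count; ++i)
        nodes[i].next = &nodes[(i + 1) % count];
}

// Follow `hops` links starting from `start` and return where we end up.
Node *chase(Node *start, std::size_t hops) {
    Node *p = start;
    for (std::size_t i = 0; i < hops; ++i)
        p = p->next;
    return p;
}

// Average ticks per hop over a ring of `count` heap-allocated nodes.
double ticks_per_hop(std::size_t count, std::size_t hops) {
    assert(count > 0 && hops > 0);
    std::vector<Node> nodes(count);  // element storage is on the heap
    link_ring(nodes);

    // Storing the result in a volatile object keeps the optimizer from
    // discarding the otherwise side-effect-free loops.
    static Node *volatile sink;
    sink = chase(&nodes[0], count);  // warm-up pass to fill the cache

    const std::uint64_t start = read_ticks();
    sink = chase(&nodes[0], hops);
    const std::uint64_t finish = read_ticks();

    return static_cast<double>(finish - start) / static_cast<double>(hops);
}
```

A call such as ticks_per_hop(100, 1000000) gives the average for a 100-element ring. Repeating the call with larger counts, past the L1d, L2, and L3 sizes, is one way to run the experiment the question describes. The timed loop is the same pointer chase whether the nodes live on the stack or the heap.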
