CPU measurements (cache misses/hits) do not make sense

Time: 2015-05-16 01:00:20

Tags: c++ caching cpu performancecounter cpu-cache

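I am using Intel PCM for fine-grained CPU measurements. In my code, I am trying to measure cache efficiency.

Basically, I first bring a small array into the L1 cache (by traversing it many times), then I start the timer, traverse the array once more (which should hit the cache), and stop the timer.

PCM reports rather high L2 and L3 miss ratios. I also checked with rdtscp, and each array operation takes 15 cycles (much more than the 4-5 cycles needed to access the L1 cache).

What I expect is that the array sits entirely in the L1 cache, so I should not see high L1, L2 and L3 miss ratios.

My system has 32K, 256K and 25M of L1, L2 and L3 cache respectively. Here is my code: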

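#include <cstdlib>       // exit
#include <iostream>
#include "cpucounters.h" // Intel PCM

using namespace std;

static const int ARRAY_SIZE = 16;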

struct MyStruct {
    struct MyStruct *next;
    long int pad;
}; // each MyStruct is 16 bytes

int main() {
    PCM * m = PCM::getInstance();
    PCM::ErrorCode returnResult = m->program(PCM::DEFAULT_EVENTS, NULL);
    if (returnResult != PCM::Success){
        std::cerr << "Intel's PCM couldn't start" << std::endl;
        exit(1);
    }

    MyStruct *myS = new MyStruct[ARRAY_SIZE];

    // Make a sequential linked list
    for (int i=0; i < ARRAY_SIZE - 1; i++){
        myS[i].next = &myS[i + 1];
        myS[i].pad = (long int) i;
    }
    myS[ARRAY_SIZE - 1].next = NULL;
    myS[ARRAY_SIZE - 1].pad = (long int) (ARRAY_SIZE - 1);

    // Filling the cache
    MyStruct *current;
    for (int i = 0; i < 200000; i++){
        current = &myS[0];
        while ((current = current->next) != NULL)
            current->pad += 1;
    }

    // Sequential access experiment
    current = &myS[0];
    long sum = 0;

    SystemCounterState before = getSystemCounterState();

    while ((current = current->next) != NULL) {
        sum += current->pad;
    }

    SystemCounterState after = getSystemCounterState();

    cout << "Instructions per clock: " << getIPC(before, after) << endl;
    cout << "Cycles per op: " << getCycles(before, after) / ARRAY_SIZE << endl;
    cout << "L2 Misses:     " << getL2CacheMisses(before, after) << endl;
    cout << "L2 Hits:       " << getL2CacheHits(before, after) << endl; 
    cout << "L2 hit ratio:  " << getL2CacheHitRatio(before, after) << endl;
    cout << "L3 Misses:     " << getL3CacheMisses(before, after) << endl;
    cout << "L3 Hits:       " << getL3CacheHits(before, after) << endl;
    cout << "L3 hit ratio:  " << getL3CacheHitRatio(before, after) << endl;

    cout << "Sum:   " << sum << endl;
    m->cleanup();
    return 0;
}

Here is the output:

Instructions per clock: 0.408456
Cycles per op:        553074
L2 Cache Misses:      58775
L2 Cache Hits:        11371
L2 cache hit ratio:   0.162105
L3 Cache Misses:      24164
L3 Cache Hits:        34611
L3 cache hit ratio:   0.588873
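
As a sanity check on the expectation that the list fits in L1, the footprint can be computed directly. This is only a sketch: the 64-byte cache line size is an assumption (typical for x86, not stated above), and the helper names footprint_bytes, cache_lines and fits_in_cache are made up for illustration.

```cpp
#include <cstddef>

// Same layout as the struct in the question (16 bytes on a typical 64-bit Linux build).
struct MyStruct {
    MyStruct *next;
    long pad;
};

// Total bytes occupied by `count` contiguous MyStruct elements.
constexpr std::size_t footprint_bytes(std::size_t count) {
    return count * sizeof(MyStruct);
}

// Number of cache lines needed for `bytes`, assuming the data starts on a
// line boundary. The default of 64 bytes per line is an assumption.
constexpr std::size_t cache_lines(std::size_t bytes, std::size_t line = 64) {
    return (bytes + line - 1) / line;
}

// True if `bytes` is no larger than a cache of `cache_bytes`.
constexpr bool fits_in_cache(std::size_t bytes, std::size_t cache_bytes) {
    return bytes <= cache_bytes;
}
```

With ARRAY_SIZE = 16 and 16-byte elements this gives 256 bytes, i.e. 4 lines of 64 bytes, far below the 32K L1. If new[] does not return a line-aligned block, the array may straddle one additional line.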

Edit: I also checked the following code, and I still get the same miss ratios (I expected them to be almost zero):

SystemCounterState before = getSystemCounterState();
// this is just a comment
SystemCounterState after = getSystemCounterState();

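Edit 2: As the comments suggested, these results might be due to the overhead of the profiler itself. So instead of traversing the array only once, I changed the code to traverse it many times (200,000,000 times) to amortize the profiler's overhead. My L2 and L3 cache hit ratios are still very low (15%).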
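
The amortization idea in Edit 2 can be sketched roughly as follows. This is not the PCM-based code from the question: it uses std::chrono::steady_clock as a stand-in timer, Node mirrors MyStruct, the traversal starts at the head node, and ns_per_node is a hypothetical helper. It assumes a non-empty list and reps > 0.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Same layout as MyStruct in the question.
struct Node {
    Node *next;
    long pad;
};

// Link the elements into a sequential list terminated by nullptr.
void link_sequential(std::vector<Node> &nodes) {
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        nodes[i].next = (i + 1 < nodes.size()) ? &nodes[i + 1] : nullptr;
        nodes[i].pad = static_cast<long>(i);
    }
}

// Walk the list once from the head and sum the pad fields.
long traverse(const Node *head) {
    long sum = 0;
    for (const Node *cur = head; cur != nullptr; cur = cur->next)
        sum += cur->pad;
    return sum;
}

// Average nanoseconds per node visit over `reps` traversals, after
// subtracting the cost of an empty timed interval (timer overhead only).
// `sink` receives the accumulated sum so the work has a visible result.
double ns_per_node(const std::vector<Node> &nodes, long reps, long &sink) {
    using clock = std::chrono::steady_clock;
    using ns = std::chrono::duration<double, std::nano>;

    auto e0 = clock::now();
    auto e1 = clock::now();  // nothing measured in between

    auto t0 = clock::now();
    long total = 0;
    for (long r = 0; r < reps; ++r)
        total += traverse(nodes.data());
    auto t1 = clock::now();

    sink = total;
    double measured = ns(t1 - t0).count();
    double baseline = ns(e1 - e0).count();
    return (measured - baseline) / (static_cast<double>(reps) * nodes.size());
}
```

The fixed cost of starting and stopping the measurement is paid once and divided by reps * nodes.size(), so it shrinks as reps grows. An optimizing compiler may hoist or simplify the repeated traversal, so the generated code is worth checking before trusting the number.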

1 Answer:

Answer 0: (score: 4)

It seems that you are getting L2 and L3 misses from all cores on your system.

I looked at the PCM implementation here: https://github.com/erikarn/intel-pcm/blob/ecc0cf608dfd9366f4d2d9fa48dc821af1c26f33/src/cpucounters.cpp

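[1] In the implementation of PCM::program() on line 1407, I don't see any code that limits the events to a specific process.

[2] In the implementation of PCM::getSystemCounterState() on line 2809, you can see that the events are gathered from all cores on your system. So I would try setting the CPU affinity of the process to a single core and then reading events from that core only, using this function: CoreCounterState getCoreCounterState(uint32 core)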
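
A minimal sketch of the pinning step, assuming Linux with glibc (sched_setaffinity and sched_getaffinity); the helper names pin_to_core and first_allowed_core are made up for illustration. The PCM calls appear only in the trailing comment, on the assumption that the getters used in the question accept CoreCounterState the same way they accept SystemCounterState.

```cpp
#include <sched.h>  // sched_setaffinity, sched_getaffinity, CPU_* macros (Linux/glibc)

// Restrict the calling thread to `core`. Returns true on success.
bool pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}

// Lowest-numbered core the calling thread is currently allowed to run on,
// or -1 if the affinity mask cannot be read.
int first_allowed_core() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int c = 0; c < CPU_SETSIZE; ++c)
        if (CPU_ISSET(c, &set))
            return c;
    return -1;
}

// Intended use with PCM (not compiled here, needs cpucounters.h):
//   int core = first_allowed_core();
//   pin_to_core(core);
//   CoreCounterState before = getCoreCounterState(core);
//   /* ... traverse the list ... */
//   CoreCounterState after = getCoreCounterState(core);
//   cout << getL2CacheHitRatio(before, after) << endl;
```

Even when pinned, other processes and interrupts can still be scheduled on the same core, so the per-core counters are less noisy than the system-wide ones but not exclusive to this process.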