Question

我有3个内存块。

char block_a[1600]; // Initialized with random chars
unsigned short block_b[1600]; // Initialized with random shorts 0 - 1599 with no duplication
char block_c[1600]; // Initialized with 0

我正在对此

执行以下复制操作

for ( int i = 0; i < 1600; i++ ) {
    memcpy(block_c[i], block_a[block_b[i]], sizeof(block_a[0]); // Point # 1
}

现在我正在尝试测量上述操作的NS周期+时间，我在第1点进行操作。

测量环境

1）平台：Intel x86-64。核心i7
2）Linux内核3.8

测量算法

0）实现作为内核模块完成，以便我可以完全控制和精确数据 1）测量CPUID + MOV指令的开销，我将用于序列化。
2）禁用抢占+中断以获得CPU的独占访问权限 3）调用CPUID以确保管道在此之前没有乱序指令
4）调用RDTSC获取TSC的初始值并保存该值
5）执行我想测量的操作，我在上面提到了 6）调用RDTSCP获取TSC的结束值并保存该值
7）再次调用CPUID以确保没有任何内容以无序方式进入我们的两个RDTSC调用 8）从开始TSC值减去结束TSC值以获得执行该操作的CPU周期
9）减去2个MOVE指令所采用的开销周期，以获得最终的CPU周期。

码

    ....
    ....
    preempt_disable(); /* Disable preemption to avoid scheduling */
    raw_local_irq_save(flags); /* Disable the hard interrupts */
    /* CPU is ours now */
    __asm__ volatile (
        "CPUID\n\t"
        "RDTSC\n\t"
        "MOV %%EDX, %0\n\t"
        "MOV %%EAX, %1\n\t": "=r" (cycles_high_start), "=r" (cycles_low_start)::
        "%rax", "%rbx", "%rcx", "%rdx"
    );

    /*
     Measuring Point Start
    */
    memcpy(&shuffled_byte_array[idx], &random_byte_array[random_byte_seed[idx]], sizeof(random_byte_array[0]));
    /* 
    * Measuring Point End
    */
    __asm__ volatile (
        "RDTSCP\n\t"
        "MOV %%EDX, %0\n\t"
        "MOV %%EAX, %1\n\t"
        "CPUID\n\t": "=r" (cycles_high_end), "=r" (cycles_low_end)::
        "%rax", "%rbx", "%rcx", "%rdx"
    );

    /* Release CPU */
    raw_local_irq_restore(flags);
    preempt_enable();

    start = ( ((uint64_t)cycles_high_start << 32) | cycles_low_start);
    end   = ( ((uint64_t)cycles_high_end << 32) | cycles_low_end);
    if ( (end-start) >= overhead_cycles ) {
        total = ( (end-start) - overhead_cycles);
    } else {
        // We will consdider last total
    }

题

我得到的CPU周期测量似乎并不现实。给出了一些样品的结果

Cycles Time(NS)
0006 0005
0006 0005
0006 0005
0006 0005
0006 0005
0011 0009
0006 0005
0006 0005
0006 0005
0006 0005
0006 0005
0011 0009
0011 0009
0000 0000
0011 0009
0006 0005
0006 0005
0006 0005
0011 0009
0006 0005
0000 0000
0011 0009
0011 0009
0006 0005
0006 0005
0006 0005
0006 0005
0006 0005
0011 0009
0006 0005
0011 0009
0011 0009
0011 0009
0011 0009
0006 0005
0006 0005
0006 0005
0006 0005
0011 0009
0011 0009
0011 0009

如果我再次加载我的模块，请给出结果。

Cycles Time(NS)
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0006 0005
0006 0005
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0011 0009
0011 0009
0011 0009
0011 0009
0011 0009
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0017 0014
0011 0009
0011 0009
0000 0000
0000 0000
0000 0000
0011 0009
0000 0000
0000 0000
0011 0009
0011 0009
0011 0009
0000 0000
0022 0018
0006 0005
0011 0009
0006 0005
0006 0005
0104 0086
0104 0086
0011 0009
0011 0009
0011 0009
0006 0005
0006 0005
0017 0014
0017 0014
0022 0018
0022 0018
0022 0018
0017 0014
0011 0009
0022 0018
0011 0009
0006 0005
0011 0009
0006 0005
0006 0005
0006 0005
0011 0009
0011 0009
0011 0009
0011 0009
0011 0009
0006 0005
0006 0005
0011 0009
0006 0005
0022 0018
0011 0009
0028 0023
0006 0005
0006 0005
0022 0018
0006 0005
0022 0018
0006 0005
0011 0009
0006 0005
0011 0009
0006 0005
0000 0000
0006 0005
0017 0014
0011 0009
0022 0018
0000 0000
0011 0009
0006 0005
0011 0009
0022 0018
0006 0005
0022 0018
0011 0009
0022 0018
0022 0018
0011 0009
0006 0005
0011 0009
0011 0009
0006 0005
0011 0009
0126 0105
0006 0005
0022 0018
0000 0000
0022 0018
0006 0005
0017 0014
0011 0009
0022 0018
0011 0009
0006 0005
0006 0005
0011 0009

在上面的列表中，您会注意到有许多复制操作，我有0个CPU周期。我多次看到＆lt; 3个周期。

您认为获得0个CPU周期的原因是什么，或者对于memcpy操作来说很少？知道memcpy通常会占用多少CPU周期。

更新

我已经尝试并获得了结果 1）循环时间0 - 8如果我在重新启动后使用memcpy复制单个字节 2）循环时间0，如果我在重新启动后使用memcpy复制完整块 3）BIOS更改为单核（虽然此代码仅在单核上运行，但只是为了确保），对结果没有影响
4）尽管解决了这个问题，但禁用Intel SpeedStep的BIOS更改无效，为了获得最大可能的CPU周期，应禁用Intel SpeedStep以使CPU以最大频率工作。

Answer 1

看起来高速缓存是CPU周期不正确的原因（实际上它不是不正确的CPU周期，但在这种情况下也应考虑高速缓存性能测量以获得准确的结果）。确保给定数据的缓存清晰后，我的结果看起来很好。我添加了以下功能来清除缓存。 clflush函数在内核API中可用，它使用x86 CLFLUSH指令。

static void flush_cache(char random_byte_array[], char shuffled_byte_array[])
{
    unsigned int idx = 0;
    for ( idx = 0; idx < (MEM_BLOCK_SIZE/64); idx++ ) {
        clflush(random_byte_array+(idx*64));
    }
    for ( idx = 0; idx < (MEM_BLOCK_SIZE/64); idx++ ) {
        clflush(shuffled_byte_array+(idx*64));
    }
}

结果

memcpy在1600字节的完整内存块上 CPU Cycles = 216 - 260（对于多个测试＆gt;

1600字节块的单个字节的memcpy

Cycles Time (ns)
0159 0132
0000 0000
0000 0000
....
....
0049 0040
0049 0040
0049 0040
0000 0000
0000 0000
....
....

对于第一个元素（第0个索引）的memcpy，它需要大约140-160个周期，因为进行一些元素需要0-10个周期，（这是因为我猜数据是在缓存中加载的），在更多元素之后它需要大约140-160个元素（可能发生缓存未命中）

只要数据不在缓存中，我就会获得良好的CPU周期，但只要数据在缓存中，周期就不足以测量，可能还应考虑缓存性能测量。

测量memcpy在x86-64上的性能

1 个答案: