Understanding CPU caches with a C program

Time: 2013-06-16 23:13:00

Tags: c performance caching time

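I am trying to understand CPU caches and cache lines by writing a C program, as I do with most C concepts. The program I am using is shown below. I got the idea from this blog post: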

http://igoro.com/archive/gallery-of-processor-cache-effects/

The output of the program on my machine is shown below. This is the output with CFLAGS="-g -O0 -Wall".

./cache
CPU time for loop 1 0.460000 secs.
CPU time for loop 2 (j = 8) 0.050000 secs.
CPU time for loop 2 (j = 9) 0.050000 secs.
CPU time for loop 2 (j = 10) 0.050000 secs.
CPU time for loop 2 (j = 11) 0.050000 secs.
CPU time for loop 2 (j = 12) 0.040000 secs.
CPU time for loop 2 (j = 13) 0.050000 secs.
CPU time for loop 2 (j = 14) 0.050000 secs.
CPU time for loop 2 (j = 15) 0.040000 secs.
CPU time for loop 2 (j = 16) 0.050000 secs.
CPU time for loop 2 (j = 17) 0.040000 secs.
CPU time for loop 2 (j = 18) 0.050000 secs.
CPU time for loop 2 (j = 19) 0.040000 secs.
CPU time for loop 2 (j = 20) 0.040000 secs.
CPU time for loop 2 (j = 21) 0.040000 secs.
CPU time for loop 2 (j = 22) 0.040000 secs.
CPU time for loop 2 (j = 23) 0.040000 secs.
CPU time for loop 2 (j = 24) 0.030000 secs.
CPU time for loop 2 (j = 25) 0.040000 secs.
CPU time for loop 2 (j = 26) 0.030000 secs.
CPU time for loop 2 (j = 27) 0.040000 secs.
CPU time for loop 2 (j = 28) 0.030000 secs.
CPU time for loop 2 (j = 29) 0.040000 secs.
CPU time for loop 2 (j = 30) 0.030000 secs.
CPU time for loop 2 (j = 31) 0.030000 secs.

Output with optimization (CFLAGS="-g -O3 -Wall"):

CPU time for loop 1 0.130000 secs.
CPU time for loop 2 (j = 8) 0.040000 secs.
CPU time for loop 2 (j = 9) 0.050000 secs.
CPU time for loop 2 (j = 10) 0.050000 secs.
CPU time for loop 2 (j = 11) 0.040000 secs.
CPU time for loop 2 (j = 12) 0.040000 secs.
CPU time for loop 2 (j = 13) 0.050000 secs.
CPU time for loop 2 (j = 14) 0.050000 secs.
CPU time for loop 2 (j = 15) 0.040000 secs.
CPU time for loop 2 (j = 16) 0.040000 secs.
CPU time for loop 2 (j = 17) 0.050000 secs.
CPU time for loop 2 (j = 18) 0.040000 secs.
CPU time for loop 2 (j = 19) 0.050000 secs.
CPU time for loop 2 (j = 20) 0.040000 secs.
CPU time for loop 2 (j = 21) 0.040000 secs.
CPU time for loop 2 (j = 22) 0.040000 secs.
CPU time for loop 2 (j = 23) 0.030000 secs.
CPU time for loop 2 (j = 24) 0.040000 secs.
CPU time for loop 2 (j = 25) 0.030000 secs.
CPU time for loop 2 (j = 26) 0.040000 secs.
CPU time for loop 2 (j = 27) 0.030000 secs.
CPU time for loop 2 (j = 28) 0.030000 secs.
CPU time for loop 2 (j = 29) 0.030000 secs.
CPU time for loop 2 (j = 30) 0.030000 secs.
CPU time for loop 2 (j = 31) 0.030000 secs.

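The blog post states:

"The first loop multiplies every value in the array by 3, and the second loop multiplies only every 16th. The second loop only does about 6% of the work of the first loop, but on modern machines, the two for-loops take about the same time: 80 and 78 ms respectively on my machine."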

That does not seem to be the case on my machine. As you can see, the execution times are:

loop 1 is 0.46 seconds.

loop 2 is 0.03 seconds or 0.04 seconds or 0.05 seconds

for different values of j.

Why is this happening?

#include <stdio.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>
#include <stdlib.h>

#define MAX_SIZE (64*1024*1024)

int main()
{
    clock_t start, end;
    double cpu_time;
    int i = 0;
    int j = 0;
    /* MAX_SIZE array is too big for stack. This is an unfortunate rough edge of the way the stack works.
     It lives in a fixed-size buffer, set by the program executable's configuration according to the
     operating system, but its actual size is seldom checked against the available space. */
    /* int arr[MAX_SIZE]; */

    int *arr = (int*)malloc(MAX_SIZE * sizeof(int));

    /* CPU clock ticks count start */
    start = clock();

    /* Loop 1 */
    for (i = 0; i < MAX_SIZE; i++)
        arr[i] *= 3;

    /* CPU clock ticks count stop */
    end = clock();

    cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;

    printf("CPU time for loop 1 %.6f secs.\n", cpu_time);

    for (j = 8 ; j < 32 ; j++)
    {
        /* CPU clock ticks count start */
        start = clock();

        /* Loop 2 */
        for (i = 0; i < MAX_SIZE; i += j)
            arr[i] *= 3;

        /* CPU clock ticks count stop*/
        end = clock();

        cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;

        printf("CPU time for loop 2 (j = %d) %.6f secs.\n", j, cpu_time);
    }

    return 0;
}

1 answer:

Answer 0 (score: 3)

I've modified the code a bit. First, a summary of the modifications:

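  1. Made MAX_SIZE significantly larger, so that any change in timing reflects a real difference. (It now uses a full 2 GB of memory, so don't try this on a 32-bit OS.)
  2. Ran loop 1 several times. On my machine this makes a difference because the first run is slower, probably because malloc doesn't actually map the memory into the process's address space, so the first loop pays extra overhead for mapping the memory in. It also ensures the CPU is running at "full speed" rather than a "power-saving" speed when the other loops execute.
  3. Changed the j value faster in the second loop by multiplying it by 2 (j <<= 1 is the same as j *= 2 in this case; using shifts is an old habit).
  4. Used += 3 instead of *= 3. (Multiplication is a little slower than addition, but in this case it makes little difference.)
  5. Added a loop 3, which performs exactly the same number of operations as loop 2 but over a much smaller memory range (using & with a 2^n - 1 value to limit the range).
  6. I compiled the code with gcc -Wall -O3 -std=c99, using version 4.6.3, and ran it on a quad-core Athlon 965 with Fedora Core 16 x86-64 and 16 GB of RAM.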

Here's the modified code:

    #include <stdio.h>
    #include <sys/time.h>
    #include <time.h>
    #include <unistd.h>
    #include <stdlib.h>
    
    #define MAX_SIZE (512*1024*1024)
    
    int main()
    {
        clock_t start, end;
        double cpu_time;
        int i = 0;
        int j = 0;
        /* MAX_SIZE array is too big for stack.This is an unfortunate rough edge of the way the stack works.
           It lives in a fixed-size buffer, set by the program executable's configuration according to the
           operating system, but its actual size is seldom checked against the available space. */
        /* int arr[MAX_SIZE]; */
    
        int *arr = (int*)malloc(MAX_SIZE * sizeof(int));
        if (arr == NULL)
        {
            /* A 2 GB request can fail; stop rather than dereference NULL. */
            perror("malloc");
            return 1;
        }
    
    
        for(int k = 0; k < 3; k++)
        {
            start = clock();
    
            /* Loop 1 */
            for (i = 0; i < MAX_SIZE; i++)
                arr[i] += 3;
    
            /* CPU clock ticks count stop */
            end = clock();
    
            cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;
    
            printf("CPU time for loop 1 %.6f secs.\n", cpu_time);
        }
    
        for (j = 1 ; j <= 1024 ; j <<= 1)
        {
            /* CPU clock ticks count start */
            start = clock();
    
            /* Loop 2 */
            for (i = 0; i < MAX_SIZE; i += j)
                arr[i] += 3;
    
            /* CPU clock ticks count stop */
            end = clock();
    
            cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;
    
            printf("CPU time for loop 2 (j = %d) %.6f secs.\n", j, cpu_time);
        }
    
    
        // Third loop, performing the same operations as loop 2,
        // but only touching 16KB of memory
        for (j = 1 ; j <= 1024 ; j <<= 1)
        {
            /* CPU clock ticks count start */
            start = clock();
    
            /* Loop 3 */
            for (i = 0; i < MAX_SIZE; i += j)
                arr[i & 0xfff] += 3;
    
            /* CPU clock ticks count stop */
            end = clock();
    
            cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;
    
            printf("CPU time for loop 3 (j = %d) %.6f secs.\n", j, cpu_time);
        }
        return 0;
    }
    

Results:

    CPU time for loop 1 2.950000 secs.
    CPU time for loop 1 0.630000 secs.
    CPU time for loop 1 0.630000 secs.
    CPU time for loop 2 (j = 1) 0.780000 secs.
    CPU time for loop 2 (j = 2) 0.700000 secs.
    CPU time for loop 2 (j = 4) 0.610000 secs.
    CPU time for loop 2 (j = 8) 0.540000 secs.
    CPU time for loop 2 (j = 16) 0.560000 secs.
    CPU time for loop 2 (j = 32) 0.280000 secs.
    CPU time for loop 2 (j = 64) 0.140000 secs.
    CPU time for loop 2 (j = 128) 0.090000 secs.
    CPU time for loop 2 (j = 256) 0.060000 secs.
    CPU time for loop 2 (j = 512) 0.030000 secs.
    CPU time for loop 2 (j = 1024) 0.040000 secs.
    CPU time for loop 3 (j = 1) 0.470000 secs.
    CPU time for loop 3 (j = 2) 0.240000 secs.
    CPU time for loop 3 (j = 4) 0.120000 secs.
    CPU time for loop 3 (j = 8) 0.050000 secs.
    CPU time for loop 3 (j = 16) 0.030000 secs.
    CPU time for loop 3 (j = 32) 0.020000 secs.
    CPU time for loop 3 (j = 64) 0.010000 secs.
    CPU time for loop 3 (j = 128) 0.000000 secs.
    CPU time for loop 3 (j = 256) 0.000000 secs.
    CPU time for loop 3 (j = 512) 0.000000 secs.
    CPU time for loop 3 (j = 1024) 0.000000 secs.
    

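As you can see, the first few runs of loop 2 take roughly the same time. Once j reaches 32 the time starts to drop, because the processor no longer needs every cache line: assuming 4-byte ints and 64-byte cache lines, any stride up to 16 still touches every line, and each doubling beyond that halves the number of lines fetched. In the loop 3 case the data stays within a small range, so the number of operations in each loop directly determines the total time.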

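Edit:

Multiply (*= 3) versus add (+= 3) actually doesn't make that much difference, except in the loop 3 case, where multiplying adds about 30% to the loop time.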