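I am trying to understand CPU caches and cache lines by writing a C program, as I do with most C concepts. The program I am using is shown below. I got the idea from this blog post: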
http://igoro.com/archive/gallery-of-processor-cache-effects/
The output of the program on my machine is shown below. This is the output with CFLAGS="-g -O0 -Wall".
./cache
CPU time for loop 1 0.460000 secs.
CPU time for loop 2 (j = 8) 0.050000 secs.
CPU time for loop 2 (j = 9) 0.050000 secs.
CPU time for loop 2 (j = 10) 0.050000 secs.
CPU time for loop 2 (j = 11) 0.050000 secs.
CPU time for loop 2 (j = 12) 0.040000 secs.
CPU time for loop 2 (j = 13) 0.050000 secs.
CPU time for loop 2 (j = 14) 0.050000 secs.
CPU time for loop 2 (j = 15) 0.040000 secs.
CPU time for loop 2 (j = 16) 0.050000 secs.
CPU time for loop 2 (j = 17) 0.040000 secs.
CPU time for loop 2 (j = 18) 0.050000 secs.
CPU time for loop 2 (j = 19) 0.040000 secs.
CPU time for loop 2 (j = 20) 0.040000 secs.
CPU time for loop 2 (j = 21) 0.040000 secs.
CPU time for loop 2 (j = 22) 0.040000 secs.
CPU time for loop 2 (j = 23) 0.040000 secs.
CPU time for loop 2 (j = 24) 0.030000 secs.
CPU time for loop 2 (j = 25) 0.040000 secs.
CPU time for loop 2 (j = 26) 0.030000 secs.
CPU time for loop 2 (j = 27) 0.040000 secs.
CPU time for loop 2 (j = 28) 0.030000 secs.
CPU time for loop 2 (j = 29) 0.040000 secs.
CPU time for loop 2 (j = 30) 0.030000 secs.
CPU time for loop 2 (j = 31) 0.030000 secs.
Output with optimization (CFLAGS="-g -O3 -Wall"):
CPU time for loop 1 0.130000 secs.
CPU time for loop 2 (j = 8) 0.040000 secs.
CPU time for loop 2 (j = 9) 0.050000 secs.
CPU time for loop 2 (j = 10) 0.050000 secs.
CPU time for loop 2 (j = 11) 0.040000 secs.
CPU time for loop 2 (j = 12) 0.040000 secs.
CPU time for loop 2 (j = 13) 0.050000 secs.
CPU time for loop 2 (j = 14) 0.050000 secs.
CPU time for loop 2 (j = 15) 0.040000 secs.
CPU time for loop 2 (j = 16) 0.040000 secs.
CPU time for loop 2 (j = 17) 0.050000 secs.
CPU time for loop 2 (j = 18) 0.040000 secs.
CPU time for loop 2 (j = 19) 0.050000 secs.
CPU time for loop 2 (j = 20) 0.040000 secs.
CPU time for loop 2 (j = 21) 0.040000 secs.
CPU time for loop 2 (j = 22) 0.040000 secs.
CPU time for loop 2 (j = 23) 0.030000 secs.
CPU time for loop 2 (j = 24) 0.040000 secs.
CPU time for loop 2 (j = 25) 0.030000 secs.
CPU time for loop 2 (j = 26) 0.040000 secs.
CPU time for loop 2 (j = 27) 0.030000 secs.
CPU time for loop 2 (j = 28) 0.030000 secs.
CPU time for loop 2 (j = 29) 0.030000 secs.
CPU time for loop 2 (j = 30) 0.030000 secs.
CPU time for loop 2 (j = 31) 0.030000 secs.
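The blog states:
"The first loop multiplies every value in the array by 3, and the second loop multiplies only every 16th. The second loop only does about 6% of the work of the first loop, but on modern machines, the two for-loops take about the same time: 80 and 78 ms respectively on my machine."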
This does not seem to be the case on my machine. As you can see, at -O0 loop 1 takes 0.46 seconds, while loop 2 takes 0.03, 0.04, or 0.05 seconds for the different values of j.
Why is this so?
#include <stdio.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>
#include <stdlib.h>

#define MAX_SIZE (64*1024*1024)

int main()
{
    clock_t start, end;
    double cpu_time;
    int i = 0;
    int j = 0;
    /* MAX_SIZE array is too big for stack. This is an unfortunate rough edge of the way the stack works.
       It lives in a fixed-size buffer, set by the program executable's configuration according to the
       operating system, but its actual size is seldom checked against the available space. */
    /* int arr[MAX_SIZE]; */
    int *arr = (int*)malloc(MAX_SIZE * sizeof(int));

    /* CPU clock ticks count start */
    start = clock();
    /* Loop 1 */
    for (i = 0; i < MAX_SIZE; i++)
        arr[i] *= 3;
    /* CPU clock ticks count stop */
    end = clock();
    cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;
    printf("CPU time for loop 1 %.6f secs.\n", cpu_time);

    for (j = 8 ; j < 32 ; j++)
    {
        /* CPU clock ticks count start */
        start = clock();
        /* Loop 2 */
        for (i = 0; i < MAX_SIZE; i += j)
            arr[i] *= 3;
        /* CPU clock ticks count stop */
        end = clock();
        cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;
        printf("CPU time for loop 2 (j = %d) %.6f secs.\n", j, cpu_time);
    }
    return 0;
}
Answer 0 (score: 3)
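I modified the code a little. First, a summary of the modifications:
1. Loop 1 is now run three times as a warm-up. malloc does not actually map the memory into the process's address space, so the first pass pays extra overhead for mapping the memory in. The warm-up also makes sure the CPU is running at "full speed" rather than a "power save" speed when the other loops execute.
2. A wider range of j values is used (<<= 1 is the same as *= 2 in this case; using a shift is an old habit).
3. += 3 is used instead of *= 3. (Multiplication is a little slower than +=, but it makes little difference in this case.)
4. A loop3 is added, which performs exactly the same number of operations as loop2 but over a smaller range of memory (by using & with a 2^n - 1 value to limit the range).

I compiled the code with gcc -Wall -O3 -std=c99, using version 4.6.3, and ran it on a quad-core Athlon 965 under Fedora Core 16 x86-64 with 16 GB of RAM.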
Here is the full modified program:
#include <stdio.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>
#include <stdlib.h>

#define MAX_SIZE (512*1024*1024)

int main()
{
    clock_t start, end;
    double cpu_time;
    int i = 0;
    int j = 0;
    /* MAX_SIZE array is too big for stack. This is an unfortunate rough edge of the way the stack works.
       It lives in a fixed-size buffer, set by the program executable's configuration according to the
       operating system, but its actual size is seldom checked against the available space. */
    /* int arr[MAX_SIZE]; */
    int *arr = (int*)malloc(MAX_SIZE * sizeof(int));

    /* Loop 1 is run three times; the first pass acts as a warm-up */
    for(int k = 0; k < 3; k++)
    {
        /* CPU clock ticks count start */
        start = clock();
        /* Loop 1 */
        for (i = 0; i < MAX_SIZE; i++)
            arr[i] += 3;
        /* CPU clock ticks count stop */
        end = clock();
        cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;
        printf("CPU time for loop 1 %.6f secs.\n", cpu_time);
    }

    for (j = 1 ; j <= 1024 ; j <<= 1)
    {
        /* CPU clock ticks count start */
        start = clock();
        /* Loop 2 */
        for (i = 0; i < MAX_SIZE; i += j)
            arr[i] += 3;
        /* CPU clock ticks count stop */
        end = clock();
        cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;
        printf("CPU time for loop 2 (j = %d) %.6f secs.\n", j, cpu_time);
    }

    // Third loop, performing the same operations as loop 2,
    // but only touching 16KB of memory
    for (j = 1 ; j <= 1024 ; j <<= 1)
    {
        /* CPU clock ticks count start */
        start = clock();
        /* Loop 3 */
        for (i = 0; i < MAX_SIZE; i += j)
            arr[i & 0xfff] += 3;
        /* CPU clock ticks count stop */
        end = clock();
        cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;
        printf("CPU time for loop 3 (j = %d) %.6f secs.\n", j, cpu_time);
    }
    return 0;
}
Results:
CPU time for loop 1 2.950000 secs.
CPU time for loop 1 0.630000 secs.
CPU time for loop 1 0.630000 secs.
CPU time for loop 2 (j = 1) 0.780000 secs.
CPU time for loop 2 (j = 2) 0.700000 secs.
CPU time for loop 2 (j = 4) 0.610000 secs.
CPU time for loop 2 (j = 8) 0.540000 secs.
CPU time for loop 2 (j = 16) 0.560000 secs.
CPU time for loop 2 (j = 32) 0.280000 secs.
CPU time for loop 2 (j = 64) 0.140000 secs.
CPU time for loop 2 (j = 128) 0.090000 secs.
CPU time for loop 2 (j = 256) 0.060000 secs.
CPU time for loop 2 (j = 512) 0.030000 secs.
CPU time for loop 2 (j = 1024) 0.040000 secs.
CPU time for loop 3 (j = 1) 0.470000 secs.
CPU time for loop 3 (j = 2) 0.240000 secs.
CPU time for loop 3 (j = 4) 0.120000 secs.
CPU time for loop 3 (j = 8) 0.050000 secs.
CPU time for loop 3 (j = 16) 0.030000 secs.
CPU time for loop 3 (j = 32) 0.020000 secs.
CPU time for loop 3 (j = 64) 0.010000 secs.
CPU time for loop 3 (j = 128) 0.000000 secs.
CPU time for loop 3 (j = 256) 0.000000 secs.
CPU time for loop 3 (j = 512) 0.000000 secs.
CPU time for loop 3 (j = 1024) 0.000000 secs.
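As you can see, the first few loop2 runs take roughly the same time. Once the stride reaches 32, the time starts to drop, because the processor no longer needs every cache line. This is consistent with 64-byte cache lines, each holding 16 four-byte ints, so any stride up to 16 still touches every line. In the loop3 case, where the data stays within a small range of memory, the number of operations in each loop affects the total time directly.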
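Edit:
Multiplication (*= 3) versus addition (+= 3) actually makes little difference, except in the loop3 case, where multiplication adds about 30% to the loop time.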