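I read an article (about 1.5 years old: http://www.drdobbs.com/parallel/cache-friendly-code-solving-manycores-ne/240012736) that discusses cache performance and data size. It shows the following code, which the authors say they ran on an i7 (Sandy Bridge):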
static volatile int array[Size];
static void test_function(void)
{
for (int i = 0; i < Iterations; i++)
for (int x = 0; x < Size; x++)
array[x]++;
}
They claim that if they keep Size * Iterations constant and increase Size, execution time rises sharply (about 10x) once the array's size in memory grows beyond the L2 cache size.
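To make the claim concrete, here is a minimal sketch that maps a working-set size to the smallest cache level that could hold it. The capacities (32 KiB L1d, 256 KiB L2, 8 MiB L3) and the names smallest_fit and Level are assumptions chosen for illustration; they are not taken from the article.

```cpp
#include <cassert>
#include <cstddef>

// Assumed capacities in bytes (typical for a Sandy/Ivy Bridge desktop i7).
// These are illustrative assumptions, not values quoted from the article.
const std::size_t kL1dBytes = 32u * 1024u;
const std::size_t kL2Bytes  = 256u * 1024u;
const std::size_t kL3Bytes  = 8u * 1024u * 1024u;

enum class Level { L1, L2, L3, Ram };

// Smallest level whose capacity can hold `elements` items of `elem_size` bytes.
Level smallest_fit(std::size_t elements, std::size_t elem_size) {
    assert(elem_size > 0);
    const std::size_t bytes = elements * elem_size;
    if (bytes <= kL1dBytes) return Level::L1;
    if (bytes <= kL2Bytes)  return Level::L2;
    if (bytes <= kL3Bytes)  return Level::L3;
    return Level::Ram;
}
```

With 4-byte ints, smallest_fit(65536, 4) is still Level::L2 (exactly 256 KiB), while anything larger spills to L3, which is where the article expects the slowdown to begin.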
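As an exercise, I wanted to try this and see whether I could reproduce the results on my machine (i7 3770K, Win7, Visual C++ 2012 compiler, Win32 Debug mode, no optimizations enabled). To my surprise, I could not see any increase in execution time, even beyond the L3 cache size, which made me think the compiler was somehow optimizing this code. But I don't see any optimizations either. The only change in speed I see is that below the word size of my machine it takes slightly longer. Below are my timings, the code listing and the relevant disassembly.
Does anyone know:
1) Why does the time taken not increase regardless of the size of my array? Or how could I find out?
2) Why does the time taken start high and then decrease until the cache line size is reached? If the data is smaller than the line size, shouldn't more iterations be processed without reading from the cache?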
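One measurement detail worth checking: GetTickCount typically has a resolution of roughly 10 to 16 ms. That is small next to the ~3000 ms runs here, but it does add noise. A minimal sketch of a finer-grained timer based on std::chrono::steady_clock follows; the helper name time_ms is hypothetical and is not part of the code in this question.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: runs `work` once and returns the elapsed time in ms.
template <typename F>
double time_ms(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    assert(stop >= start);  // steady_clock never goes backwards
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

It could wrap a call such as test_body<SIZE, ITERATIONS>(array) inside a lambda, replacing the pair of GetTickCount calls.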
Timings:
Size=1,Iterations=1073741824, Time=3829
Size=2,Iterations=536870912, Time=2625
Size=4,Iterations=268435456, Time=2563
Size=16,Iterations=67108864, Time=2906
Size=32,Iterations=33554432, Time=3469
Size=64,Iterations=16777216, Time=3250
Size=256,Iterations=4194304, Time=3140
Size=1024,Iterations=1048576, Time=3110
Size=2048,Iterations=524288, Time=3187
Size=4096,Iterations=262144, Time=3078
Size=8192,Iterations=131072, Time=3125
Size=16384,Iterations=65536, Time=3109
Size=32768,Iterations=32768, Time=3078
Size=65536,Iterations=16384, Time=3078
Size=262144,Iterations=4096, Time=3172
Size=524288,Iterations=2048, Time=3109
Size=1048576,Iterations=1024, Time=3094
Size=2097152,Iterations=512, Time=3313
Size=4194304,Iterations=256, Time=3391
Size=8388608,Iterations=128, Time=3312
Size=33554432,Iterations=32, Time=3109
Size=134217728,Iterations=8, Time=3515
Size=536870912,Iterations=2, Time=3532
Code:
#include <cstdio> // printf
#include <string>
#include <cassert>
#include <windows.h>
template <unsigned int SIZE, unsigned int ITERATIONS>
static void test_body(volatile char* array)
{
for (unsigned int i = 0; i < ITERATIONS; i++)
{
for (unsigned int x = 0; x < SIZE; x++)
{
array[x]++;
}
}
}
template <unsigned int SIZE, unsigned int ITERATIONS>
static void test_function()
{
assert(SIZE*ITERATIONS == 1024*1024*1024);
static volatile char array[SIZE];
test_body<SIZE, 1>(array); //warmup
DWORD beginTime = GetTickCount();
test_body<SIZE, ITERATIONS>(array);
DWORD endTime= GetTickCount();
printf("Size=%u,Iterations=%u, Time=%d\n", SIZE,ITERATIONS, endTime-beginTime);
}
int main()
{
enum { eIterations= 1024*1024*1024};
test_function<1, eIterations>();
test_function<2, eIterations/2>();
test_function<4, eIterations/4>();
test_function<16, eIterations/16>();
test_function<32, eIterations/ 32>();
test_function<64, eIterations/ 64>();
test_function<256, eIterations/ 256>();
test_function<1024, eIterations/ 1024>();
test_function<2048, eIterations/ 2048>();
test_function<4096, eIterations/ 4096>();
test_function<8192, eIterations/ 8192>();
test_function<16384, eIterations/ 16384>();
test_function<32768, eIterations/ 32768>();
test_function<65536, eIterations/ 65536>();
test_function<262144, eIterations/ 262144>();
test_function<524288, eIterations/ 524288>();
test_function<1048576, eIterations/ 1048576>();
test_function<2097152, eIterations/ 2097152>();
test_function<4194304, eIterations/ 4194304>();
test_function<8388608, eIterations/ 8388608>();
test_function<33554432, eIterations/ 33554432>();
test_function<134217728, eIterations/ 134217728>();
test_function<536870912, eIterations/ 536870912>();
}
Disassembly:
for (unsigned int i = 0; i < ITERATIONS; i++)
00281A59 mov dword ptr [ebp-4],0
00281A60 jmp test_body<536870912,2>+1Bh (0281A6Bh)
00281A62 mov eax,dword ptr [ebp-4]
00281A65 add eax,1
00281A68 mov dword ptr [ebp-4],eax
00281A6B cmp dword ptr [ebp-4],2
00281A6F jae test_body<536870912,2>+53h (0281AA3h)
{
for (unsigned int x = 0; x < SIZE; x++)
00281A71 mov dword ptr [ebp-8],0
00281A78 jmp test_body<536870912,2>+33h (0281A83h)
00281A7A mov eax,dword ptr [ebp-8]
{
for (unsigned int x = 0; x < SIZE; x++)
00281A7D add eax,1
00281A80 mov dword ptr [ebp-8],eax
00281A83 cmp dword ptr [ebp-8],20000000h
00281A8A jae test_body<536870912,2>+51h (0281AA1h)
{
array[x]++;
00281A8C mov eax,dword ptr [array]
00281A8F add eax,dword ptr [ebp-8]
00281A92 mov cl,byte ptr [eax]
00281A94 add cl,1
00281A97 mov edx,dword ptr [array]
00281A9A add edx,dword ptr [ebp-8]
00281A9D mov byte ptr [edx],cl
}
00281A9F jmp test_body<536870912,2>+2Ah (0281A7Ah)
}
00281AA1 jmp test_body<536870912,2>+12h (0281A62h)
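Answer 0 (score: 6)
TL;DR: Your test is not a correct test of cache latency or speed. Instead, it measures some problems of pushing complex code through an out-of-order (OoO) CPU pipeline.
Use the right tests for measuring cache and memory latency: lat_mem_rd from lmbench. For speed (bandwidth) measurements, use the STREAM benchmark for memory speed, and the tests from memtest86 for cache speed (with rep movsl as the main operation).
Also, modern (2010 and newer) desktop/server CPUs have hardware prefetch logic built in near the L1 and L2 caches. It detects linear access patterns and preloads data from the outer caches into the inner ones before you ask for it: Intel Optimization Manual - 7.2 Hardware prefetching of data, page 365; intel.com blog, 2009. It is hard to disable all hardware prefetchers (SO Q/A 1, SO Q/A 2).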
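Latency tests usually defeat the prefetcher by making each load depend on the previous one and by visiting memory in an unpredictable order. The sketch below illustrates that general pointer-chasing idea; it is not the lat_mem_rd source, and the names make_cycle and chase are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Builds one random cycle over n slots (Sattolo's algorithm): next[i] is the
// slot to visit after i, and following it from slot 0 visits every slot
// exactly once before returning to 0.
std::vector<std::size_t> make_cycle(std::size_t n, unsigned seed) {
    assert(n > 0);
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), std::size_t{0});
    std::mt19937 rng(seed);
    for (std::size_t i = n - 1; i > 0; --i) {
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(rng)]);
    }
    return next;
}

// Performs `steps` dependent loads and returns the final position, so the
// compiler cannot discard the chain.
std::size_t chase(const std::vector<std::size_t>& next, std::size_t steps) {
    std::size_t pos = 0;
    for (std::size_t s = 0; s < steps; ++s) pos = next[pos];
    return pos;
}
```

Timing chase() for growing n and dividing by the number of steps gives an approximate per-load latency, because the address of each load is unknown until the previous load completes.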
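Long story:
I'll try to make several measurements of a similar test with the perf performance monitoring tool in Linux (also known as perf_events). The code is based on the program from Joky (an array of 32-bit ints, not chars) and was split into several binaries: a5 is for size 2^5 = 32; a10 => 2^10 = 1024 (4 KB); a15 => 2^15 = 32768; a20 (1 million ints = 4 MB); and a25 (32 million ints = 128 MB). The CPU is an i7-2600 quad-core Sandy Bridge at 3.4 GHz.
Let's start with a basic perf stat using the default event set (some lines are skipped). I selected 2^10 (4 KB) and 2^20 (4 MB):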
$ perf stat ./a10
Size=1024 ITERATIONS=1048576, TIME=2372.09 ms
Performance counter stats for './a10':
276 page-faults # 0,000 M/sec
8 238 473 169 cycles # 3,499 GHz
4 936 244 310 stalled-cycles-frontend # 59,92% frontend cycles idle
415 849 629 stalled-cycles-backend # 5,05% backend cycles idle
11 832 421 238 instructions # 1,44 insns per cycle
# 0,42 stalled cycles per insn
1 078 974 782 branches # 458,274 M/sec
1 080 091 branch-misses # 0,10% of all branches
$ perf stat ./a20
Size=1048576 ITERATIONS=1024, TIME=2432.4 ms
Performance counter stats for './a20':
2 321 page-faults # 0,001 M/sec
8 487 656 735 cycles # 3,499 GHz
5 184 295 720 stalled-cycles-frontend # 61,08% frontend cycles idle
663 245 253 stalled-cycles-backend # 7,81% backend cycles idle
11 836 712 988 instructions # 1,39 insns per cycle
# 0,44 stalled cycles per insn
1 077 257 745 branches # 444,104 M/sec
30 601 branch-misses # 0,00% of all branches
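What can we see here? The instruction counts are very close (because Size*Iterations is constant), and the cycle counts and times are close too. Both runs have 1 billion branches, with more than 99% predicted correctly. But there is a very high stall count for the frontend (60%) and 5-8% for the backend. Frontend stalls are stalls in instruction fetch and decode. It can be hard to tell why they occur, but for your code the frontend cannot decode 4 instructions per tick (see page B-41 of the Intel optimisation manual, section B.3 - "Performance tuning techniques for ... Sandy Bridge", subsection B.3.2 Hierarchical Top-Down Performance Characterization ...).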
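The derived figures that perf prints next to the raw counters can be rechecked by hand. A minimal sketch, with hypothetical helper names ipc and stall_percent, using the a10 counters shown above:

```cpp
#include <cassert>
#include <cstdint>

// Instructions retired per cycle, from two raw counter values.
double ipc(std::uint64_t instructions, std::uint64_t cycles) {
    assert(cycles > 0);
    return static_cast<double>(instructions) / static_cast<double>(cycles);
}

// Share of all cycles, in percent, attributed to a stall counter.
double stall_percent(std::uint64_t stalled, std::uint64_t cycles) {
    assert(cycles > 0);
    return 100.0 * static_cast<double>(stalled) / static_cast<double>(cycles);
}
```

For a10, ipc(11832421238, 8238473169) is about 1.44 and stall_percent(4936244310, 8238473169) is about 59.9, matching the perf output.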
$ perf record -e stalled-cycles-frontend ./a20
Size=1048576 ITERATIONS=1024, TIME=2477.65 ms
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.097 MB perf.data (~4245 samples) ]
$ perf annotate -d a20|cat
Percent | Source code & Disassembly of a20
------------------------------------------------
: 08048e6f <void test_body<1048576u, 1024u>(int volatile*)>:
10.43 : 8048e87: mov -0x8(%ebp),%eax
1.10 : 8048e8a: lea 0x0(,%eax,4),%edx
0.16 : 8048e91: mov 0x8(%ebp),%eax
0.78 : 8048e94: add %edx,%eax
6.87 : 8048e96: mov (%eax),%edx
52.53 : 8048e98: add $0x1,%edx
9.89 : 8048e9b: mov %edx,(%eax)
14.15 : 8048e9d: addl $0x1,-0x8(%ebp)
2.66 : 8048ea1: mov -0x8(%ebp),%eax
1.39 : 8048ea4: cmp $0xfffff,%eax
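Or here it is with raw opcodes (objdump -d). Some instructions have rather complicated indexing, so it is possible that they cannot be handled by the 3 simple decoders and have to wait for the only complex decoder (there is a picture at http://www.realworldtech.com/sandy-bridge/4/):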
8048e87: 8b 45 f8 mov -0x8(%ebp),%eax
8048e8a: 8d 14 85 00 00 00 00 lea 0x0(,%eax,4),%edx
8048e91: 8b 45 08 mov 0x8(%ebp),%eax
8048e94: 01 d0 add %edx,%eax
8048e96: 8b 10 mov (%eax),%edx
8048e98: 83 c2 01 add $0x1,%edx
8048e9b: 89 10 mov %edx,(%eax)
8048e9d: 83 45 f8 01 addl $0x1,-0x8(%ebp)
8048ea1: 8b 45 f8 mov -0x8(%ebp),%eax
8048ea4: 3d ff ff 0f 00 cmp $0xfffff,%eax
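Backend stalls are stalls caused by waiting for memory or cache (the thing you are interested in when measuring caches) and by stalls inside the execution core: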
$ perf record -e stalled-cycles-backend ./a20
Size=1048576 ITERATIONS=1024, TIME=2480.09 ms
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.095 MB perf.data (~4149 samples) ]
$ perf annotate -d a20|cat
4.25 : 8048e96: mov (%eax),%edx
58.68 : 8048e98: add $0x1,%edx
8.86 : 8048e9b: mov %edx,(%eax)
3.94 : 8048e9d: addl $0x1,-0x8(%ebp)
7.66 : 8048ea1: mov -0x8(%ebp),%eax
7.40 : 8048ea4: cmp $0xfffff,%eax
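Most backend stalls are reported for add $0x1,%edx because it is the consumer of the data loaded from the array by the previous instruction. Together with the store to the array, these account for 70% of backend stalls; multiplied by the backend's share of all stalls in the program (7%), that is about 5% of all stalls. In other words, the cache is faster than your program. Now we can answer your first question:
Why does the time taken not increase regardless of the size of my array?
Your test is so poor (unoptimized) that, although you are trying to measure the caches, they slow the total run time by only about 5%. Your test is also so unstable (noisy) that you will not see this 5% effect.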
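The 5% figure is just the product of two shares. A small sketch of that arithmetic, using the rounded values from this answer (70% and 7%); the function name overall_share is hypothetical:

```cpp
#include <cassert>

// Fraction of the whole run attributable to one kind of stall:
// (its share of backend stalls) x (backend stalls' share of the run).
double overall_share(double share_of_backend, double backend_share_of_run) {
    assert(share_of_backend >= 0.0 && share_of_backend <= 1.0);
    assert(backend_share_of_run >= 0.0 && backend_share_of_run <= 1.0);
    return share_of_backend * backend_share_of_run;
}
```

overall_share(0.70, 0.07) is 0.049, so even removing every array-related stall could speed this program up by only about 5%.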
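With custom perf stat runs we can also measure the cache request miss ratio. For the 4 KB program, the L1 data cache serves 99.99% of all loads and 99.999% of all stores. Note that your incorrect test generates several times more cache requests than are needed to walk the array and increment every element (1 billion loads + 1 billion stores). The additional accesses are for local variables such as x; they are always served by the cache because their addresses are constant.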
$ perf stat -e 'L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses' ./a10
Size=1024 ITERATIONS=1048576, TIME=2412.25 ms
Performance counter stats for './a10':
5 375 195 765 L1-dcache-loads
364 140 L1-dcache-load-misses # 0,01% of all L1-dcache hits
2 151 408 053 L1-dcache-stores
13 350 L1-dcache-store-misses
For the 4 MB program the hit rate is many times worse: over 100 times more misses! Now 1.2% of all memory requests are served not by L1 but by L2.
$ perf stat -e 'L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses' ./a20
Size=1048576 ITERATIONS=1024, TIME=2443.92 ms
Performance counter stats for './a20':
5 378 035 007 L1-dcache-loads
67 725 008 L1-dcache-load-misses # 1,26% of all L1-dcache hits
2 152 183 588 L1-dcache-stores
67 266 426 L1-dcache-store-misses
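Isn't this the case where we want to notice how cache latency rises from 4 CPU ticks to 12 (3 times longer), when this change affects only 1.2% of cache requests, and when only about 7% of our program's run time is sensitive to cache latency?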
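A rough way to see why the effect is so small is to average the two latencies weighted by the miss ratio. This is a simplified model that ignores overlap between loads and prefetching; it assumes the 4-tick and 12-tick latencies mentioned in this answer, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Fraction of accesses that missed, from two raw counter values.
double miss_ratio(std::uint64_t misses, std::uint64_t accesses) {
    assert(accesses > 0);
    return static_cast<double>(misses) / static_cast<double>(accesses);
}

// Average latency in cycles when a fraction `miss` of the accesses pays
// `miss_cycles` and the rest pays `hit_cycles`.
double average_latency(double hit_cycles, double miss_cycles, double miss) {
    assert(miss >= 0.0 && miss <= 1.0);
    return (1.0 - miss) * hit_cycles + miss * miss_cycles;
}
```

With the a20 counters (67 725 008 misses out of 5 378 035 007 loads), the average comes to about 4.1 cycles per load, against 4.0 for pure L1 hits.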
What if we use a bigger data set? OK, here is a25 (2^25 4-byte ints = 128 MB, several times larger than the cache):
$ perf stat ./a25
Size=134217728 ITERATIONS=8, TIME=2437.25 ms
Performance counter stats for './a25':
262 417 page-faults # 0,090 M/sec
10 214 588 827 cycles # 3,499 GHz
6 272 114 853 stalled-cycles-frontend # 61,40% frontend cycles idle
1 098 632 880 stalled-cycles-backend # 10,76% backend cycles idle
13 683 671 982 instructions # 1,34 insns per cycle
# 0,46 stalled cycles per insn
1 274 410 549 branches # 436,519 M/sec
315 656 branch-misses # 0,02% of all branches
$ perf stat -e 'L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses' ./a25
Size=134217728 ITERATIONS=8, TIME=2444.13 ms
Performance counter stats for './a25':
6 138 410 226 L1-dcache-loads
77 025 747 L1-dcache-load-misses # 1,25% of all L1-dcache hits
2 515 141 824 L1-dcache-stores
76 320 695 L1-dcache-store-misses
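Almost the same L1 miss rate, and more backend stalls. I was able to get statistics for the "cache-references,cache-misses" events, and I suppose they refer to the L3 cache (there are several times more requests to L2):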
$ perf stat -e 'cache-references,cache-misses' ./a25
Size=134217728 ITERATIONS=8, TIME=2440.71 ms
Performance counter stats for './a25':
17 053 482 cache-references
11 829 118 cache-misses # 69,365 % of all cache refs
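So the miss rate is high, but the test performs 1 billion (useful) loads, and only 0.08 billion of them miss L1. About 0.01 billion requests are served by memory. Memory latency is in the hundreds of CPU clocks, instead of the 4-clock L1 latency. Is the test able to see this? Maybe, if the noise is low.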
Answer 1 (score: 1)
Some results (OSX, Sandy Bridge):
Size=1 ITERATIONS=1073741824, TIME=2416.06 ms
Size=2 ITERATIONS=536870912, TIME=1885.46 ms
Size=4 ITERATIONS=268435456, TIME=1782.92 ms
Size=16 ITERATIONS=67108864, TIME=2023.71 ms
Size=32 ITERATIONS=33554432, TIME=2184.99 ms
Size=64 ITERATIONS=16777216, TIME=2464.09 ms
Size=256 ITERATIONS=4194304, TIME=2358.31 ms
Size=1024 ITERATIONS=1048576, TIME=2333.77 ms
Size=2048 ITERATIONS=524288, TIME=2340.16 ms
Size=4096 ITERATIONS=262144, TIME=2349.97 ms
Size=8192 ITERATIONS=131072, TIME=2346.96 ms
Size=16384 ITERATIONS=65536, TIME=2350.3 ms
Size=32768 ITERATIONS=32768, TIME=2348.71 ms
Size=65536 ITERATIONS=16384, TIME=2355.28 ms
Size=262144 ITERATIONS=4096, TIME=2358.97 ms
Size=524288 ITERATIONS=2048, TIME=2476.46 ms
Size=1048576 ITERATIONS=1024, TIME=2429.07 ms
Size=2097152 ITERATIONS=512, TIME=2427.09 ms
Size=4194304 ITERATIONS=256, TIME=2443.42 ms
Size=8388608 ITERATIONS=128, TIME=2435.54 ms
Size=33554432 ITERATIONS=32, TIME=2389.08 ms
Size=134217728 ITERATIONS=8, TIME=2444.43 ms
Size=536870912 ITERATIONS=2, TIME=2600.91 ms
Size=1 ITERATIONS=1073741824, TIME=2197.12 ms
Size=2 ITERATIONS=536870912, TIME=996.409 ms
Size=4 ITERATIONS=268435456, TIME=606.252 ms
Size=16 ITERATIONS=67108864, TIME=306.904 ms
Size=32 ITERATIONS=33554432, TIME=897.692 ms
Size=64 ITERATIONS=16777216, TIME=847.794 ms
Size=256 ITERATIONS=4194304, TIME=802.136 ms
Size=1024 ITERATIONS=1048576, TIME=761.971 ms
Size=2048 ITERATIONS=524288, TIME=760.136 ms
Size=4096 ITERATIONS=262144, TIME=759.149 ms
Size=8192 ITERATIONS=131072, TIME=749.881 ms
Size=16384 ITERATIONS=65536, TIME=756.672 ms
Size=32768 ITERATIONS=32768, TIME=759.565 ms
Size=65536 ITERATIONS=16384, TIME=754.81 ms
Size=262144 ITERATIONS=4096, TIME=745.899 ms
Size=524288 ITERATIONS=2048, TIME=749.527 ms
Size=1048576 ITERATIONS=1024, TIME=758.009 ms
Size=2097152 ITERATIONS=512, TIME=776.671 ms
Size=4194304 ITERATIONS=256, TIME=778.963 ms
Size=8388608 ITERATIONS=128, TIME=783.191 ms
Size=33554432 ITERATIONS=32, TIME=770.603 ms
Size=134217728 ITERATIONS=8, TIME=785.703 ms
Size=536870912 ITERATIONS=2, TIME=911.875 ms
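(Note how the first one is really slow; I suspect there may be a mis-speculation somewhere in store-to-load forwarding...)
Interestingly, enabling optimizations and removing volatile gives a better curve: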
Size=1 ITERATIONS=1073741824, TIME=0 ms
Size=2 ITERATIONS=536870912, TIME=0 ms
Size=4 ITERATIONS=268435456, TIME=0 ms
Size=16 ITERATIONS=67108864, TIME=0.001 ms
Size=32 ITERATIONS=33554432, TIME=125.581 ms
Size=64 ITERATIONS=16777216, TIME=140.654 ms
Size=256 ITERATIONS=4194304, TIME=217.559 ms
Size=1024 ITERATIONS=1048576, TIME=168.155 ms
Size=2048 ITERATIONS=524288, TIME=159.031 ms
Size=4096 ITERATIONS=262144, TIME=154.373 ms
Size=8192 ITERATIONS=131072, TIME=153.858 ms
Size=16384 ITERATIONS=65536, TIME=156.819 ms
Size=32768 ITERATIONS=32768, TIME=156.505 ms
Size=65536 ITERATIONS=16384, TIME=156.921 ms
Size=262144 ITERATIONS=4096, TIME=215.911 ms
Size=524288 ITERATIONS=2048, TIME=220.298 ms
Size=1048576 ITERATIONS=1024, TIME=235.648 ms
Size=2097152 ITERATIONS=512, TIME=320.284 ms
Size=4194304 ITERATIONS=256, TIME=409.433 ms
Size=8388608 ITERATIONS=128, TIME=431.743 ms
Size=33554432 ITERATIONS=32, TIME=429.436 ms
Size=134217728 ITERATIONS=8, TIME=430.052 ms
Size=536870912 ITERATIONS=2, TIME=535.773 ms
To help anyone reproduce the "problem", here is some standard (I hope) C++ code:
#include <string>
#include <iostream>
#include <chrono>
#include <cstdlib>
#include <memory>
template <unsigned int SIZE, unsigned int ITERATIONS>
void test_body(volatile int *array) {
for (int i = 0; i < ITERATIONS; i++)
{
for (int x = 0; x < SIZE; x++)
{
array[x]++;
}
}
}
template <unsigned int SIZE, unsigned int ITERATIONS>
static void test_function()
{
static_assert(SIZE*ITERATIONS == 1024*1024*1024, "SIZE MISMATCH");
std::unique_ptr<volatile int[]> array { new int[SIZE] };
// Warmup
test_body<SIZE, 1>(array.get());
auto start = std::chrono::steady_clock::now();
test_body<SIZE, ITERATIONS>(array.get());
auto end = std::chrono::steady_clock::now();
auto diff = end - start;
std::cout << "Size=" << SIZE << " ITERATIONS=" << ITERATIONS << ", TIME=" << std::chrono::duration <double, std::milli> (diff).count() << " ms" << std::endl;
}
int main()
{
enum { eIterations= 1024*1024*1024};
test_function<1, eIterations>();
test_function<2, eIterations/2>();
test_function<4, eIterations/4>();
test_function<16, eIterations/16>();
test_function<32, eIterations/ 32>();
test_function<64, eIterations/ 64>();
test_function<256, eIterations/ 256>();
test_function<1024, eIterations/ 1024>();
test_function<2048, eIterations/ 2048>();
test_function<4096, eIterations/ 4096>();
test_function<8192, eIterations/ 8192>();
test_function<16384, eIterations/ 16384>();
test_function<32768, eIterations/ 32768>();
test_function<65536, eIterations/ 65536>();
test_function<262144, eIterations/ 262144>();
test_function<524288, eIterations/ 524288>();
test_function<1048576, eIterations/ 1048576>();
test_function<2097152, eIterations/ 2097152>();
test_function<4194304, eIterations/ 4194304>();
test_function<8388608, eIterations/ 8388608>();
test_function<33554432, eIterations/ 33554432>();
test_function<134217728, eIterations/ 134217728>();
test_function<536870912, eIterations/ 536870912>();
}
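Answer 2 (score: 1)
It seems pretty clear that constant time means a constant instruction execution rate. To measure cache/RAM speed, data transfer instructions should predominate, and results need to be expressed in more than running time, for example MB/second and instructions per second. You need something like my BusSpeed benchmark (search for Roy BusSpeed benchmark or BusSpd2k for source code and results, with versions for Windows, Linux and Android). The original used assembly code with instructions like: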
"add edx,ecx" \
"mov ebx,[edi]" \
"mov ecx,ebx" \
"lp: and ebx,[edx]" \
"and ecx,[edx+4]" \
"and ebx,[edx+8]" \
"and ecx,[edx+12]" \
"and ebx,[edx+16]" \
"and ecx,[edx+20]" \
"and ebx,[edx+24]" \
"and ecx,[edx+28]" \
"and ebx,[edx+32]" \
"and ecx,[edx+36]" \
"and ebx,[edx+40]" \
To
"and ecx,[edx+236]" \
"and ebx,[edx+240]" \
"and ecx,[edx+244]" \
"and ebx,[edx+248]" \
"and ecx,[edx+252]" \
"add edx,256" \
"dec eax" \
"jnz lp" \
"and ebx,ecx" \
"mov [edi],ebx" \
Later versions use C, as follows:
void inc1word()
{
int i, j;
for(j=0; j<passes1; j++)
{
for (i=0; i<wordsToTest; i=i+64)
{
andsum1 = andsum1 & array[i ] & array[i+1 ] & array[i+2 ] & array[i+3 ]
& array[i+4 ] & array[i+5 ] & array[i+6 ] & array[i+7 ]
& array[i+8 ] & array[i+9 ] & array[i+10] & array[i+11]
& array[i+12] & array[i+13] & array[i+14] & array[i+15]
& array[i+16] & array[i+17] & array[i+18] & array[i+19]
& array[i+20] & array[i+21] & array[i+22] & array[i+23]
& array[i+24] & array[i+25] & array[i+26] & array[i+27]
& array[i+28] & array[i+29] & array[i+30] & array[i+31]
& array[i+32] & array[i+33] & array[i+34] & array[i+35]
& array[i+36] & array[i+37] & array[i+38] & array[i+39]
& array[i+40] & array[i+41] & array[i+42] & array[i+43]
& array[i+44] & array[i+45] & array[i+46] & array[i+47]
& array[i+48] & array[i+49] & array[i+50] & array[i+51]
& array[i+52] & array[i+53] & array[i+54] & array[i+55]
& array[i+56] & array[i+57] & array[i+58] & array[i+59]
& array[i+60] & array[i+61] & array[i+62] & array[i+63];
}
}
}
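The benchmark measures MB/second for caches and RAM, including skipped sequential addressing to see where data is read in bursts. Example results are below. Note the burst-reading effects, and that reading into two different registers (Reg2, from the assembly code version) can be faster than reading into one. In this case, loading every word into one register (AndI, Reg1, Inc4 bytes) produces almost constant speeds (around 1400 MIPS). So even a long sequence of instructions might not suit a particular pipeline. The way to find out is to run a wider variation of your tests.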
#########################################################################
Intel(R) Core(TM) i7 CPU 930 @ 2.80GHz Measured 2807 MHz
Windows Bus Speed Test Version 2.2 by Roy Longbottom
Minimum 0.100 seconds per test, Start Fri Jul 30 16:43:56 2010
          MovI   MovI   MovI   MovI   MovI   MovI   AndI   AndI   MovM   MovM
 Memory   Reg2   Reg2   Reg2   Reg2   Reg1   Reg2   Reg1   Reg2   Reg1   Reg8
 KBytes  Inc64  Inc32  Inc16   Inc8   Inc4   Inc4   Inc4   Inc4   Inc8   Inc8
   Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S
      4  10025  10800  11262  11498  11612  11634   5850  11635  23093  23090
      8  10807  11267  11505  11627  11694  11694   5871  11694  23299  23297
     16  11251  11488  11620  11614  11712  11719   5873  11718  23391  23398
     32   9893   9853  10890  11170  11558  11492   5872  11466  21032  21025
     64   3219   4620   7289   9479  10805  10805   5875  10797  14426  14426
    128   3213   4805   7305   9467  10811  10810   5875  10805  14442  14408
    256   3144   4592   7231   9445  10759  10733   5870  10743  14336  14337
    512   2005   3497   5980   9056  10466  10467   5871  10441  13906  13905
   1024   2003   3482   5974   9017  10468  10466   5874  10467  13896  13818
   2048   2004   3497   5958   9088  10447  10448   5870  10447  13857  13857
   4096   1963   3398   5778   8870  10328  10328   5851  10328  13591  13630
   8192   1729   3045   5322   8270   9977   9963   5728   9965  12923  12892
  16384    692   1402   2495   4593   7811   7782   5406   7848   8335   8337
  32768    695   1406   2492   4584   7820   7826   5401   7792   8317   8322
  65536    695   1414   2488   4584   7823   7826   5403   7800   8321   8321
 131072    696   1402   2491   4575   7827   7824   5411   7846   8322   8323
 262144    696   1413   2498   4594   7791   7826   5409   7829   8333   8334
 524288    693   1416   2498   4595   7841   7842   5411   7847   8319   8285
1048576    704   1415   2478   4591   7845   7840   5410   7853   8290   8283
End of test Fri Jul 30 16:44:29 2010
MovM uses 1 and 8 MMX registers; later versions use SSE.
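As a rough illustration of what the strided "IncNN" columns measure, the sketch below reads every N-th word of an array and converts a byte count and an elapsed time into MB/s. It is not the BusSpeed source: the names and_strided and mb_per_second are hypothetical, the stride is expressed in 4-byte words as an assumption, and 1 MB is taken as 1,000,000 bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// ANDs together every `stride_words`-th element. A larger stride touches fewer
// words per cache line, which exposes burst (whole-line) transfers.
std::uint32_t and_strided(const std::vector<std::uint32_t>& data,
                          std::size_t stride_words) {
    assert(stride_words > 0);
    std::uint32_t acc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < data.size(); i += stride_words) acc &= data[i];
    return acc;
}

// Converts `bytes` moved in `seconds` into MB/s (1 MB = 1,000,000 bytes here).
double mb_per_second(double bytes, double seconds) {
    assert(seconds > 0.0);
    return bytes / seconds / 1.0e6;
}
```

Returning the accumulated value keeps the compiler from removing the loads, which is the same role andsum1 plays in the C version above.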
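Source code and executables are free for anyone to use. The files are at the locations below, where the array declarations are also shown:
Windows: http://www.roylongbottom.org.uk/busspd2k.zip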
xx = (int *)VirtualAlloc(NULL, useMemK*1024+256, MEM_COMMIT, PAGE_READWRITE);
Linux: http://www.roylongbottom.org.uk/memory_benchmarks.tar.gz
#ifdef Bits64
array = (long long *)_mm_malloc(memoryKBytes[ipass-1]*1024, 16);
#else
array = (int *)_mm_malloc(memoryKBytes[ipass-1]*1024, 16);
#endif
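Results and other links (MP version, Android) are at:
Answer 3 (score: 0)
I don't get constant times. I modified your code slightly to make it simpler. My times are much lower than yours; I'm not sure why. The large times at the beginning make sense: with only a few values to write to, the increments form a dependency chain. The L2 cache ends at 256k/4 = 64k ints. Notice how the times start to rise between size=32768 and 65536.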
//GCC -O3 Intel(R) Xeon(R) CPU E5-1620 0 @ 3.60GHz
Size=1, Iterations=1073741824, Time=187.18 ms
Size=2, Iterations=536870912, Time=113.47 ms
Size=4, Iterations=268435456, Time=50.53 ms
Size=8, Iterations=134217728, Time=25.02 ms
Size=16, Iterations=67108864, Time=25.61 ms
Size=32, Iterations=33554432, Time=24.08 ms
Size=64, Iterations=16777216, Time=22.69 ms
Size=128, Iterations=8388608, Time=22.03 ms
Size=256, Iterations=4194304, Time=19.98 ms
Size=512, Iterations=2097152, Time=17.09 ms
Size=1024, Iterations=1048576, Time=15.66 ms
Size=2048, Iterations=524288, Time=14.94 ms
Size=4096, Iterations=262144, Time=14.58 ms
Size=8192, Iterations=131072, Time=14.40 ms
Size=16384, Iterations=65536, Time=14.63 ms
Size=32768, Iterations=32768, Time=14.75 ms
Size=65536, Iterations=16384, Time=18.58 ms
Size=131072, Iterations=8192, Time=20.51 ms
Size=262144, Iterations=4096, Time=21.18 ms
Size=524288, Iterations=2048, Time=21.26 ms
Size=1048576, Iterations=1024, Time=21.22 ms
Size=2097152, Iterations=512, Time=22.17 ms
Size=4194304, Iterations=256, Time=38.01 ms
Size=8388608, Iterations=128, Time=38.63 ms
Size=16777216, Iterations=64, Time=38.09 ms
Size=33554432, Iterations=32, Time=38.54 ms
Size=67108864, Iterations=16, Time=39.11 ms
Size=134217728, Iterations=8, Time=39.96 ms
Size=268435456, Iterations=4, Time=42.15 ms
Size=536870912, Iterations=2, Time=46.39 ms
Code:
#include <stdio.h>
#include <omp.h>
static void test_function(int n, int iterations)
{
int *array = new int[n];
for (int i = 0; i < iterations; i++)
for (int x = 0; x < n; x++)
array[x]++;
delete[] array;
}
int main() {
for(int i=0, n=1, iterations=1073741824; i<30; i++, n*=2, iterations/=2) {
double dtime;
dtime = omp_get_wtime();
test_function(n, iterations);
dtime = omp_get_wtime() - dtime;
printf("Size=%d, Iterations=%d, Time=%.3f\n", n, iterations, dtime);
}
}