我的CPU的规格说它应该为内存带来5.336GB / s的带宽。为了测试这个,我写了一个简单的程序,在一个大数组上运行memset(或memcpy)并报告时间。我在memset上显示3.8GB / s,在memcpy上显示1.9GB / s。 http://en.wikipedia.org/wiki/Intel_Core_(microarchitecture)说我的Q9400应该达到5.336MB / s。怎么了?
我尝试用赋值循环替换memset或memcpy。我已经google了一下,试图了解内存对齐情况。我尝试过不同的编译器标志。我花了很多时间在这上面尴尬。感谢您提供的任何帮助!
我正在使用Ubuntu 12.04和libc-dev版本2.15-0ubuntu10.5和内核3.8.0-37-generic
代码:
#include <stdio.h>
#include <time.h>
#include <string.h>
#include <stdlib.h>
#define numBytes ((long)(1024*1024*1024))
#define numTransfers ((long)(8))
int main(int argc,char**argv){
if(argc!=3){
printf("Usage: %s BLOCK_SIZE_IN_BYTES NUMBER_OF_BLOCKS_TO_TRANSFER\n",argv[0]);
return -1;
}
char*__restrict__ source=(char*)malloc(numBytes);
char*__restrict__ dest=(char*)malloc(numBytes);
struct timespec start,end;
long totalTimeMs;
int i;
clock_gettime(CLOCK_MONOTONIC_RAW,&start);
for(i=0;i<numTransfers;++i)
memset(source,0,numBytes);
clock_gettime(CLOCK_MONOTONIC_RAW,&end);
totalTimeMs=(end.tv_nsec-start.tv_nsec)*.000001+1000*(end.tv_sec-start.tv_sec);
printf("memset %ld bytes %ld times (%.2fGB total) in %ldms (%.3fGB/s). ",numBytes,numTransfers,numBytes/1024.0/1024/1024*numTransfers,totalTimeMs,numBytes/1024.0/1024/1024*1000*numTransfers/totalTimeMs);
clock_gettime(CLOCK_MONOTONIC_RAW,&start);
for(i=0;i<numTransfers;++i)
memcpy( dest, source, numBytes);
clock_gettime(CLOCK_MONOTONIC_RAW,&end);
totalTimeMs=(end.tv_nsec-start.tv_nsec)*.000001+1000*(end.tv_sec-start.tv_sec);
printf("memcpy %ld bytes %ld times (%.2fGB total) in %ldms (%.3fGB/s).\n",numBytes,numTransfers,numBytes/1024.0/1024/1024*numTransfers,totalTimeMs,numBytes/1024.0/1024/1024*1000*numTransfers/totalTimeMs);
free(source);
free(dest);
return EXIT_SUCCESS;
}
编译命令:
gcc -O3 -DNDEBUG -o memcpyStackOverflowNoParameters.c.o -c memcpyStackOverflowNoParameters.c
gcc -O3 -DNDEBUG memcpyStackOverflowNoParameters.c.o -o memcpy -rdynamic -lrt
示例输出:
memset 1073741824 bytes 8 times (8.00GB total) in 2214ms (3.880GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4466ms (1.923GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2218ms (3.873GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4557ms (1.885GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2222ms (3.866GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4433ms (1.938GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2216ms (3.876GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4521ms (1.900GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2217ms (3.875GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4520ms (1.900GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2218ms (3.873GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4430ms (1.939GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2226ms (3.859GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4444ms (1.933GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2225ms (3.861GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4485ms (1.915GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2620ms (3.279GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4855ms (1.769GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2535ms (3.389GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4870ms (1.764GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2423ms (3.545GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4905ms (1.751GB/s).
我的硬件根据lshw:
product: OptiPlex 960 ()
vendor: Winbond Electronics
width: 64 bits
*-core
description: Motherboard
product: 0Y958C
vendor: Winbond Electronics
*-firmware
description: BIOS
capabilities: pci pnp apm upgrade shadowing escd cdboot bootselect edd int13floppytoshiba int13floppy720 int5printscreen int9keyboard int14serial int17printer acpi usb biosbootspecification netboot
*-cpu
product: Intel(R) Core(TM)2 Quad CPU Q9400 @ 2.66GHz
physical id: 400
size: 2666MHz
width: 64 bits
clock: 1333MHz
capabilities: x86-64 fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm dtherm tpr_shadow vnmi flexpriority
configuration: cores=4 enabledcores=4 threads=4
*-cache:0
description: L1 cache
physical id: 700
size: 256KiB
capacity: 256KiB
capabilities: internal write-back unified
*-cache:1
description: L2 cache
physical id: 701
size: 6MiB
capacity: 6MiB
capabilities: internal varies unified
*-memory
description: System Memory
physical id: 1000
slot: System board or motherboard
size: 4GiB
*-bank:0
description: DIMM DDR2 Synchronous 667 MHz (1.5 ns)
product: CT51264AA667.M16FC
vendor: 7F7F7F7F7F9B0000
slot: DIMM_1
size: 4GiB
width: 64 bits
clock: 667MHz (1.5ns)
*-bank:1
description: DIMM DDR2 Synchronous 667 MHz (1.5 ns) [empty]
*-bank:2
description: DIMM DDR2 Synchronous 667 MHz (1.5 ns) [empty]
*-bank:3
description: DIMM DDR2 Synchronous 667 MHz (1.5 ns) [empty]
答案 0 :(得分:5)
内存地址被“虚拟化”,程序使用的地址被转换为真实地址。这种转换使得有可能将您的程序看作连续内存的内容从当时的任何内容中分配出来。每个通用CPU都这样做。转换需要表查找,这需要内存访问。 CPU有缓存,但长时间的虚拟地址很容易打破它的缓存,即“TLB”(“转换后备缓冲区”)。因此,每隔4KB(Linux系统上的2MB就能找到你正在做的事情),CPU会停止寻找真正发送内存流量的地方。这些摊位可能需要相当多的时间。您可以尝试运行两个基准测试副本,TLB未命中的情况似乎是合理的,并且您将获得更接近您的额定容量的总带宽。
(编辑:嗯,你可能想用
替换你的#define
size_t numBytes=atoi(argv[1]);
size_t numTransfers=atoi(argv[2]);
在主体......)
编辑:顺便说一句:我在我的盒子上看到的测试带宽(并在评论中报告)远低于我的cpu的额定容量,它让我调查了我自己的系统。我的盒子构建器在这些插槽中放置了真正的废话。我早已用一个知名品牌取代它们,报告的吞吐量增加了一倍以上,并且非常明显地提高了我的机器的性能。
答案 1 :(得分:3)
最后我检查了memcpy和memset没有在GCC中进行优化。这仍然是真的in 2012。参见Agner Fog的Optimizing software in C++部分2.6 2.6“功能库的选择”和表2.1。他比较了几种不同的编译器和操作系统。
GCC内置了用于执行memcpy的功能。显然,它们比Glib中的memcpy更糟糕。据我所知,GCC开发人员和Glib开发人员独立工作。要从Glib获取库,您需要使用-fno-builtin
。然而,虽然Glib(或者至少是)更好,但它仍然不是最佳的。为了获得最佳效果,请使用Agner Fog的asmlib。他优化了memcpy和memset以及汇编中的许多其他常用功能,以便在其他优化中利用SSE和AVX。