Can anyone give me the approximate times (in nanoseconds) to access the L1, L2 and L3 caches, as well as main memory, on Intel i7 processors?
While this is not specifically a programming question, knowing these speed details is necessary for some low-latency programming challenges.
Answer 0: (Score: 173)
Numbers everyone should know:
        0.5 ns - CPU L1 dCACHE reference
          1 ns - speed-of-light (a photon) travels a 1 ft (30.5cm) distance
          5 ns - CPU L1 iCACHE Branch mispredict
          7 ns - CPU L2  CACHE reference
         71 ns - CPU cross-QPI/NUMA best case on XEON E5-46*
        100 ns - MUTEX lock/unlock
        100 ns - own DDR MEMORY reference
        135 ns - CPU cross-QPI/NUMA best case on XEON E7-*
        202 ns - CPU cross-QPI/NUMA worst case on XEON E7-*
        325 ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
     10,000 ns - Compress 1K bytes with Zippy PROCESS
     20,000 ns - Send 2K bytes over 1 Gbps NETWORK
    250,000 ns - Read 1 MB sequentially from MEMORY
    500,000 ns - Round trip within a same DataCenter
 10,000,000 ns - DISK seek
 10,000,000 ns - Read 1 MB sequentially from NETWORK
 30,000,000 ns - Read 1 MB sequentially from DISK
150,000,000 ns - Send a NETWORK packet CA -> Netherlands
  |   |   |   |
  |   |   | ns|
  |   | us|
  | ms|
From:
Originally by Peter Norvig:
- http://norvig.com/21-days.html#answers
- http://surana.wordpress.com/2009/01/01/numbers-everyone-should-know/,
- http://sites.google.com/site/io/building-scalable-web-applications-with-google-app-engine
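If you want to reproduce these numbers on your own machine, the classic measurement is a dependent-load ("pointer-chasing") loop over a randomly shuffled working set: each load depends on the previous one, so the hardware prefetcher cannot hide the latency. Below is a minimal C sketch under stated assumptions (POSIX clock_gettime, a 64-bit build, and working-set sizes picked around typical i7 cache sizes of 32 KB L1 / 256 KB L2 / 8 MB L3); expect ballpark figures only, as frequency scaling, TLB misses and SMT noise all shift the results.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Average latency of one dependent load over a shuffled cyclic chain
   spanning 'bytes' of memory; the random order defeats the prefetcher.
   ( assumes RAND_MAX is large enough for the chain length ) */
static double chase_ns( size_t bytes, size_t iters )
{
    size_t   n     = bytes / sizeof( void * );
    void   **chain = malloc( n * sizeof( void * ) );
    size_t  *order = malloc( n * sizeof( size_t ) ), i, j, t;

    for ( i = 0; i < n; i++ ) order[i] = i;
    for ( i = n - 1; i > 0; i-- ) {                /* Fisher-Yates shuffle */
        j = (size_t) rand() % ( i + 1 );
        t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for ( i = 0; i < n; i++ )                      /* one big random cycle */
        chain[order[i]] = &chain[order[( i + 1 ) % n]];

    void **p = chain;
    struct timespec t0, t1;
    clock_gettime( CLOCK_MONOTONIC, &t0 );
    for ( i = 0; i < iters; i++ )
        p = (void **) *p;                          /* serialised loads     */
    clock_gettime( CLOCK_MONOTONIC, &t1 );

    volatile void *sink = p; (void) sink;          /* keep the loop alive  */
    free( chain ); free( order );
    return ( ( t1.tv_sec - t0.tv_sec ) * 1e9
           + ( t1.tv_nsec - t0.tv_nsec ) ) / (double) iters;
}

int main( void )
{
    size_t kb[] = { 16, 128, 4096, 65536 };        /* ~L1, ~L2, ~L3, DRAM  */
    for ( int s = 0; s < 4; s++ )
        printf( "%6zu KB : %6.2f ns / load\n",
                kb[s], chase_ns( kb[s] << 10, (size_t) 1 << 24 ) );
    return 0;
}

On a typical i7 the four lines step roughly from ~1 ns through a few ns and ~10-20 ns up to the ~100 ns DRAM figure, mirroring the L1 / L2 / L3 / main-memory rows above.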
Answer 1: (Score: 57)
Here is a Performance Analysis Guide for the i7 and Xeon range of processors. I should stress that this has what you need and more (for example, check page 22 for some timings and cycles).
Additionally, this page has some details on clock cycles etc. The second link provided the following numbers:
Core i7 Xeon 5500 Series Data Source Latency (approximate) [Pg. 22]
local L1 CACHE hit, ~4 cycles ( 2.1 - 1.2 ns )
local L2 CACHE hit, ~10 cycles ( 5.3 - 3.0 ns )
local L3 CACHE hit, line unshared ~40 cycles ( 21.4 - 12.0 ns )
local L3 CACHE hit, shared line in another core ~65 cycles ( 34.8 - 19.5 ns )
local L3 CACHE hit, modified in another core ~75 cycles ( 40.2 - 22.5 ns )
remote L3 CACHE (Ref: Fig.1 [Pg. 5]) ~100-300 cycles ( 160.7 - 30.0 ns )
local DRAM ~60 ns
remote DRAM ~100 ns
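As a sanity check, the bracketed nanosecond ranges are nothing more than the cycle counts divided by the core clock. A tiny C sketch follows; the 1.87 GHz and 3.33 GHz endpoints are an assumption chosen because they reproduce the quoted brackets for the Xeon 5500 series:

#include <stdio.h>

/* one cycle at f GHz lasts 1/f nanoseconds */
static double cycles_to_ns( double cycles, double ghz ) { return cycles / ghz; }

int main( void )
{
    const double lo = 1.87, hi = 3.33;    /* assumed core-clock range, GHz */
    printf( "L1  ~4 cycles : %4.1f - %4.1f ns\n",
            cycles_to_ns(  4.0, lo ), cycles_to_ns(  4.0, hi ) );
    printf( "L2 ~10 cycles : %4.1f - %4.1f ns\n",
            cycles_to_ns( 10.0, lo ), cycles_to_ns( 10.0, hi ) );
    printf( "L3 ~40 cycles : %4.1f - %4.1f ns\n",
            cycles_to_ns( 40.0, lo ), cycles_to_ns( 40.0, hi ) );
    return 0;
}

which prints 2.1 - 1.2 ns, 5.3 - 3.0 ns and 21.4 - 12.0 ns, matching the table above.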
EDIT2:
Most important is the notice under the cited table, which says:
"Note: these values are rough approximations. They depend on core and uncore frequencies, memory speeds, BIOS settings, number of DIMMs, etc., etc. Your mileage may vary."
EDIT: I should highlight that, as well as timing/cycle information, the above Intel document addresses many more (extremely) useful details about the i7 and Xeon range of processors (from a performance point of view).
Answer 2: (Score: 32)
The values have dropped since 2005, but have stabilized:
1 ns L1 cache
3 ns Branch mispredict
4 ns L2 cache
17 ns Mutex lock/unlock
100 ns Main memory (RAM)
2 000 ns (2µs) 1KB Zippy-compress
Still some improvements, prediction for 2020:
16 000 ns (16µs) SSD random read (olibre's note: should be less)
500 000 ns (½ms) Round trip in datacenter
2 000 000 ns (2ms) HDD random read (seek)
To go further into understanding, I recommend the excellent presentation of modern cache architectures (June 2014) by Gerhard Wellein, Hannes Hofmann and Dietmar Fey at University Erlangen-Nürnberg.
French-speaking people may appreciate an article by SpaceFox comparing a processor with a developer, both waiting for the information required to continue working.
Answer 3: (Score: 20)
Comparing even the simplest CPU / cache / DRAM lineups (even in a uniform memory-access model) is not an easy task: DRAM speed is one factor in determining the unloaded latency, whereas the loaded latency (of a saturated system) is what rules, and it is what enterprise applications will experience, rather than an idle, fully unloaded system. (A small illustrative model of this loaded-versus-unloaded effect follows the tables below.)
Still some improvements, prediction for 2020 (Ref. olibre's answer below)
-------------------------------------------------------------------------
16 000 ns ( 16 µs) SSD random read (olibre's note: should be less)
500 000 ns ( ½ ms) Round trip in datacenter
2 000 000 ns ( 2 ms) HDD random read (seek)
In 2015 there are currently available:
========================================================================
820 ns ( 0.8µs) random read from a SSD-DataPlane
1 200 ns ( 1.2µs) Round trip in datacenter
1 200 ns ( 1.2µs) random read from a HDD-DataPlane
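To make the loaded-versus-unloaded distinction concrete, here is a crude illustrative model in C. An M/M/1 queue is an assumption chosen purely for illustration, not how a real memory controller is specified, but it shows how an idle ~100 ns DRAM reference inflates once the memory subsystem approaches saturation:

#include <stdio.h>

int main( void )
{
    const double service_ns = 100.0;               /* idle DRAM reference  */
    const double util[]     = { 0.00, 0.50, 0.80, 0.95 };
    /* M/M/1 mean response time: S / ( 1 - rho )                           */
    for ( int i = 0; i < 4; i++ )
        printf( "utilisation %.2f -> %6.0f ns mean latency\n",
                util[i], service_ns / ( 1.0 - util[i] ) );
    return 0;
}

At 95 % utilisation the same reference costs ~2 000 ns in this model, which is why loaded figures, not idle ones, bound what enterprise workloads actually see.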
GPU engines have received a lot of technical marketing, while deep internal dependencies are the key to understanding both the real strengths and the real weaknesses these architectures experience in practice (typically quite different from the aggressively marketed expectations).
                  +------------------------------------- 5,6,7,8,9,..12,15,16
                  |                                 +--- 1066,1333,..2800..3300
                  v                                 v
First  word = ( ( CAS latency * 2 ) + ( 1 - 1 ) ) / Data Rate
Fourth word = ( ( CAS latency * 2 ) + ( 4 - 1 ) ) / Data Rate
Eighth word = ( ( CAS latency * 2 ) + ( 8 - 1 ) ) / Data Rate
                                      ^---------------------- 7x .. difference
********************************
So:
===
resulting DDR3-side latencies are between  _____________
                                            3.03 ns    ^
                                                       |
                                           36.58 ns ___v_ based on DDR3 HW facts
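The 3.03 - 36.58 ns span follows directly from the formula above. A minimal C sketch, taking the CL 5 @ 3300 MT/s best case and CL 16 @ 1066 MT/s worst case from the value ranges in the diagram (dividing by the data rate in GT/s yields nanoseconds directly):

#include <stdio.h>

/* ( ( CAS latency * 2 ) + ( word - 1 ) ) / Data Rate; rate in MT/s -> ns */
static double word_ns( int word, int cas, int rate_mts )
{
    return ( ( cas * 2.0 ) + ( word - 1 ) ) / ( rate_mts / 1000.0 );
}

int main( void )
{
    printf( "best   first  word, CL 5 @ 3300 MT/s : %5.2f ns\n", word_ns( 1,  5, 3300 ) );
    printf( "worst  eighth word, CL16 @ 1066 MT/s : %5.2f ns\n", word_ns( 8, 16, 1066 ) );
    printf( "common first  word, CL11 @ 1600 MT/s : %5.2f ns\n", word_ns( 1, 11, 1600 ) );
    return 0;
}

which brackets the quoted 3.03 - 36.58 ns range (up to rounding of the nominal 1066.67 MT/s rate).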
So understanding the internalities is much more important here than in other fields, where the architectures are published and numerous benchmarks are freely available. Many thanks go to the GPU micro-testers, who have spent their time and creativity to unleash the truth about the real schemes of work inside the black-box-approach tested GPU devices.
 1 ns _________ LETS SETUP A TIME/DISTANCE SCALE FIRST:
          °      ^
          |\     |a 1 ft-distance a photon travels in vacuum ( less in dark-fibre )
          | \    |
          |  \   |
        __|___\__v____________________________________________________
          |    |
          |<-->|  a 1 ns TimeDOMAIN "distance", before a photon arrived
          |    |
          ^    v
    DATA  |    |DATA
    RQST'd|    |RECV'd ( DATA XFER/FETCH latency )
25 ns @ 1147 MHz FERMI: GPU Streaming Multiprocessor REGISTER access
35 ns @ 1147 MHz FERMI: GPU Streaming Multiprocessor L1-onHit-[--8kB]CACHE
70 ns @ 1147 MHz FERMI: GPU Streaming Multiprocessor SHARED-MEM access
230 ns @ 1147 MHz FERMI: GPU Streaming Multiprocessor texL1-onHit-[--5kB]CACHE
320 ns @ 1147 MHz FERMI: GPU Streaming Multiprocessor texL2-onHit-[256kB]CACHE
350 ns
700 ns @ 1147 MHz FERMI: GPU Streaming Multiprocessor GLOBAL-MEM access
- - - - -
I apologise for the "bigger picture", but latency de-masking also has cardinal limits imposed by the on-chip smREG/L1/L2 capacities and hit/miss rates.
+====================| + 11-12 [usec] XFER-LATENCY-up HostToDevice ~~~ same as Intel X48 / nForce 790i
| |||||||||||||||||| + 10-11 [usec] XFER-LATENCY-down DeviceToHost
| |||||||||||||||||| ~ 5.5 GB/sec XFER-BW-up ~~~ same as DDR2/DDR3 throughput
| |||||||||||||||||| ~ 5.2 GB/sec XFER-BW-down @8192 KB TEST-LOAD ( immune to attempts to OverClock PCIe_BUS_CLK 100-105-110-115 [MHz] ) [D:4.9.3]
|
| Host-side
| cudaHostRegister( void *ptr, size_t size, unsigned int flags )
| | +-------------- cudaHostRegisterPortable -- marks memory as PINNED MEMORY for all CUDA Contexts, not just the one, current, when the allocation was performed
| ___HostAllocWriteCombined_MEM / cudaHostFree() +---------------- cudaHostRegisterMapped -- maps memory allocation into the CUDA address space ( the Device pointer can be obtained by a call to cudaHostGetDevicePointer( void **pDevice, void *pHost, unsigned int flags=0 ); )
| ___HostRegisterPORTABLE___MEM / cudaHostUnregister( void *ptr )
| ||||||||||||||||||
| ||||||||||||||||||
| | PCIe-2.0 ( 4x) | ~ 4 GB/s over 4-Lanes ( PORT #2 )
| | PCIe-2.0 ( 8x) | ~16 GB/s over 8-Lanes
| | PCIe-2.0 (16x) | ~32 GB/s over 16-Lanes ( mode 16x )
|
| + PCIe-3.0 25-port 97-lanes non-blocking SwitchFabric ... +over copper/fiber
| ~~~ The latest PCIe specification, Gen 3, runs at 8Gbps per serial lane, enabling a 48-lane switch to handle a whopping 96 GBytes/sec. of full duplex peer to peer traffic. [I:]
|
| ~810 [ns] + InRam-"Network" / many-to-many parallel CPU/Memory "message" passing with less than 810 ns latency any-to-any
|
| ||||||||||||||||||
| ||||||||||||||||||
+====================|
|.pci............HOST|
Any low-latency motivated design has to reverse-engineer the "I/O-hydraulics" (as 0/1-XFERs are incompressible by their nature), and the resulting latencies rule the performance envelope of any GPGPU solution, whether it is computationally intensive (read: where processing costs are more forgiving of poorer-latency XFERs...) or not (read: where, perhaps to someone's surprise, CPUs are faster at end-to-end processing than GPU fabrics [citations available]).
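As a back-of-the-envelope illustration of that "I/O-hydraulics" point: end-to-end transfer time is a fixed XFER latency plus payload divided by bandwidth. The C sketch below plugs in the ~11 µs Host-to-Device latency and ~5.5 GB/s bandwidth quoted in the diagram above (platform-specific assumptions, not universal constants); small transfers are latency-bound, large ones bandwidth-bound:

#include <stdio.h>

/* time = fixed setup latency + payload / bandwidth; result in microseconds */
static double xfer_us( double bytes, double latency_us, double gb_per_s )
{
    return latency_us + ( bytes / ( gb_per_s * 1e9 ) ) * 1e6;
}

int main( void )
{
    const double lat = 11.0, bw = 5.5;         /* figures from the diagram */
    const double sizes[] = { 4e3, 64e3, 1e6, 8.192e6 };
    for ( int i = 0; i < 4; i++ )
        printf( "%9.0f B -> %8.1f us\n", sizes[i], xfer_us( sizes[i], lat, bw ) );
    return 0;
}

For a 4 KB payload the fixed latency dominates (~11.7 µs total), while the 8192 KB test load from the diagram is almost purely bandwidth-bound (~1 500 µs).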
Answer 4: (Score: 2)
Take a look at this "staircase" plot, which perfectly illustrates the different access times (in terms of clock ticks). Notice that the red CPU has an additional "step", probably because it has an L4 cache (which the others do not).
Taken from this Extremetech article.
In computer science this is called "I/O complexity".