任何人都可以给我大概的时间(以纳秒为单位)来访问L1,L2和L3缓存,以及Intel i7处理器上的主内存吗?
虽然这不是一个特别的编程问题,但是对于某些低延迟编程挑战而言,了解这些速度细节是必要的.
每个人应该知道的数字
0.5 ns - CPU L1 dCACHE reference 1 ns - speed-of-light (a photon) travel a 1 ft (30.5cm) distance 5 ns - CPU L1 iCACHE Branch mispredict 7 ns - CPU L2 CACHE reference 71 ns - CPU cross-QPI/NUMA best case on XEON E5-46* 100 ns - MUTEX lock/unlock 100 ns - own DDR MEMORY reference 135 ns - CPU cross-QPI/NUMA best case on XEON E7-* 202 ns - CPU cross-QPI/NUMA worst case on XEON E7-* 325 ns - CPU cross-QPI/NUMA worst case on XEON E5-46* 10,000 ns - Compress 1K bytes with Zippy PROCESS 20,000 ns - Send 2K bytes over 1 Gbps NETWORK 250,000 ns - Read 1 MB sequentially from MEMORY 500,000 ns - Round trip within a same DataCenter 10,000,000 ns - DISK seek 10,000,000 ns - Read 1 MB sequentially from NETWORK 30,000,000 ns - Read 1 MB sequentially from DISK 150,000,000 ns - Send a NETWORK packet CA -> Netherlands | | | | | | | ns| | | us| | ms|
来源:最初由彼得·诺维格:
- http://norvig.com/21-days.html#answers
- http://surana.wordpress.com/2009/01/01/numbers-everyone-should-know/,
- http://sites.google.com/site/io/building-scalable-web-applications-with-google-app-engine
以下是 i7和Xeon系列处理器的性能分析指南.我应该强调,这有你需要的东西和更多(例如,检查第22页的某些时间和周期).
此外,此页面还有一些关于时钟周期等的详细信息.第二个链接提供以下数字:
Core i7 Xeon 5500 Series Data Source Latency (approximate) [Pg. 22] local L1 CACHE hit, ~4 cycles ( 2.1 - 1.2 ns ) local L2 CACHE hit, ~10 cycles ( 5.3 - 3.0 ns ) local L3 CACHE hit, line unshared ~40 cycles ( 21.4 - 12.0 ns ) local L3 CACHE hit, shared line in another core ~65 cycles ( 34.8 - 19.5 ns ) local L3 CACHE hit, modified in another core ~75 cycles ( 40.2 - 22.5 ns ) remote L3 CACHE (Ref: Fig.1 [Pg. 5]) ~100-300 cycles ( 160.7 - 30.0 ns ) local DRAM ~60 ns remote DRAM ~100 ns
EDIT2
:
最重要的是引用表下的通知,说:
"注意:这些值是粗略的近似.他们依赖核心和UNCORE频率,记忆速度,BIOS设置,墨水数量等等.你的里程可能会有所不同. "
编辑:我应该强调,除了时序/周期信息,上面的英特尔文档还解决了i7和Xeon系列处理器的更多(非常)有用的细节(从性能的角度来看).
请参阅此页面,了解从1990年到2020年的内存延迟减少情况.
自2005年以来,价值已经下降但趋于稳定
1 ns L1 cache 3 ns Branch mispredict 4 ns L2 cache 17 ns Mutex lock/unlock 100 ns Main memory (RAM) 2 000 ns (2µs) 1KB Zippy-compress
还有一些改进,预测到2020年
16 000 ns (16µs) SSD random read (olibre's note: should be less) 500 000 ns (½ms) Round trip in datacenter 2 000 000 ns (2ms) HDD random read (seek)
每个程序员应该从Ulrich Drepper那里了解内存(2007)
Old,但仍然是关于内存硬件和软件交互的极好的深刻解释.
完整PDF(114页)
关于LWN关于PDF版本的评论
另外 的人
关于LWN +评论的七篇帖子
第1部分 - 简介
第2部分 - 缓存
第3部分 - 虚拟内存
第4部分 - NUMA支持
第5部分 - 程序员可以做什么
第6部分 - 程序员可以做的更多事情
第7部分 - 内存性能工具
发布词之间的无限空间基础上的书codinghorror.com 系统的性能:企业和云
单击http://www.7-cpu.com/上列出的每个处理器以查看L1/L2/L3/RAM/...延迟(例如,Haswell i7-4770具有L1 = 1ns,L2 = 3ns,L3 = 10ns,RAM = 67ns,BranchMisprediction = 4ns)
http://idarkside.org/posts/numbers-you-should-know/
如需进一步了解,我推荐的优秀现代高速缓存架构呈现从(2014年6月),格哈德·Wellein,汉内斯·霍夫曼和迪特马尔·法伊在大学埃尔兰根-纽伦堡.
讲法语的人可能会欣赏SpaceFox的一篇文章,比较处理器和开发人员都在等待继续工作所需的信息.
Still some improvements, prediction for 2020 (Ref. olibre's answer below) ------------------------------------------------------------------------- 16 000 ns ( 16 µs) SSD random read (olibre's note: should be less) 500 000 ns ( ½ ms) Round trip in datacenter 2 000 000 ns ( 2 ms) HDD random read (seek) In 2015 there are currently available: ======================================================================== 820 ns ( 0.8µs) random read from a SSD-DataPlane 1 200 ns ( 1.2µs) Round trip in datacenter 1 200 ns ( 1.2µs) random read from a HDD-DataPlane
即使是最简单的CPU /缓存/ DRAM阵容(即使在统一的内存访问模型中)也不是一项容易的任务,其中DRAM速度是确定延迟的因素,加载延迟(饱和系统),后者规则和企业应用程序将体验的不仅仅是空闲的完全卸载系统.
+----------------------------------- 5,6,7,8,9,..12,15,16 | +--- 1066,1333,..2800..3300 v v First word = ( ( CAS latency * 2 ) + ( 1 - 1 ) ) / Data Rate Fourth word = ( ( CAS latency * 2 ) + ( 4 - 1 ) ) / Data Rate Eighth word = ( ( CAS latency * 2 ) + ( 8 - 1 ) ) / Data Rate ^----------------------- 7x .. difference ******************************** So: === resulting DDR3-side latencies are between _____________ 3.03 ns ^ | 36.58 ns ___v_ based on DDR3 HW facts
GPU引擎已经获得了大量的技术营销,而深层内部依赖关系是理解这些架构在实践中体验的真正优势和真正弱点的关键(通常与积极的营销吹口哨的期望大不相同).
1 ns _________ LETS SETUP A TIME/DISTANCE SCALE FIRST: ° ^ |\ |a 1 ft-distance a foton travels in vacuum ( less in dark-fibre ) | \ | | \ | __|___\__v____________________________________________________ | | |<-->| a 1 ns TimeDOMAIN "distance", before a foton arrived | | ^ v DATA | |DATA RQST'd| |RECV'd ( DATA XFER/FETCH latency ) 25 ns @ 1147 MHz FERMI: GPU Streaming Multiprocessor REGISTER access 35 ns @ 1147 MHz FERMI: GPU Streaming Multiprocessor L1-onHit-[--8kB]CACHE 70 ns @ 1147 MHz FERMI: GPU Streaming Multiprocessor SHARED-MEM access 230 ns @ 1147 MHz FERMI: GPU Streaming Multiprocessor texL1-onHit-[--5kB]CACHE 320 ns @ 1147 MHz FERMI: GPU Streaming Multiprocessor texL2-onHit-[256kB]CACHE 350 ns 700 ns @ 1147 MHz FERMI: GPU Streaming Multiprocessor GLOBAL-MEM access - - - - -
因此,了解内部性比其他领域更重要,因为其他领域的架构已经发布,并且可以免费获得许多基准.非常感谢GPU微测试人员,他们花时间和创造力在黑盒方法测试的GPU设备中释放真正的工作方案的真相.
+====================| + 11-12 [usec] XFER-LATENCY-up HostToDevice ~~~ same as Intel X48 / nForce 790i | |||||||||||||||||| + 10-11 [usec] XFER-LATENCY-down DeviceToHost | |||||||||||||||||| ~ 5.5 GB/sec XFER-BW-up ~~~ same as DDR2/DDR3 throughput | |||||||||||||||||| ~ 5.2 GB/sec XFER-BW-down @8192 KB TEST-LOAD ( immune to attempts to OverClock PCIe_BUS_CLK 100-105-110-115 [MHz] ) [D:4.9.3] | | Host-side | cudaHostRegister( void *ptr, size_t size, unsigned int flags ) | | +-------------- cudaHostRegisterPortable -- marks memory as PINNED MEMORY for all CUDA Contexts, not just the one, current, when the allocation was performed | ___HostAllocWriteCombined_MEM / cudaHostFree() +---------------- cudaHostRegisterMapped -- maps memory allocation into the CUDA address space ( the Device pointer can be obtained by a call to cudaHostGetDevicePointer( void **pDevice, void *pHost, unsigned int flags=0 ); ) | ___HostRegisterPORTABLE___MEM / cudaHostUnregister( void *ptr ) | |||||||||||||||||| | |||||||||||||||||| | | PCIe-2.0 ( 4x) | ~ 4 GB/s over 4-Lanes ( PORT #2 ) | | PCIe-2.0 ( 8x) | ~16 GB/s over 8-Lanes | | PCIe-2.0 (16x) | ~32 GB/s over 16-Lanes ( mode 16x ) | | + PCIe-3.0 25-port 97-lanes non-blocking SwitchFabric ... +over copper/fiber | ~~~ The latest PCIe specification, Gen 3, runs at 8Gbps per serial lane, enabling a 48-lane switch to handle a whopping 96 GBytes/sec. of full duplex peer to peer traffic. [I:] | | ~810 [ns] + InRam-"Network" / many-to-many parallel CPU/Memory "message" passing with less than 810 ns latency any-to-any | | |||||||||||||||||| | |||||||||||||||||| +====================| |.pci............HOST|
我对"更大的图片"表示道歉,但是延迟 - 去掩码也受到片上smREG/L1/L2容量和命中/未命中率的限制.
|.pci............GPU.| | | FERMI [GPU-CLK] ~ 0.9 [ns] but THE I/O LATENCIES PAR -- ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| <800> warps ~~ 24000 + 3200 threads ~~ 27200 threads [!!] | ^^^^^^^^|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ [!!] | smREGs________________________________________ penalty +400 ~ +800 [GPU_CLKs] latency ( maskable by 400~800 WARPs ) on-designed spillover(s) to locMEM__ | +350 ~ +700 [ns] @1147 MHz FERMI ^^^^^^^^ | | ^^^^^^^^ | +5 [ns] @ 200 MHz FPGA. . . . . . Xilinx/Zync Z7020/FPGA massive-parallel streamline-computing mode ev. PicoBlazer softCPU | | ^^^^^^^^ | ~ +20 [ns] @1147 MHz FERMI ^^^^^^^^ | SM-REGISTERs/thread: max 63 for CC-2.x -with only about +22 [GPU_CLKs] latency ( maskable by 22-WARPs ) to hide on [REGISTER DEPENDENCY] when arithmetic result is to be served from previous [INSTR] [G]:10.4, Page-46 | max 63 for CC-3.0 - about +11 [GPU_CLKs] latency ( maskable by 44-WARPs ) [B]:5.2.3, Page-73 | max 128 for CC-1.x PAR -- ||||||||~~~| | max 255 for CC-3.5 PAR -- ||||||||||||||||||~~~~~~| | | smREGs___BW ANALYZE REAL USE-PATTERNs IN PTX-creation PHASE << -Xptxas -v || nvcc -maxrregcount ( w|w/o spillover(s) ) | with about 8.0 TB/s BW [C:Pg.46] | 1.3 TB/s BW shaMEM___ 4B * 32banks * 15 SMs * half 1.4GHz = 1.3 TB/s only on FERMI | 0.1 TB/s BW gloMEM___ | ________________________________________________________________________________________________________________________________________________________________________________________________________________________ +========| DEVICE:3 PERSISTENT gloMEM___ | _|______________________________________________________________________________________________________________________________________________________________________________________________________________________ +======| DEVICE:2 PERSISTENT gloMEM___ | _|______________________________________________________________________________________________________________________________________________________________________________________________________________________ +====| DEVICE:1 PERSISTENT gloMEM___ | _|______________________________________________________________________________________________________________________________________________________________________________________________________________________ +==| DEVICE:0 PERSISTENT gloMEM_____________________________________________________________________+440 [GPU_CLKs]_________________________________________________________________________|_GB| ! | |\ + | o | texMEM___|_\___________________________________texMEM______________________+_______________________________________________________________________________________|_MB| | |\ \ |\ + |\ | | texL2cache_| \ \ .| \_ _ _ _ _ _ _ _texL2cache +370 [GPU_CLKs] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ | \ 256_KB| | | \ \ | \ + |\ ^ \ | | | \ \ | \ + | \ ^ \ | | | \ \ | \ + | \ ^ \ | | texL1cache_| \ \ .| \_ _ _ _ _ _texL1cache +260 [GPU_CLKs] _ _ _ _ _ _ _ _ _ | \_ _ _ _ _^ \ 5_KB| | | \ \ | \ + ^\ ^ \ ^\ \ | | shaMEM + conL3cache_| \ \ | \ _ _ _ _ conL3cache +220 [GPU_CLKs] ^ \ ^ \ ^ \ \ 32_KB| | | \ \ | \ ^\ + ^ \ ^ \ ^ \ \ | | | \ \ | \ ^ \ + ^ \ ^ \ ^ \ \ | | ______________________|__________\_\_______________________|__________\_____^__\________+__________________________________________\_________\_____\________________________________| | +220 [GPU-CLKs]_| |_ _ _ ___|\ \ \_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \ _ _ _ _\_ _ _ _+220 [GPU_CLKs] on re-use at some +50 GPU_CLKs _IF_ a FETCH from yet-in-shaL2cache | L2-on-re-use-only +80 [GPU-CLKs]_| 64 KB L2_|_ _ _ __|\\ \ \_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \ _ _ _ _\_ _ _ + 80 [GPU_CLKs] on re-use from L1-cached (HIT) _IF_ a FETCH from yet-in-shaL1cache | L1-on-re-use-only +40 [GPU-CLKs]_| 8 KB L1_|_ _ _ _|\\\ \_\__________________________________\________\_____+ 40 [GPU_CLKs]_____________________________________________________________________________| | L1-on-re-use-only + 8 [GPU-CLKs]_| 2 KB L1_|__________|\\\\__________\_\__________________________________\________\____+ 8 [GPU_CLKs]_________________________________________________________conL1cache 2_KB| | on-chip|smREG +22 [GPU-CLKs]_| |t[0_______^:~~~~~~~~~~~~~~~~\:________] |CC- MAX |_|_|_|_|_|_|_|_|_|_|_| |t[1_______^ :________] |2.x 63 |_|_|_|_|_|_|_|_|_|_|_| |t[2_______^ :________] |1.x 128 |_|_|_|_|_|_|_|_|_|_|_| |t[3_______^ :________] |3.5 255 REGISTERs|_|_|_|_|_|_|_|_| |t[4_______^ :________] | per|_|_|_|_|_|_|_|_|_|_|_| |t[5_______^ :________] | Thread_|_|_|_|_|_|_|_|_|_| |t[6_______^ :________] | |_|_|_|_|_|_|_|_|_|_|_| |t[7_______^ 1stHalf-WARP :________]______________ | |_|_|_|_|_|_|_|_|_|_|_| |t[ 8_______^:~~~~~~~~~~~~~~~~~:________] | |_|_|_|_|_|_|_|_|_|_|_| |t[ 9_______^ :________] | |_|_|_|_|_|_|_|_|_|_|_| |t[ A_______^ :________] | |_|_|_|_|_|_|_|_|_|_|_| |t[ B_______^ :________] | |_|_|_|_|_|_|_|_|_|_|_| |t[ C_______^ :________] | |_|_|_|_|_|_|_|_|_|_|_| |t[ D_______^ :________] | |_|_|_|_|_|_|_|_|_|_|_| |t[ E_______^ :________] | |_|_|_|_|_|_|_|_|_|_|_| W0..|t[ F_______^____________WARP__:________]_____________ | |_|_|_|_|_|_|_|_|_|_|_| .............. | |_|_|_|_|_|_|_|_|_|_|_| ............|t[0_______^:~~~~~~~~~~~~~~~\:________] | |_|_|_|_|_|_|_|_|_|_|_| ............|t[1_______^ :________] | |_|_|_|_|_|_|_|_|_|_|_| ............|t[2_______^ :________] | |_|_|_|_|_|_|_|_|_|_|_| ............|t[3_______^ :________] | |_|_|_|_|_|_|_|_|_|_|_| ............|t[4_______^ :________] | |_|_|_|_|_|_|_|_|_|_|_| ............|t[5_______^ :________] | |_|_|_|_|_|_|_|_|_|_|_| ............|t[6_______^ :________] | |_|_|_|_|_|_|_|_|_|_|_| ............|t[7_______^ 1stHalf-WARP :________]______________ | |_|_|_|_|_|_|_|_|_|_|_| ............|t[ 8_______^:~~~~~~~~~~~~~~~~:________] | |_|_|_|_|_|_|_|_|_|_|_| ............|t[ 9_______^ :________] | |_|_|_|_|_|_|_|_|_|_|_| ............|t[ A_______^ :________] | |_|_|_|_|_|_|_|_|_|_|_| ............|t[ B_______^ :________] | |_|_|_|_|_|_|_|_|_|_|_| ............|t[ C_______^ :________] | |_|_|_|_|_|_|_|_|_|_|_| ............|t[ D_______^ :________] | |_|_|_|_|_|_|_|_|_|_|_| ............|t[ E_______^ :________] | |_|_|_|_|_|_|_|_|_|_|_| W1..............|t[ F_______^___________WARP__:________]_____________ | |_|_|_|_|_|_|_|_|_|_|_| .................................................... | |_|_|_|_|_|_|_|_|_|_|_| ...................................................|t[0_______^:~~~~~~~~~~~~~~~\:________] | |_|_|_|_|_|_|_|_|_|_|_| ...................................................|t[1_______^ :________] | |_|_|_|_|_|_|_|_|_|_|_| ...................................................|t[2_______^ :________] | |_|_|_|_|_|_|_|_|_|_|_| ...................................................|t[3_______^ :________] | |_|_|_|_|_|_|_|_|_|_|_| ...................................................|t[4_______^ :________] | |_|_|_|_|_|_|_|_|_|_|_| ...................................................|t[5_______^ :________] | |_|_|_|_|_|_|_|_|_|_|_| ...................................................|t[6_______^ :________] | |_|_|_|_|_|_|_|_|_|_|_| ...................................................|t[7_______^ 1stHalf-WARP :________]______________ | |_|_|_|_|_|_|_|_|_|_|_| ...................................................|t[ 8_______^:~~~~~~~~~~~~~~~~:________] | |_|_|_|_|_|_|_|_|_|_|_| ...................................................|t[ 9_______^ :________] | |_|_|_|_|_|_|_|_|_|_|_| ...................................................|t[ A_______^ :________] | |_|_|_|_|_|_|_|_|_|_|_| ...................................................|t[ B_______^ :________] | |_|_|_|_|_|_|_|_|_|_|_| ...................................................|t[ C_______^ :________] | |_|_|_|_|_|_|_|_|_|_|_| ...................................................|t[ D_______^ :________] | |_|_|_|_|_|_|_|_|_|_|_| ...................................................|t[ E_______^ :________] | |_|_|_|_|_|_|_|_|_|_|_|tBlock Wn....................................................|t[ F_______^___________WARP__:________]_____________ | | ________________ °°°°°°°°°°°°°°°°°°°°°°°°°°~~~~~~~~~~°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°° | / \ CC-2.0|||||||||||||||||||||||||| ~masked |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| | / \ 1.hW ^|^|^|^|^|^|^|^|^|^|^|^|^| -s ^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^| | / \ 2.hW |^|^|^|^|^|^|^|^|^|^|^|^|^ |^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^ |_______________/ \______I|I|I|I|I|I|I|I|I|I|I|I|I|~~~~~~~~~~I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I| |~~~~~~~~~~~~~~/ SM:0.warpScheduler /~~~~~~~I~I~I~I~I~I~I~I~I~I~I~I~I~~~~~~~~~~~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I | \ | // | \ RR-mode // | \ GREEDY-mode // | \________________// | \______________/SM:0__________________________________________________________________________________ | | |t[ F_______^___________WARP__:________]_______ | ..|SM:1__________________________________________________________________________________ | | |t[ F_______^___________WARP__:________]_______ | ..|SM:2__________________________________________________________________________________ | | |t[ F_______^___________WARP__:________]_______ | ..|SM:3__________________________________________________________________________________ | | |t[ F_______^___________WARP__:________]_______ | ..|SM:4__________________________________________________________________________________ | | |t[ F_______^___________WARP__:________]_______ | ..|SM:5__________________________________________________________________________________ | | |t[ F_______^___________WARP__:________]_______ | ..|SM:6__________________________________________________________________________________ | | |t[ F_______^___________WARP__:________]_______ | ..|SM:7__________________________________________________________________________________ | | |t[ F_______^___________WARP__:________]_______ | ..|SM:8__________________________________________________________________________________ | | |t[ F_______^___________WARP__:________]_______ | ..|SM:9__________________________________________________________________________________ | ..|SM:A |t[ F_______^___________WARP__:________]_______ | ..|SM:B |t[ F_______^___________WARP__:________]_______ | ..|SM:C |t[ F_______^___________WARP__:________]_______ | ..|SM:D |t[ F_______^___________WARP__:________]_______ | |_______________________________________________________________________________________ */
任何低延迟动机设计必须相当反向工程"I/O-液压"(如0 1-XFERs是由性质不可压缩),将所得的延迟排除任何GPGPU溶液中的性能包络线是它计算密集的(读:处理成本宽容多一点,延迟时间较差XFERs ......)或者没有(读取:在哪里(可能让某人感到惊讶)CPU-s在端到端处理方面比GPU 架构更快[引用可用] ).