Question

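In Agner Fog's manual Optimizing software in C++, section 9.10 "Cache contentions in large data structures", he describes a problem with transposing a matrix when the matrix width equals something called the critical stride. In his test, the cost for a matrix that fits in L1 is 40% greater when the width equals the critical stride. If the matrix is larger and only fits in L2, the cost is 600%! This is summed up nicely in Table 9.1 of his text. It is essentially the same effect observed in Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?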

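Later he writes:

The reason why this effect is so much stronger for level-2 cache contentions than for level-1 cache contentions is that the level-2 cache cannot prefetch more than one line at a time.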

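So my questions are related to prefetching data.

From his comment I infer that L1 can prefetch more than one cache line at a time. How many can it prefetch?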

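From what I understand, trying to write code that prefetches data explicitly (e.g. with _mm_prefetch) is rarely helpful. The only example I have read about is Prefetching Examples?, and it shows only an O(10%) improvement (on some machines). Agner later explains this:

The reason is that modern processors prefetch data automatically thanks to out-of-order execution and advanced prediction mechanisms. Modern microprocessors are able to automatically prefetch data for regular access patterns containing multiple streams with different strides. Therefore, you don't have to prefetch data explicitly if data access can be arranged in regular patterns with fixed strides.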
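For reference, the kind of explicit prefetch I mean looks roughly like the sketch below. It is only an illustration: `prefetch_hint`, `gather_sum`, and the `distance` parameter are names I made up, and the look-ahead distance of 16 is an arbitrary guess, not a tuned value. The access pattern is deliberately index-driven, since that is a case with no fixed stride for the hardware to pick up; whether the hint helps at all depends on the machine.

```cpp
#include <cstddef>
#include <vector>

#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0
#define HAVE_MM_PREFETCH 1
#else
#define HAVE_MM_PREFETCH 0
#endif

// Hint that the cache line containing p will be needed soon.
// Assumption: on non-x86 targets this wrapper simply does nothing.
inline void prefetch_hint(const void* p) {
#if HAVE_MM_PREFETCH
    _mm_prefetch(static_cast<const char*>(p), _MM_HINT_T0);
#else
    (void)p;
#endif
}

// Sum data[idx[i]] over all i. The addresses depend on the contents of idx,
// so there is no fixed stride for a hardware prefetcher to detect.
// `distance` (hypothetical tuning knob) is how many iterations ahead we hint.
long long gather_sum(const std::vector<int>& data,
                     const std::vector<std::size_t>& idx,
                     std::size_t distance = 16) {
    long long sum = 0;
    const std::size_t n = idx.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (i + distance < n) {
            prefetch_hint(&data[idx[i + distance]]);  // hint only, no data dependency
        }
        sum += data[idx[i]];
    }
    return sum;
}
```

The prefetch is purely a hint, so the result is identical with or without it; only the timing can change.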

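So how does the CPU decide which data to prefetch, and are there ways to help the CPU make better prefetching choices (e.g. "regular patterns with fixed strides")?

Edit: Based on a comment by Leeor, let me add to my question and make it more interesting. Why does the critical stride have so much more effect on L2 than on L1?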
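To make the critical-stride effect concrete, here is a small sketch that counts how many distinct cache sets one column of an n x n matrix lands in. It rests on several simplifying assumptions of mine: 4-byte elements, 64-byte cache lines, a matrix starting at address 0, and a plain modulo set index (real L2 indexing uses physical addresses, so this is only a model). `CacheGeometry` and `sets_touched_by_column` are hypothetical names.

```cpp
#include <cstddef>
#include <set>

// Simplified cache model: total size, associativity, and line size in bytes.
struct CacheGeometry {
    std::size_t size_bytes;
    std::size_t ways;
    std::size_t line_bytes;
    std::size_t sets() const { return size_bytes / (ways * line_bytes); }
};

// Count the distinct sets touched when walking down one column of an
// n x n row-major matrix. Assumes the matrix starts at address 0 and that
// the set index is simply (address / line size) modulo the number of sets.
std::size_t sets_touched_by_column(std::size_t n, const CacheGeometry& c,
                                   std::size_t elem_bytes = 4) {
    std::set<std::size_t> seen;
    for (std::size_t row = 0; row < n; ++row) {
        const std::size_t addr = row * n * elem_bytes;  // column 0 of each row
        seen.insert((addr / c.line_bytes) % c.sets());
    }
    return seen.size();
}
```

Under this model, a 512-wide column falls into only 2 of the 64 sets of a 32 KB 8-way L1 and 16 of the 512 sets of a 256 KB 8-way L2, while a 513-wide column spreads over all 64 and all 512 sets respectively. With 8 ways per set, 2 sets can hold only 16 lines of a 512-row column, which is where the contention comes from.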

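Edit: I tried to reproduce Agner Fog's table using the code at Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?. I ran it with MSVC2013 in 64-bit release mode on a Xeon E5 1620 (Ivy Bridge), which has L1 32 KB 8-way, L2 256 KB 8-way, and L3 10 MB 20-way. The maximum matrix size is about 90x90 for L1, 256x256 for L2, and 1619x1619 for L3.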

Matrix Size  Average Time
64x64        0.004251 0.004472 0.004412 (three times)
65x65        0.004422 0.004442 0.004632 (three times)
128x128      0.0409
129x129      0.0169
256x256      0.219   //max L2 matrix size
257x257      0.0692
512x512      2.701
513x513      0.649
1024x1024    12.8
1025x1025    10.1

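I don't see any performance loss in L1, but L2 clearly has the critical stride problem, and maybe L3 does too. I'm not sure yet why L1 does not show the problem. It's possible that some other source of background overhead is dominating the L1 times.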
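For anyone checking the numbers above: the matrix-size limits and the critical strides follow from simple arithmetic on the cache parameters. The sketch below assumes 4-byte matrix elements (that is what makes 90x90, 256x256 and 1619x1619 come out) and uses the definition critical stride = cache size / number of ways. The function names are my own.

```cpp
#include <cstddef>

// Critical stride in bytes: (number of sets) x (line size),
// which equals (total cache size) / (number of ways).
constexpr std::size_t critical_stride(std::size_t cache_bytes, std::size_t ways) {
    return cache_bytes / ways;
}

// Largest n such that an n x n matrix of elem_bytes-sized elements
// fits in cache_bytes. Integer loop to avoid floating-point rounding.
inline std::size_t max_square_dim(std::size_t cache_bytes,
                                  std::size_t elem_bytes = 4) {
    std::size_t n = 0;
    while ((n + 1) * (n + 1) * elem_bytes <= cache_bytes) {
        ++n;
    }
    return n;
}
```

With the Xeon's parameters this gives a critical stride of 4096 bytes for L1 and 32768 bytes for L2, and maximum square sizes of 90, 256 and 1619 for L1, L2 and L3. Note that a 1024-wide row of 4-byte elements is exactly 4096 bytes, i.e. one L1 critical stride, and a 512-wide row is half of it.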

Answer 1

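This statement:

the level-2 cache cannot prefetch more than one line at a time.

is incorrect.

In fact, the L2 prefetchers are often stronger and more aggressive than the L1 prefetchers. It depends on the actual machine you use, but Intel's L2 prefetcher, for example, can trigger 2 prefetches for each request, while the L1 is usually more limited (several types of prefetchers can coexist in the L1, but they are likely competing for a more limited bandwidth than the L2 has at its disposal, so there will probably be fewer prefetches coming out of the L1).

The optimization guide, in Section 2.3.5.4 (Data Prefetching), lists the following prefetcher types: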

Two hardware prefetchers load data to the L1 DCache:
- Data cache unit (DCU) prefetcher: This prefetcher, also known as the streaming prefetcher, is triggered by an ascending access to very recently loaded data. The processor assumes that this access is part of a streaming algorithm and automatically fetches the next line.
- Instruction pointer (IP)-based stride prefetcher: This prefetcher keeps track of individual load instructions. If a load instruction is detected to have a regular stride, then a prefetch is sent to the next address which is the sum of the current address and the stride. This prefetcher can prefetch forward or backward and can detect strides of up to 2K bytes.

Data Prefetch to the L2 and Last Level Cache:
- Spatial Prefetcher: This prefetcher strives to complete every cache line fetched to the L2 cache with the pair line that completes it to a 128-byte aligned chunk.
- Streamer: This prefetcher monitors read requests from the L1 cache for ascending and descending sequences of addresses. Monitored read requests include L1 DCache requests initiated by load and store operations and by the hardware prefetchers, and L1 ICache requests for code fetch. When a forward or backward stream of requests is detected, the anticipated cache lines are prefetched. Prefetched cache lines must be in the same 4K page.

And a bit further on:

... The streamer may issue two prefetch requests on every L2 lookup. The streamer can run up to 20 lines ahead of the load request.

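Of the above, only the IP-based prefetcher can handle strides greater than one cache line (the streaming ones can deal with anything that uses consecutive cache lines, meaning strides of up to 64 bytes, or actually up to 128 bytes if you don't mind some extra lines). To make use of it, make sure that the load or store at a given instruction address performs strided accesses - that is usually already the case in loops that walk over arrays. Compiler loop unrolling may split this into multiple stride streams with larger strides - that would work even better (the lookahead would be larger), unless you exceed the number of outstanding tracked IPs - again, this depends on the exact implementation.

However, if your access pattern does consist of consecutive lines, the L2 streamer is much more efficient than the L1 prefetchers, since it runs ahead faster.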
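To illustrate what "a given load performs strided accesses" means, here is a minimal sketch of a column walk over a row-major array, plus a hand-unrolled variant. The function names and the `ld` (leading dimension) parameter are mine, and nothing here guarantees that a particular prefetcher engages; it only shows the shape of the access pattern. In the first loop, a single load instruction advances by `ld * sizeof(int)` bytes per iteration. In the second, there are two load instructions, each advancing by twice that.

```cpp
#include <cstddef>
#include <vector>

// Sum one column of a row-major matrix with leading dimension ld.
// The single load in the loop body has a fixed stride of ld * sizeof(int) bytes.
long long column_sum(const std::vector<int>& m, std::size_t rows,
                     std::size_t ld, std::size_t col) {
    long long s = 0;
    for (std::size_t r = 0; r < rows; ++r) {
        s += m[r * ld + col];
    }
    return s;
}

// Same sum, unrolled by two by hand: two distinct load instructions,
// each with a fixed stride of 2 * ld * sizeof(int) bytes.
long long column_sum_unrolled2(const std::vector<int>& m, std::size_t rows,
                               std::size_t ld, std::size_t col) {
    long long s0 = 0, s1 = 0;
    std::size_t r = 0;
    for (; r + 1 < rows; r += 2) {
        s0 += m[r * ld + col];        // stream 1: rows 0, 2, 4, ...
        s1 += m[(r + 1) * ld + col];  // stream 2: rows 1, 3, 5, ...
    }
    if (r < rows) {
        s0 += m[r * ld + col];        // leftover row when rows is odd
    }
    return s0 + s1;
}
```

Keep in mind that the guide text quoted above says the IP-based prefetcher detects strides of up to 2K bytes, so the unrolled form presumably only benefits while the per-instruction stride stays within that limit. An optimizing compiler may also unroll or otherwise transform the first loop on its own.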
