
Prefetching data to L1 and L2


In Agner Fog's manual Optimizing software in C++, in Section 9.10 "Cache contentions in large data structures", he describes the problem of transposing a matrix when the matrix width is equal to something called the critical stride. In his tests, the cost for a matrix in L1 is 40% higher when the width is equal to the critical stride. If the matrix is even larger and only fits in L2, the cost is 600%! This is summed up nicely in Table 9.1 of his text. It is the same effect observed in Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?
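
For concreteness, the critical stride can be computed as the cache size divided by the associativity (i.e. the size of one way); addresses a multiple of it apart map to the same cache set. A minimal sketch of the arithmetic (the cache parameters are assumptions matching the CPUs discussed here):

    #include <cstdio>

    int main() {
        // Critical stride = cache size / associativity, i.e. the size of one way.
        // Addresses that differ by a multiple of it compete for the same set.
        const int l1_size = 32 * 1024,  l1_ways = 8;  // assumed: typical L1D
        const int l2_size = 256 * 1024, l2_ways = 8;  // assumed: typical L2
        std::printf("L1 critical stride: %d bytes\n", l1_size / l1_ways); // 4096
        std::printf("L2 critical stride: %d bytes\n", l2_size / l2_ways); // 32768
        // A row of 512 doubles is 4096 bytes, so walking down a column of a
        // 512x512 matrix of doubles hits the same L1 set on every element,
        // and an 8-way set can hold only 8 of them.
        return 0;
    }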

Later he writes:

The reason why this effect is much stronger for level-2 cache contentions than for level-1 cache contentions is that the level-2 cache cannot prefetch more than one line at a time.

So my questions are related to prefetching data.

Based on his comment, I infer that L1 can prefetch more than one cache line at a time. How many can it prefetch?

From what I understand, trying to write code to prefetch the data (e.g. with _mm_prefetch) is rarely ever helpful. The only example I have read of is Prefetching Examples?, and it gives only an O(10%) improvement (on some machines). Agner later explains this:

The reason is that modern processors prefetch data automatically thanks to out-of-order execution and advanced prediction mechanisms. Modern microprocessors are able to automatically prefetch data for regular access patterns containing multiple streams with different strides. Therefore, you don't have to prefetch data explicitly if data access can be arranged in regular patterns with fixed strides.

So how does the CPU decide which data to prefetch and are there ways to help the CPU make better choices for the prefetching (e.g. "regular patterns with fixed strides")?
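
For reference, here is a minimal sketch of explicit prefetching with _mm_prefetch. The prefetch distance is an assumption that has to be tuned per machine; on a plain sequential scan like this the hardware prefetchers usually make it redundant, which fits the small gains mentioned above:

    #include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0
    #include <cstddef>

    // Sum an array while software-prefetching a fixed distance ahead.
    // PF_DIST is an assumption; the useful distance depends on the machine
    // and on how far the hardware prefetchers already run ahead.
    double sum_with_prefetch(const double* a, std::size_t n) {
        const std::size_t PF_DIST = 64;  // elements, i.e. 8 cache lines of doubles
        double s = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            if (i + PF_DIST < n)
                _mm_prefetch(reinterpret_cast<const char*>(a + i + PF_DIST),
                             _MM_HINT_T0);
            s += a[i];
        }
        return s;
    }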

Edit: Based on a comment from Leeor, let me add to my questions and make them more interesting: Why does the critical stride have so much more of an effect on L2 compared to L1?

Edit: I tried to reproduce Agner Fog's table using the code from Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513? I ran it with MSVC2013 in 64-bit release mode on a Xeon E5 1620 (Ivy Bridge), which has L1 32KB 8-way, L2 256KB 8-way, and L3 10MB 20-way. The max matrix size for L1 is about 90x90, 256x256 for L2, and 1619 for L3.

Matrix Size  Average Time
64x64        0.004251 0.004472 0.004412 (three times)
65x65        0.004422 0.004442 0.004632 (three times)
128x128      0.0409
129x129      0.0169
256x256      0.219   //max L2 matrix size
257x257      0.0692
512x512      2.701
513x513      0.649
1024x1024    12.8
1025x1025    10.1

I don't see any performance penalty in L1, but L2 clearly has the critical-stride problem, and maybe L3 does too. I'm not sure yet why L1 does not show a problem. It's possible there is some other source of background time (overhead) which dominates the L1 measurements.
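
For reference, a sketch of the kind of timing loop used (not the exact code from the linked question; the element type and repetition count here are my assumptions):

    #include <chrono>
    #include <cstdio>
    #include <utility>
    #include <vector>

    // Naive in-place transpose of an n x n row-major matrix.
    void transpose(std::vector<double>& m, int n) {
        for (int i = 0; i < n; ++i)
            for (int j = i + 1; j < n; ++j)
                std::swap(m[i * n + j], m[j * n + i]);
    }

    int main() {
        for (int n : {512, 513}) {
            std::vector<double> m(n * n, 1.0);
            auto t0 = std::chrono::steady_clock::now();
            for (int rep = 0; rep < 100; ++rep)  // arbitrary repeat count
                transpose(m, n);
            auto t1 = std::chrono::steady_clock::now();
            std::printf("%dx%d: %.3f s\n", n, n,
                        std::chrono::duration<double>(t1 - t0).count());
        }
        return 0;
    }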

1 Answer


    This statement:

    the level-2 cache cannot prefetch more than one line at a time.

    is incorrect.

    In fact, the L2 prefetchers are often stronger and more aggressive than the L1 prefetchers. It depends on the actual machine you use, but Intel's L2 prefetcher, for example, can trigger 2 prefetches for each request, while the L1 is usually limited (there are several types of prefetches that can coexist in the L1, but they are likely competing for a more limited bandwidth than the L2 has at its disposal, so there will probably be fewer prefetches coming out of the L1).

    Intel's optimization guide, in Section 2.2.5.4, lists the following prefetcher types:

    Two hardware prefetchers load data to the L1 DCache:
    - Data cache unit (DCU) prefetcher: This prefetcher, also known as the streaming prefetcher, is triggered by an ascending access to very recently loaded data. The processor assumes that this access is part of a streaming algorithm and automatically fetches the next line.
    - Instruction pointer (IP)-based stride prefetcher: This prefetcher keeps track of individual load instructions. If a load instruction is detected to have a regular stride, then a prefetch is sent to the next address which is the sum of the current address and the stride. This prefetcher can prefetch forward or backward and can detect strides of up to 2K bytes.
    
     Data Prefetch to the L2 and Last Level Cache:
     - Spatial Prefetcher: This prefetcher strives to complete every cache line fetched to  the L2 cache with the pair line that completes it to a 128-byte aligned chunk.
     - Streamer: This prefetcher monitors read requests from the L1 cache for ascending and descending sequences of addresses. Monitored read requests include L1 DCache requests initiated by load and store operations and by the hardware prefetchers, and L1 ICache requests for code fetch. When a forward or backward stream of requests is detected, the anticipated cache lines are prefetched. Prefetched cache lines must be in the same 4K page.
    

    And later on:

    ... The streamer may issue two prefetch requests on every L2 lookup. The streamer can run up to 20 lines ahead of the load request.
    

    Of the above, only the IP-based prefetcher can handle strides greater than one cache line (the streaming ones can deal with anything that uses consecutive cache lines, meaning up to a 64-byte stride, or actually up to 128 bytes if you don't mind fetching some extra lines). To use it, make sure that the loads/stores at a given address perform strided accesses; that is usually already the case in loops going over arrays. Compiler loop unrolling may split such a loop into multiple streams with different, larger strides, which would work even better (the lookahead would be larger), unless you exceed the number of outstanding IPs being tracked; again, that depends on the exact implementation.
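
    To illustrate that unrolling point, here is a sketch under the assumption that the compiler keeps four separate load sites: the single unit-stride stream below becomes four streams, each with a stride of 4 elements (32 bytes), which the IP-based prefetcher can track independently:

        // One load site, unit stride: a single stream for the prefetchers.
        double sum_rolled(const double* a, int n) {
            double s = 0.0;
            for (int i = 0; i < n; ++i) s += a[i];
            return s;
        }

        // Manually unrolled by 4: each of the four load sites now walks the
        // array with a stride of 4 doubles (32 bytes), giving the IP-based
        // prefetcher four independent strided streams to track.
        double sum_unrolled(const double* a, int n) {
            double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
            int i = 0;
            for (; i + 3 < n; i += 4) {
                s0 += a[i];
                s1 += a[i + 1];
                s2 += a[i + 2];
                s3 += a[i + 3];
            }
            for (; i < n; ++i) s0 += a[i];  // remainder
            return (s0 + s1) + (s2 + s3);
        }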

    However, if your access pattern does consist of consecutive lines, the L2 streamer is much more effective than the L1 one, since it runs ahead faster.
