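I am trying to use perf and ocperf to find bottlenecks in my code. If I run a "detailed stat" on my binary, two statistics are reported in red text, which I assume means they are too high.
L1-dcache-load-misses is in red, at 28.60%
iTLB-load-misses is in red, at 425.89%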
# ~bram/src/pmu-tools/ocperf.py stat -d -d -d -d -d ./bench ray
perf stat -d -d -d -d -d ./bench ray
Loaded 455 primitives.
Testing ray against 455 primitives.
Performance counter stats for './bench ray':
9031.444612 task-clock (msec) # 1.000 CPUs utilized
15 context-switches # 0.002 K/sec
0 cpu-migrations # 0.000 K/sec
292 page-faults # 0.032 K/sec
28,786,063,163 cycles # 3.187 GHz (61.47%)
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
55,742,952,563 instructions # 1.94 insns per cycle (69.18%)
3,717,242,560 branches # 411.589 M/sec (69.18%)
18,097,580 branch-misses # 0.49% of all branches (69.18%)
10,230,376,136 L1-dcache-loads # 1132.751 M/sec (69.17%)
2,926,349,754 L1-dcache-load-misses # 28.60% of all L1-dcache hits (69.21%)
145,843,523 LLC-loads # 16.148 M/sec (69.32%)
49,512 LLC-load-misses # 0.07% of all LL-cache hits (69.33%)
<not supported> L1-icache-loads
260,144 L1-icache-load-misses # 0.029 M/sec (69.34%)
10,230,376,830 dTLB-loads # 1132.751 M/sec (69.34%)
1,197 dTLB-load-misses # 0.00% of all dTLB cache hits (61.59%)
2,294 iTLB-loads # 0.254 K/sec (61.55%)
9,770 iTLB-load-misses # 425.89% of all iTLB cache hits (61.51%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
9.032234014 seconds time elapsed
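As a sanity check, the red percentages can be reproduced from the raw counts above. The sketch below assumes each percentage is simply the miss counter divided by the matching load counter (so the "of all ... hits" label is loose wording); that is my inference from the numbers, and the helper name `miss_percent` is made up for illustration. The counts are copied from the output as printed, ignoring the multiplexing percentages in the last column.

```python
def miss_percent(misses: int, loads: int) -> float:
    """Return misses as a percentage of loads (assumed to be how perf derives the figure)."""
    return misses / loads * 100.0


# Raw counts copied from the perf stat output above.
l1d_loads, l1d_misses = 10_230_376_136, 2_926_349_754
itlb_loads, itlb_misses = 2_294, 9_770
branches, branch_misses = 3_717_242_560, 18_097_580

print(f"L1-dcache: {miss_percent(l1d_misses, l1d_loads):.2f}%")
print(f"iTLB:      {miss_percent(itlb_misses, itlb_loads):.2f}%")
print(f"branches:  {miss_percent(branch_misses, branches):.2f}%")
```

Under that assumption, a value above 100% just means the miss counter recorded more events than the load counter, which is what the iTLB line shows (9,770 misses against 2,294 loads).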
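My questions:
- What is a reasonable number for L1 data cache misses?
- What is a reasonable number for iTLB-load-misses?
- Why can iTLB-load-misses exceed 100%? In other words: why is the number of iTLB load misses higher than the number of iTLB loads? I have even seen it spike as high as 568%.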
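Also, my machine has a Haswell CPU. I would have expected the stalled-cycles statistics to be included, yet they are reported as not supported. Why is that?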