Why is std::pair faster than std::tuple?

Here is the test code.

Tuple test:

#include <tuple>
#include <vector>

using namespace std;

int main(){

    vector<tuple<int,int>> v;


    for (int var = 0; var < 100000000; ++var) {
        v.push_back(make_tuple(var, var));
    }
}

Pair test:

#include <utility>
#include <vector>

using namespace std;

int main(){

    vector<pair<int,int>> v;


    for (int var = 0; var < 100000000; ++var) {
        v.push_back(make_pair(var, var));
    }
}

I measured the times with the Linux time command. The results are:

|       |   -O0   |    -O2   |
|:------|:-------:|:--------:|
| Pair  |   8.9 s |  1.60 s  |
| Tuple |  19.8 s |  1.96 s  |

I would like to know why there is such a big difference between these two data structures at -O0, since they should be very similar. At -O2 there is only a small difference.

Why is the difference so big at -O0, and why is there any difference at all?

Edit:

Code with v.resize():

Pair:

#include <utility>
#include <vector>

using namespace std;

int main(){

    vector<pair<int,int>> v;

    v.resize(100000000);

    for (int var = 0; var < 100000000; ++var) {
        v[var] = make_pair(var, var);
    }
}

Tuple:

#include <tuple>
#include <vector>

using namespace std;

int main(){

    vector<tuple<int,int>> v;

    v.resize(100000000);

    for (int var = 0; var < 100000000; ++var) {
        v[var] = make_tuple(var, var);
    }
}

Results:

|       |   -O0   |    -O2   |
|:------|:-------:|:--------:|
| Pair  |  5.01 s |  0.77 s  |
| Tuple |  10.6 s |  0.87 s  |

Edit:

My system:

g++ (GCC) 4.8.3 20140911 (Red Hat 4.8.3-7)
GLIBCXX_3.4.19

2 Answers

  • 63

    You are missing some important information: Which compiler do you use? What do you use to measure the performance of your microbenchmark? Which standard library implementation do you use?

    My system:

    g++ (GCC) 4.9.1 20140903 (prerelease)
    GLIBCXX_3.4.20
    

    Anyway, I ran your examples, but I reserved the proper size for the vectors first to get rid of the memory allocation overhead. With that, funnily enough, I observe the opposite of what you see.
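
    A minimal sketch of what I assume the pre-reserved pair variant looks like (reserve(100000000) is my reading of "the proper size"; the tuple variant is analogous with make_tuple):

    #include <utility>
    #include <vector>

    using namespace std;

    int main() {
        vector<pair<int, int>> v;
        v.reserve(100000000);                    // pre-allocate, so the loop does no reallocation
        for (int var = 0; var < 100000000; ++var) {
            v.push_back(make_pair(var, var));    // same fill loop as in the question
        }
    }

    With that change, the measurements for pair: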

    g++ -std=c++11 -O2 pair.cpp -o pair
    perf stat -r 10 -d ./pair
    Performance counter stats for './pair' (10 runs):
    
          1647.045151      task-clock:HG (msec)      #    0.993 CPUs utilized            ( +-  1.94% )
                  346      context-switches:HG       #    0.210 K/sec                    ( +- 40.13% )
                    7      cpu-migrations:HG         #    0.004 K/sec                    ( +- 22.01% )
              182,978      page-faults:HG            #    0.111 M/sec                    ( +-  0.04% )
        3,394,685,602      cycles:HG                 #    2.061 GHz                      ( +-  2.24% ) [44.38%]
        2,478,474,676      stalled-cycles-frontend:HG #   73.01% frontend cycles idle     ( +-  1.24% ) [44.55%]
        1,550,747,174      stalled-cycles-backend:HG #   45.68% backend  cycles idle     ( +-  1.60% ) [44.66%]
        2,837,484,461      instructions:HG           #    0.84  insns per cycle        
                                                      #    0.87  stalled cycles per insn  ( +-  4.86% ) [55.78%]
          526,077,681      branches:HG               #  319.407 M/sec                    ( +-  4.52% ) [55.82%]
              829,623      branch-misses:HG          #    0.16% of all branches          ( +-  4.42% ) [55.74%]
          594,396,822      L1-dcache-loads:HG        #  360.887 M/sec                    ( +-  4.74% ) [55.59%]
            20,842,113      L1-dcache-load-misses:HG  #    3.51% of all L1-dcache hits    ( +-  0.68% ) [55.46%]
            5,474,166      LLC-loads:HG              #    3.324 M/sec                    ( +-  1.81% ) [44.23%]
      <not supported>      LLC-load-misses:HG       
    
          1.658671368 seconds time elapsed                                          ( +-  1.82% )
    

    compared to:

    g++ -std=c++11 -O2 tuple.cpp -o tuple
    perf stat -r 10 -d ./tuple
    Performance counter stats for './tuple' (10 runs):
    
            996.090514      task-clock:HG (msec)      #    0.996 CPUs utilized            ( +-  2.41% )
                  102      context-switches:HG       #    0.102 K/sec                    ( +- 64.61% )
                    4      cpu-migrations:HG         #    0.004 K/sec                    ( +- 32.24% )
              181,701      page-faults:HG            #    0.182 M/sec                    ( +-  0.06% )
        2,052,505,223      cycles:HG                 #    2.061 GHz                      ( +-  2.22% ) [44.45%]
        1,212,930,513      stalled-cycles-frontend:HG #   59.10% frontend cycles idle     ( +-  2.94% ) [44.56%]
          621,104,447      stalled-cycles-backend:HG #   30.26% backend  cycles idle     ( +-  3.48% ) [44.69%]
        2,700,410,991      instructions:HG           #    1.32  insns per cycle        
                                                      #    0.45  stalled cycles per insn  ( +-  1.66% ) [55.94%]
          486,476,408      branches:HG               #  488.386 M/sec                    ( +-  1.70% ) [55.96%]
              959,651      branch-misses:HG          #    0.20% of all branches          ( +-  4.78% ) [55.82%]
          547,000,119      L1-dcache-loads:HG        #  549.147 M/sec                    ( +-  2.19% ) [55.67%]
            21,540,926      L1-dcache-load-misses:HG  #    3.94% of all L1-dcache hits    ( +-  2.73% ) [55.43%]
            5,751,650      LLC-loads:HG              #    5.774 M/sec                    ( +-  3.60% ) [44.21%]
      <not supported>      LLC-load-misses:HG       
    
          1.000126894 seconds time elapsed                                          ( +-  2.47% )
    

    As you can see, in my case the reason is many more stalled cycles, both in the frontend and in the backend.

    Now where does this come from? I bet it comes down to some failed inlining, similar to what is explained here: std::vector performance regression when enabling C++11

    And indeed, enabling -flto equalizes the results for me:

    Performance counter stats for './pair' (10 runs):
    
          1021.922944      task-clock:HG (msec)      #    0.997 CPUs utilized            ( +-  1.15% )
                    63      context-switches:HG       #    0.062 K/sec                    ( +- 77.23% )
                    5      cpu-migrations:HG         #    0.005 K/sec                    ( +- 34.21% )
              195,396      page-faults:HG            #    0.191 M/sec                    ( +-  0.00% )
        2,109,877,147      cycles:HG                 #    2.065 GHz                      ( +-  0.92% ) [44.33%]
        1,098,031,078      stalled-cycles-frontend:HG #   52.04% frontend cycles idle     ( +-  0.93% ) [44.46%]
          701,553,535      stalled-cycles-backend:HG #   33.25% backend  cycles idle     ( +-  1.09% ) [44.68%]
        3,288,420,630      instructions:HG           #    1.56  insns per cycle        
                                                      #    0.33  stalled cycles per insn  ( +-  0.88% ) [55.89%]
          672,941,736      branches:HG               #  658.505 M/sec                    ( +-  0.80% ) [56.00%]
              660,278      branch-misses:HG          #    0.10% of all branches          ( +-  2.05% ) [55.93%]
          474,314,267      L1-dcache-loads:HG        #  464.139 M/sec                    ( +-  1.32% ) [55.73%]
            19,481,787      L1-dcache-load-misses:HG  #    4.11% of all L1-dcache hits    ( +-  0.80% ) [55.51%]
            5,155,678      LLC-loads:HG              #    5.045 M/sec                    ( +-  1.69% ) [44.21%]
      <not supported>      LLC-load-misses:HG       
    
          1.025083895 seconds time elapsed                                          ( +-  1.03% )
    

    and for tuple:

    Performance counter stats for './tuple' (10 runs):
    
          1018.980969      task-clock:HG (msec)      #    0.999 CPUs utilized            ( +-  0.47% )
                    8      context-switches:HG       #    0.008 K/sec                    ( +- 29.74% )
                    3      cpu-migrations:HG         #    0.003 K/sec                    ( +- 42.64% )
              195,396      page-faults:HG            #    0.192 M/sec                    ( +-  0.00% )
        2,103,574,740      cycles:HG                 #    2.064 GHz                      ( +-  0.30% ) [44.28%]
        1,088,827,212      stalled-cycles-frontend:HG #   51.76% frontend cycles idle     ( +-  0.47% ) [44.56%]
          697,438,071      stalled-cycles-backend:HG #   33.15% backend  cycles idle     ( +-  0.41% ) [44.76%]
        3,305,631,646      instructions:HG           #    1.57  insns per cycle        
                                                      #    0.33  stalled cycles per insn  ( +-  0.21% ) [55.94%]
          675,175,757      branches:HG               #  662.599 M/sec                    ( +-  0.16% ) [56.02%]
              656,205      branch-misses:HG          #    0.10% of all branches          ( +-  0.98% ) [55.93%]
          475,532,976      L1-dcache-loads:HG        #  466.675 M/sec                    ( +-  0.13% ) [55.69%]
            19,430,992      L1-dcache-load-misses:HG  #    4.09% of all L1-dcache hits    ( +-  0.20% ) [55.49%]
            5,161,624      LLC-loads:HG              #    5.065 M/sec                    ( +-  0.47% ) [44.14%]
      <not supported>      LLC-load-misses:HG       
    
          1.020225388 seconds time elapsed                                          ( +-  0.48% )
    

    So remember: -flto is your friend, and failed inlining can have extreme results on heavily templated code. Use perf stat to find out what is going on.

  • 35

    milianw's answer does not address -O0 versus -O2, so I would like to add an explanation for that.

    It is entirely expected that std::tuple is slower than std::pair when optimizations are disabled, because it is a more complicated object. A pair has exactly two members, so its methods are straightforward to define. A tuple, however, has an arbitrary number of members, and the only way to iterate over a template argument list is with recursion. Hence most of tuple's functions handle one member and then recurse to handle the rest, so for a 2-tuple you get twice the function calls.
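
    As an illustration of that recursion, here is a minimal hand-rolled sketch of the idea (not the actual libstdc++ implementation): assignment handles the first member and then recurses into the rest.

    // Sketch only: every member adds one more level of trivial calls at -O0.
    template <typename... Ts>
    struct my_tuple {};                  // empty tuple: the recursion terminates here

    template <typename T, typename... Ts>
    struct my_tuple<T, Ts...> {
        T head;                          // the first member
        my_tuple<Ts...> tail;            // the remaining members, stored recursively

        my_tuple& operator=(const my_tuple& other) {
            head = other.head;           // handle one member...
            tail = other.tail;           // ...then recurse for the rest (another call at -O0)
            return *this;
        }
    };

    int main() {
        my_tuple<int, int> a{1, {2, {}}};
        my_tuple<int, int> b{};
        b = a;   // at -O0: one operator= call per member level; at -O2 the chain is inlined away
    }

    A pair assignment, by contrast, touches both members inside a single function, so there is no call chain to pay for when nothing gets inlined.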

    Now, when the code is optimized, the compiler inlines that recursion and there should be no significant difference, which your tests nicely confirm. This holds for heavily templated code in general: templates can be written so that they provide abstractions with little or no runtime overhead, but that relies on the optimizer inlining all the trivial functions.
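
    A tiny, self-contained example of that general point (my own sketch, not taken from the answer): the templated wrapper below adds a layer of trivial calls that costs something at -O0, while with optimization the accessor is expected to be inlined so that both loops compile to essentially the same code.

    template <typename T>
    struct boxed {
        T value;
        const T& get() const { return value; }   // trivial forwarding function, a real call only at -O0
    };

    int main() {
        boxed<int> data[1000];
        int raw[1000];
        for (int i = 0; i < 1000; ++i) { data[i].value = i; raw[i] = i; }

        long sum_boxed = 0, sum_raw = 0;
        for (int i = 0; i < 1000; ++i) sum_boxed += data[i].get();   // goes through the abstraction
        for (int i = 0; i < 1000; ++i) sum_raw   += raw[i];          // no abstraction layer
        return sum_boxed == sum_raw ? 0 : 1;      // 0: both sums are identical
    }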
