首页 文章

如何实现每个周期4个FLOP的理论最大值?

提问于
浏览
570

如何在现代x86-64 Intel CPU上实现每个周期4个浮点运算(双精度)的理论峰值性能?

据我所知,在大多数现代Intel CPU上完成SSE add 需要三个周期,而 mul 需要五个周期(例如参见Agner Fog's 'Instruction Tables') . 由于流水线操作,如果算法至少有三个独立的求和,则每个周期的吞吐量可以达到一个 add . 由于打包 addpd 以及标量 addsd 版本和SSE寄存器都可以包含两个 double ,因此每个周期的吞吐量可能高达两个触发器 .

此外,似乎(虽然我没有看到任何适当的文档) addmul 可以并行执行,给出每周期四个触发器的理论最大吞吐量 .

但是,我无法用简单的C / C程序复制该性能 . 我最好的尝试导致大约2.7个翻牌/周期 . 如果任何人都可以贡献一个简单的C / C或汇编程序,它可以表现出非常高兴的峰值性能 .

我的尝试:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <sys/time.h>

double stoptime(void) {
   struct timeval t;
   gettimeofday(&t,NULL);
   return (double) t.tv_sec + t.tv_usec/1000000.0;
}

double addmul(double add, double mul, int ops){
   // Need to initialise differently otherwise compiler might optimise away
   double sum1=0.1, sum2=-0.1, sum3=0.2, sum4=-0.2, sum5=0.0;
   double mul1=1.0, mul2= 1.1, mul3=1.2, mul4= 1.3, mul5=1.4;
   int loops=ops/10;          // We have 10 floating point operations inside the loop
   double expected = 5.0*add*loops + (sum1+sum2+sum3+sum4+sum5)
               + pow(mul,loops)*(mul1+mul2+mul3+mul4+mul5);

   for (int i=0; i<loops; i++) {
      mul1*=mul; mul2*=mul; mul3*=mul; mul4*=mul; mul5*=mul;
      sum1+=add; sum2+=add; sum3+=add; sum4+=add; sum5+=add;
   }
   return  sum1+sum2+sum3+sum4+sum5+mul1+mul2+mul3+mul4+mul5 - expected;
}

int main(int argc, char** argv) {
   if (argc != 2) {
      printf("usage: %s <num>\n", argv[0]);
      printf("number of operations: <num> millions\n");
      exit(EXIT_FAILURE);
   }
   int n = atoi(argv[1]) * 1000000;
   if (n<=0)
       n=1000;

   double x = M_PI;
   double y = 1.0 + 1e-8;
   double t = stoptime();
   x = addmul(x, y, n);
   t = stoptime() - t;
   printf("addmul:\t %.3f s, %.3f Gflops, res=%f\n", t, (double)n/t/1e9, x);
   return EXIT_SUCCESS;
}

编译

g++ -O2 -march=native addmul.cpp ; ./a.out 1000

在英特尔酷睿i5-750,2.66 GHz上产生以下输出 .

addmul:  0.270 s, 3.707 Gflops, res=1.326463

也就是说,每个周期只有大约1.4个触发器 . 用 g++ -S -O2 -march=native -masm=intel addmul.cpp 查看汇编程序代码,主循环似乎对我来说是最优的:

.L4:
inc    eax
mulsd    xmm8, xmm3
mulsd    xmm7, xmm3
mulsd    xmm6, xmm3
mulsd    xmm5, xmm3
mulsd    xmm1, xmm3
addsd    xmm13, xmm2
addsd    xmm12, xmm2
addsd    xmm11, xmm2
addsd    xmm10, xmm2
addsd    xmm9, xmm2
cmp    eax, ebx
jne    .L4

使用打包版本( addpdmulpd )更改标量版本会使翻牌计数加倍,而不会改变执行时间,因此每个周期我只能获得2.8个翻牌 . 有一个简单的例子,每个周期实现四次触发吗?

神秘的小程序;这是我的结果(虽然运行了几秒钟):

  • gcc -O2 -march=nocona :5.6 Gflops超过10.66 Gflops(2.1个翻牌/周期)

  • cl /O2 ,openmp删除:10.1 Gflops超过10.66 Gflops(3.8个触发/周期)

这看起来有点复杂,但到目前为止我的结论:

  • gcc -O2 更改独立浮点运算的顺序,目的是在可能的情况下交替 addpdmulpd . 同样适用于 gcc-4.6.2 -O2 -march=core2 .

  • gcc -O2 -march=nocona 似乎保持C源中定义的浮点运算的顺序 .

  • cl /O2 ,来自SDK for Windows 7的64位编译器自动循环展开并且似乎尝试安排操作,以便三个 addpd 的组与三个 mulpd 交替(好吧,至少在我的系统上和我的简单程序) ) .

  • 我的Core i5 750Nahelem architecture)没有't like alternating add'和mul 's and seems unable to run both operations in parallel. However, if grouped in 3'它突然像魔法一样工作 .

  • 其他体系结构(可能是Sandy Bridge和其他体系结构)似乎能够并行执行add / mul而不会出现问题(如果它们在汇编代码中交替) .

  • 虽然难以承认,但在我的系统上 cl /O2 在我的系统的低级优化操作方面做得更好,并且在上面的小C示例中实现了接近峰值的性能 . 我测量的是1.85-2.01 flops / cycle(在Windows中使用了clock()并不精确 . 我猜,需要使用更好的计时器 - 感谢Mackie Messer) .

  • 我用 gcc 管理的最好的是手动循环展开并以三个为一组排列添加和乘法 . 随着 g++ -O2 -march=nocona addmul_unroll.cpp 我得到了最好的 0.207s, 4.825 Gflops ,这对应于1.8翻牌/周期,我现在很高兴 .

在C代码中,我用 for 循环替换了

for (int i=0; i<loops/3; i++) {
       mul1*=mul; mul2*=mul; mul3*=mul;
       sum1+=add; sum2+=add; sum3+=add;
       mul4*=mul; mul5*=mul; mul1*=mul;
       sum4+=add; sum5+=add; sum1+=add;

       mul2*=mul; mul3*=mul; mul4*=mul;
       sum2+=add; sum3+=add; sum4+=add;
       mul5*=mul; mul1*=mul; mul2*=mul;
       sum5+=add; sum1+=add; sum2+=add;

       mul3*=mul; mul4*=mul; mul5*=mul;
       sum3+=add; sum4+=add; sum5+=add;
   }

现在装配看起来像

.L4:
mulsd    xmm8, xmm3
mulsd    xmm7, xmm3
mulsd    xmm6, xmm3
addsd    xmm13, xmm2
addsd    xmm12, xmm2
addsd    xmm11, xmm2
mulsd    xmm5, xmm3
mulsd    xmm1, xmm3
mulsd    xmm8, xmm3
addsd    xmm10, xmm2
addsd    xmm9, xmm2
addsd    xmm13, xmm2
...

4 回答

  • 29

    人们经常忘记英特尔架构中的一点,调度端口在Int和FP / SIMD之间共享 . 这意味着在循环逻辑在浮点流中创建气泡之前,您将只获得一定数量的FP / SIMD突发 . 神秘主义者从他的代码中获得了更多的失败,因为他在展开的循环中使用了更长的步幅 .

    如果你看看这里的Nehalem / Sandy Bridge建筑http://www.realworldtech.com/page.cfm?ArticleID=RWT091810191937&p=6,很清楚会发生什么 .

    相比之下,由于INT和FP / SIMD管道具有独立的发布端口和自己的调度程序,因此在AMD(Bulldozer)上应该更容易达到峰值性能 .

    这只是理论上的,因为我没有这些处理器进行测试 .

  • 6

    我以前做过这个确切的任务 . 但它主要用于测量功耗和CPU温度 . 以下代码(相当长)在我的Core i7 2600K上实现了接近最佳状态 .

    这里要注意的关键是大量的手动循环展开以及乘法和交错的交错...

    完整的项目可以在我的GitHub上找到:https://github.com/Mysticial/Flops

    警告:

    If you decide to compile and run this, pay attention to your CPU temperatures!!!
    确保你不会影响你的结果!

    此外,对于运行此代码可能造成的任何损害,我不承担任何责任 .

    Notes:

    • 此代码针对x64进行了优化 . x86没有足够的寄存器来编译 .

    • 此代码已经过测试,可以在Visual Studio 2010/2012和GCC上正常运行4.6 .
      ICC 11(英特尔编译器11)令人惊讶地难以编译它 .

    • 这些用于预FMA处理器 . 为了在Intel Haswell和AMD Bulldozer处理器(及更高版本)上实现峰值FLOPS,将需要FMA(融合乘法加法)指令 . 这些超出了此基准的范围 .

    #include <emmintrin.h>
    #include <omp.h>
    #include <iostream>
    using namespace std;
    
    typedef unsigned long long uint64;
    
    double test_dp_mac_SSE(double x,double y,uint64 iterations){
        register __m128d r0,r1,r2,r3,r4,r5,r6,r7,r8,r9,rA,rB,rC,rD,rE,rF;
    
        //  Generate starting data.
        r0 = _mm_set1_pd(x);
        r1 = _mm_set1_pd(y);
    
        r8 = _mm_set1_pd(-0.0);
    
        r2 = _mm_xor_pd(r0,r8);
        r3 = _mm_or_pd(r0,r8);
        r4 = _mm_andnot_pd(r8,r0);
        r5 = _mm_mul_pd(r1,_mm_set1_pd(0.37796447300922722721));
        r6 = _mm_mul_pd(r1,_mm_set1_pd(0.24253562503633297352));
        r7 = _mm_mul_pd(r1,_mm_set1_pd(4.1231056256176605498));
        r8 = _mm_add_pd(r0,_mm_set1_pd(0.37796447300922722721));
        r9 = _mm_add_pd(r1,_mm_set1_pd(0.24253562503633297352));
        rA = _mm_sub_pd(r0,_mm_set1_pd(4.1231056256176605498));
        rB = _mm_sub_pd(r1,_mm_set1_pd(4.1231056256176605498));
    
        rC = _mm_set1_pd(1.4142135623730950488);
        rD = _mm_set1_pd(1.7320508075688772935);
        rE = _mm_set1_pd(0.57735026918962576451);
        rF = _mm_set1_pd(0.70710678118654752440);
    
        uint64 iMASK = 0x800fffffffffffffull;
        __m128d MASK = _mm_set1_pd(*(double*)&iMASK);
        __m128d vONE = _mm_set1_pd(1.0);
    
        uint64 c = 0;
        while (c < iterations){
            size_t i = 0;
            while (i < 1000){
                //  Here's the meat - the part that really matters.
    
                r0 = _mm_mul_pd(r0,rC);
                r1 = _mm_add_pd(r1,rD);
                r2 = _mm_mul_pd(r2,rE);
                r3 = _mm_sub_pd(r3,rF);
                r4 = _mm_mul_pd(r4,rC);
                r5 = _mm_add_pd(r5,rD);
                r6 = _mm_mul_pd(r6,rE);
                r7 = _mm_sub_pd(r7,rF);
                r8 = _mm_mul_pd(r8,rC);
                r9 = _mm_add_pd(r9,rD);
                rA = _mm_mul_pd(rA,rE);
                rB = _mm_sub_pd(rB,rF);
    
                r0 = _mm_add_pd(r0,rF);
                r1 = _mm_mul_pd(r1,rE);
                r2 = _mm_sub_pd(r2,rD);
                r3 = _mm_mul_pd(r3,rC);
                r4 = _mm_add_pd(r4,rF);
                r5 = _mm_mul_pd(r5,rE);
                r6 = _mm_sub_pd(r6,rD);
                r7 = _mm_mul_pd(r7,rC);
                r8 = _mm_add_pd(r8,rF);
                r9 = _mm_mul_pd(r9,rE);
                rA = _mm_sub_pd(rA,rD);
                rB = _mm_mul_pd(rB,rC);
    
                r0 = _mm_mul_pd(r0,rC);
                r1 = _mm_add_pd(r1,rD);
                r2 = _mm_mul_pd(r2,rE);
                r3 = _mm_sub_pd(r3,rF);
                r4 = _mm_mul_pd(r4,rC);
                r5 = _mm_add_pd(r5,rD);
                r6 = _mm_mul_pd(r6,rE);
                r7 = _mm_sub_pd(r7,rF);
                r8 = _mm_mul_pd(r8,rC);
                r9 = _mm_add_pd(r9,rD);
                rA = _mm_mul_pd(rA,rE);
                rB = _mm_sub_pd(rB,rF);
    
                r0 = _mm_add_pd(r0,rF);
                r1 = _mm_mul_pd(r1,rE);
                r2 = _mm_sub_pd(r2,rD);
                r3 = _mm_mul_pd(r3,rC);
                r4 = _mm_add_pd(r4,rF);
                r5 = _mm_mul_pd(r5,rE);
                r6 = _mm_sub_pd(r6,rD);
                r7 = _mm_mul_pd(r7,rC);
                r8 = _mm_add_pd(r8,rF);
                r9 = _mm_mul_pd(r9,rE);
                rA = _mm_sub_pd(rA,rD);
                rB = _mm_mul_pd(rB,rC);
    
                i++;
            }
    
            //  Need to renormalize to prevent denormal/overflow.
            r0 = _mm_and_pd(r0,MASK);
            r1 = _mm_and_pd(r1,MASK);
            r2 = _mm_and_pd(r2,MASK);
            r3 = _mm_and_pd(r3,MASK);
            r4 = _mm_and_pd(r4,MASK);
            r5 = _mm_and_pd(r5,MASK);
            r6 = _mm_and_pd(r6,MASK);
            r7 = _mm_and_pd(r7,MASK);
            r8 = _mm_and_pd(r8,MASK);
            r9 = _mm_and_pd(r9,MASK);
            rA = _mm_and_pd(rA,MASK);
            rB = _mm_and_pd(rB,MASK);
            r0 = _mm_or_pd(r0,vONE);
            r1 = _mm_or_pd(r1,vONE);
            r2 = _mm_or_pd(r2,vONE);
            r3 = _mm_or_pd(r3,vONE);
            r4 = _mm_or_pd(r4,vONE);
            r5 = _mm_or_pd(r5,vONE);
            r6 = _mm_or_pd(r6,vONE);
            r7 = _mm_or_pd(r7,vONE);
            r8 = _mm_or_pd(r8,vONE);
            r9 = _mm_or_pd(r9,vONE);
            rA = _mm_or_pd(rA,vONE);
            rB = _mm_or_pd(rB,vONE);
    
            c++;
        }
    
        r0 = _mm_add_pd(r0,r1);
        r2 = _mm_add_pd(r2,r3);
        r4 = _mm_add_pd(r4,r5);
        r6 = _mm_add_pd(r6,r7);
        r8 = _mm_add_pd(r8,r9);
        rA = _mm_add_pd(rA,rB);
    
        r0 = _mm_add_pd(r0,r2);
        r4 = _mm_add_pd(r4,r6);
        r8 = _mm_add_pd(r8,rA);
    
        r0 = _mm_add_pd(r0,r4);
        r0 = _mm_add_pd(r0,r8);
    
    
        //  Prevent Dead Code Elimination
        double out = 0;
        __m128d temp = r0;
        out += ((double*)&temp)[0];
        out += ((double*)&temp)[1];
    
        return out;
    }
    
    void test_dp_mac_SSE(int tds,uint64 iterations){
    
        double *sum = (double*)malloc(tds * sizeof(double));
        double start = omp_get_wtime();
    
    #pragma omp parallel num_threads(tds)
        {
            double ret = test_dp_mac_SSE(1.1,2.1,iterations);
            sum[omp_get_thread_num()] = ret;
        }
    
        double secs = omp_get_wtime() - start;
        uint64 ops = 48 * 1000 * iterations * tds * 2;
        cout << "Seconds = " << secs << endl;
        cout << "FP Ops  = " << ops << endl;
        cout << "FLOPs   = " << ops / secs << endl;
    
        double out = 0;
        int c = 0;
        while (c < tds){
            out += sum[c++];
        }
    
        cout << "sum = " << out << endl;
        cout << endl;
    
        free(sum);
    }
    
    int main(){
        //  (threads, iterations)
        test_dp_mac_SSE(8,10000000);
    
        system("pause");
    }
    

    Output (1 thread, 10000000 iterations) - Compiled with Visual Studio 2010 SP1 - x64 Release:

    Seconds = 55.5104
    FP Ops  = 960000000000
    FLOPs   = 1.7294e+010
    sum = 2.22652
    

    该机器是Core i7 2600K @ 4.4 GHz . 理论SSE峰值为4个触发器* 4.4 GHz = 17.6 GFlops . 这段代码达到了 17.3 GFlops - 不错 .

    Output (8 threads, 10000000 iterations) - Compiled with Visual Studio 2010 SP1 - x64 Release:

    Seconds = 117.202
    FP Ops  = 7680000000000
    FLOPs   = 6.55279e+010
    sum = 17.8122
    

    理论SSE峰值为4个触发* 4个核心* 4.4 GHz = 70.4 GFlops. 实际为 65.5 GFlops .


    让我们更进一步 . AVX ...

    #include <immintrin.h>
    #include <omp.h>
    #include <iostream>
    using namespace std;
    
    typedef unsigned long long uint64;
    
    double test_dp_mac_AVX(double x,double y,uint64 iterations){
        register __m256d r0,r1,r2,r3,r4,r5,r6,r7,r8,r9,rA,rB,rC,rD,rE,rF;
    
        //  Generate starting data.
        r0 = _mm256_set1_pd(x);
        r1 = _mm256_set1_pd(y);
    
        r8 = _mm256_set1_pd(-0.0);
    
        r2 = _mm256_xor_pd(r0,r8);
        r3 = _mm256_or_pd(r0,r8);
        r4 = _mm256_andnot_pd(r8,r0);
        r5 = _mm256_mul_pd(r1,_mm256_set1_pd(0.37796447300922722721));
        r6 = _mm256_mul_pd(r1,_mm256_set1_pd(0.24253562503633297352));
        r7 = _mm256_mul_pd(r1,_mm256_set1_pd(4.1231056256176605498));
        r8 = _mm256_add_pd(r0,_mm256_set1_pd(0.37796447300922722721));
        r9 = _mm256_add_pd(r1,_mm256_set1_pd(0.24253562503633297352));
        rA = _mm256_sub_pd(r0,_mm256_set1_pd(4.1231056256176605498));
        rB = _mm256_sub_pd(r1,_mm256_set1_pd(4.1231056256176605498));
    
        rC = _mm256_set1_pd(1.4142135623730950488);
        rD = _mm256_set1_pd(1.7320508075688772935);
        rE = _mm256_set1_pd(0.57735026918962576451);
        rF = _mm256_set1_pd(0.70710678118654752440);
    
        uint64 iMASK = 0x800fffffffffffffull;
        __m256d MASK = _mm256_set1_pd(*(double*)&iMASK);
        __m256d vONE = _mm256_set1_pd(1.0);
    
        uint64 c = 0;
        while (c < iterations){
            size_t i = 0;
            while (i < 1000){
                //  Here's the meat - the part that really matters.
    
                r0 = _mm256_mul_pd(r0,rC);
                r1 = _mm256_add_pd(r1,rD);
                r2 = _mm256_mul_pd(r2,rE);
                r3 = _mm256_sub_pd(r3,rF);
                r4 = _mm256_mul_pd(r4,rC);
                r5 = _mm256_add_pd(r5,rD);
                r6 = _mm256_mul_pd(r6,rE);
                r7 = _mm256_sub_pd(r7,rF);
                r8 = _mm256_mul_pd(r8,rC);
                r9 = _mm256_add_pd(r9,rD);
                rA = _mm256_mul_pd(rA,rE);
                rB = _mm256_sub_pd(rB,rF);
    
                r0 = _mm256_add_pd(r0,rF);
                r1 = _mm256_mul_pd(r1,rE);
                r2 = _mm256_sub_pd(r2,rD);
                r3 = _mm256_mul_pd(r3,rC);
                r4 = _mm256_add_pd(r4,rF);
                r5 = _mm256_mul_pd(r5,rE);
                r6 = _mm256_sub_pd(r6,rD);
                r7 = _mm256_mul_pd(r7,rC);
                r8 = _mm256_add_pd(r8,rF);
                r9 = _mm256_mul_pd(r9,rE);
                rA = _mm256_sub_pd(rA,rD);
                rB = _mm256_mul_pd(rB,rC);
    
                r0 = _mm256_mul_pd(r0,rC);
                r1 = _mm256_add_pd(r1,rD);
                r2 = _mm256_mul_pd(r2,rE);
                r3 = _mm256_sub_pd(r3,rF);
                r4 = _mm256_mul_pd(r4,rC);
                r5 = _mm256_add_pd(r5,rD);
                r6 = _mm256_mul_pd(r6,rE);
                r7 = _mm256_sub_pd(r7,rF);
                r8 = _mm256_mul_pd(r8,rC);
                r9 = _mm256_add_pd(r9,rD);
                rA = _mm256_mul_pd(rA,rE);
                rB = _mm256_sub_pd(rB,rF);
    
                r0 = _mm256_add_pd(r0,rF);
                r1 = _mm256_mul_pd(r1,rE);
                r2 = _mm256_sub_pd(r2,rD);
                r3 = _mm256_mul_pd(r3,rC);
                r4 = _mm256_add_pd(r4,rF);
                r5 = _mm256_mul_pd(r5,rE);
                r6 = _mm256_sub_pd(r6,rD);
                r7 = _mm256_mul_pd(r7,rC);
                r8 = _mm256_add_pd(r8,rF);
                r9 = _mm256_mul_pd(r9,rE);
                rA = _mm256_sub_pd(rA,rD);
                rB = _mm256_mul_pd(rB,rC);
    
                i++;
            }
    
            //  Need to renormalize to prevent denormal/overflow.
            r0 = _mm256_and_pd(r0,MASK);
            r1 = _mm256_and_pd(r1,MASK);
            r2 = _mm256_and_pd(r2,MASK);
            r3 = _mm256_and_pd(r3,MASK);
            r4 = _mm256_and_pd(r4,MASK);
            r5 = _mm256_and_pd(r5,MASK);
            r6 = _mm256_and_pd(r6,MASK);
            r7 = _mm256_and_pd(r7,MASK);
            r8 = _mm256_and_pd(r8,MASK);
            r9 = _mm256_and_pd(r9,MASK);
            rA = _mm256_and_pd(rA,MASK);
            rB = _mm256_and_pd(rB,MASK);
            r0 = _mm256_or_pd(r0,vONE);
            r1 = _mm256_or_pd(r1,vONE);
            r2 = _mm256_or_pd(r2,vONE);
            r3 = _mm256_or_pd(r3,vONE);
            r4 = _mm256_or_pd(r4,vONE);
            r5 = _mm256_or_pd(r5,vONE);
            r6 = _mm256_or_pd(r6,vONE);
            r7 = _mm256_or_pd(r7,vONE);
            r8 = _mm256_or_pd(r8,vONE);
            r9 = _mm256_or_pd(r9,vONE);
            rA = _mm256_or_pd(rA,vONE);
            rB = _mm256_or_pd(rB,vONE);
    
            c++;
        }
    
        r0 = _mm256_add_pd(r0,r1);
        r2 = _mm256_add_pd(r2,r3);
        r4 = _mm256_add_pd(r4,r5);
        r6 = _mm256_add_pd(r6,r7);
        r8 = _mm256_add_pd(r8,r9);
        rA = _mm256_add_pd(rA,rB);
    
        r0 = _mm256_add_pd(r0,r2);
        r4 = _mm256_add_pd(r4,r6);
        r8 = _mm256_add_pd(r8,rA);
    
        r0 = _mm256_add_pd(r0,r4);
        r0 = _mm256_add_pd(r0,r8);
    
        //  Prevent Dead Code Elimination
        double out = 0;
        __m256d temp = r0;
        out += ((double*)&temp)[0];
        out += ((double*)&temp)[1];
        out += ((double*)&temp)[2];
        out += ((double*)&temp)[3];
    
        return out;
    }
    
    void test_dp_mac_AVX(int tds,uint64 iterations){
    
        double *sum = (double*)malloc(tds * sizeof(double));
        double start = omp_get_wtime();
    
    #pragma omp parallel num_threads(tds)
        {
            double ret = test_dp_mac_AVX(1.1,2.1,iterations);
            sum[omp_get_thread_num()] = ret;
        }
    
        double secs = omp_get_wtime() - start;
        uint64 ops = 48 * 1000 * iterations * tds * 4;
        cout << "Seconds = " << secs << endl;
        cout << "FP Ops  = " << ops << endl;
        cout << "FLOPs   = " << ops / secs << endl;
    
        double out = 0;
        int c = 0;
        while (c < tds){
            out += sum[c++];
        }
    
        cout << "sum = " << out << endl;
        cout << endl;
    
        free(sum);
    }
    
    int main(){
        //  (threads, iterations)
        test_dp_mac_AVX(8,10000000);
    
        system("pause");
    }
    

    Output (1 thread, 10000000 iterations) - Compiled with Visual Studio 2010 SP1 - x64 Release:

    Seconds = 57.4679
    FP Ops  = 1920000000000
    FLOPs   = 3.34099e+010
    sum = 4.45305
    

    理论AVX峰值为8个触发器* 4.4 GHz = 35.2 GFlops . 实际是 33.4 GFlops .

    Output (8 threads, 10000000 iterations) - Compiled with Visual Studio 2010 SP1 - x64 Release:

    Seconds = 111.119
    FP Ops  = 15360000000000
    FLOPs   = 1.3823e+011
    sum = 35.6244
    

    理论AVX峰值为8个触发* 4个核心* 4.4 GHz = 140.8 GFlops. 实际为 138.2 GFlops .


    Now for some explanations:

    性能关键部分显然是内循环内的48条指令 . 您会注意到它分为4个块,每个块有12个指令 . 这12个指令块中的每一个都完全相互独立 - 平均需要6个周期才能执行 .

    因此,在使用问题之间有12条指令和6个循环 . 乘法的延迟是5个周期,因此它足以避免延迟停顿 .

    需要规范化步骤以防止数据上溢/下溢 . 这是必需的,因为无操作代码将缓慢地增加/减少数据的大小 .

    因此,如果你只使用全零并且摆脱标准化步骤,那么它实际上可能做得比这更好 . 但是,由于我编写了测量功耗和温度的基准测试,因为执行单元可能会对零使用特殊的情况处理,因为它使用更少的功率并产生更少的热量 .


    更多结果:

    • Intel Core i7 920 @ 3.5 GHz

    • Windows 7旗舰版x64

    • Visual Studio 2010 SP1 - x64发行版

    Threads: 1

    Seconds = 72.1116
    FP Ops  = 960000000000
    FLOPs   = 1.33127e+010
    sum = 2.22652
    

    理论SSE峰值:4个触发器* 3.5 GHz = 14.0 GFlops . 实际是 13.3 GFlops .

    Threads: 8

    Seconds = 149.576
    FP Ops  = 7680000000000
    FLOPs   = 5.13452e+010
    sum = 17.8122
    

    理论SSE峰值:4个触发器* 4个核心* 3.5 GHz = 56.0 GFlops . 实际是 51.3 GFlops .

    My processor temps hit 76C on the multi-threaded run! If you runs these, be sure the results aren't affected by CPU throttling.


    • 2 x Intel Xeon X5482 Harpertown @ 3.2 GHz

    • Ubuntu Linux 10 x64

    • GCC 4.5.2 x64 - (-O2 -msse3 -fopenmp)

    Threads: 1

    Seconds = 78.3357
    FP Ops  = 960000000000
    FLOPs   = 1.22549e+10
    sum = 2.22652
    

    理论SSE峰值:4个触发器* 3.2 GHz = 12.8 GFlops . 实际是 12.3 GFlops .

    Threads: 8

    Seconds = 78.4733
    FP Ops  = 7680000000000
    FLOPs   = 9.78676e+10
    sum = 17.8122
    

    理论SSE峰值:4个触发器* 8个核心* 3.2 GHz = 102.4 GFlops . 实际是 97.9 GFlops .

  • 15

    分支绝对可以阻止您维持峰值理论性能 . 如果您手动进行一些循环展开,您是否看到了差异?例如,如果每次循环迭代放置5到10倍的ops:

    for(int i=0; i<loops/5; i++) {
          mul1*=mul; mul2*=mul; mul3*=mul; mul4*=mul; mul5*=mul;
          sum1+=add; sum2+=add; sum3+=add; sum4+=add; sum5+=add;
          mul1*=mul; mul2*=mul; mul3*=mul; mul4*=mul; mul5*=mul;
          sum1+=add; sum2+=add; sum3+=add; sum4+=add; sum5+=add;
          mul1*=mul; mul2*=mul; mul3*=mul; mul4*=mul; mul5*=mul;
          sum1+=add; sum2+=add; sum3+=add; sum4+=add; sum5+=add;
          mul1*=mul; mul2*=mul; mul3*=mul; mul4*=mul; mul5*=mul;
          sum1+=add; sum2+=add; sum3+=add; sum4+=add; sum5+=add;
          mul1*=mul; mul2*=mul; mul3*=mul; mul4*=mul; mul5*=mul;
          sum1+=add; sum2+=add; sum3+=add; sum4+=add; sum5+=add;
       }
    
  • 461

    在2.4GHz Intel Core 2 Duo上使用Intels icc Version 11.1

    Macintosh:~ mackie$ icc -O3 -mssse3 -oaddmul addmul.cc && ./addmul 1000
    addmul:  0.105 s, 9.525 Gflops, res=0.000000
    Macintosh:~ mackie$ icc -v
    Version 11.1
    

    这非常接近理想的9.6 Gflops .

    编辑:

    哎呀,看看汇编代码,似乎icc不仅可以对乘法进行矢量化,而且还可以将循环中的附加物拉出来 . 强制更严格的fp语义,代码不再向量化:

    Macintosh:~ mackie$ icc -O3 -mssse3 -oaddmul addmul.cc -fp-model precise && ./addmul 1000
    addmul:  0.516 s, 1.938 Gflops, res=1.326463
    

    EDIT2:

    按照要求:

    Macintosh:~ mackie$ clang -O3 -mssse3 -oaddmul addmul.cc && ./addmul 1000
    addmul:  0.209 s, 4.786 Gflops, res=1.326463
    Macintosh:~ mackie$ clang -v
    Apple clang version 3.0 (tags/Apple/clang-211.10.1) (based on LLVM 3.0svn)
    Target: x86_64-apple-darwin11.2.0
    Thread model: posix
    

    clang代码的内部循环如下所示:

    .align  4, 0x90
    LBB2_4:                                 ## =>This Inner Loop Header: Depth=1
            addsd   %xmm2, %xmm3
            addsd   %xmm2, %xmm14
            addsd   %xmm2, %xmm5
            addsd   %xmm2, %xmm1
            addsd   %xmm2, %xmm4
            mulsd   %xmm2, %xmm0
            mulsd   %xmm2, %xmm6
            mulsd   %xmm2, %xmm7
            mulsd   %xmm2, %xmm11
            mulsd   %xmm2, %xmm13
            incl    %eax
            cmpl    %r14d, %eax
            jl      LBB2_4
    

    EDIT3:

    最后,有两点建议:首先,如果您喜欢这种类型的基准测试,请考虑使用 rdtsc 指令而不是 gettimeofday(2) . 它更加准确,可以提供周期时间,这通常是您感兴趣的 . 对于gcc和朋友,你可以这样定义:

    #include <stdint.h>
    
    static __inline__ uint64_t rdtsc(void)
    {
            uint64_t rval;
            __asm__ volatile ("rdtsc" : "=A" (rval));
            return rval;
    }
    

    其次,您应该多次运行基准程序并仅使用最佳性能 . 在现代操作系统中,许多事情并行发生,cpu可能处于低频省电模式等 . 重复运行程序会给你一个更接近理想情况的结果 .

相关问题