
Why does using OpenMP make my code run slower?


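I have the following code, which uses OpenMP to parallelize a Monte Carlo method. My question is why the serial version of the code (monte_carlo_serial) runs so much faster than the parallel version (monte_carlo_parallel). I am running the code on a machine with 32 cores, and it prints the following to the console: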

-bash-4.1 $ gcc -fopenmp hello.c;
-bash-4.1 $ ./a.out
Pi (Serial): 3.140856
Time taken 0 seconds 50 milliseconds
Pi (Parallel): 3.132103
Time taken 127 seconds 990 milliseconds

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
#include <time.h>

int niter = 1000000;            //number of iterations per FOR loop

int monte_carlo_parallel() {
  double x,y;                     //x,y value for the random coordinate
  int i;                          //loop counter
  int count=0;                //Count holds all the number of how many good coordinates
  double z;                       //Used to check if x^2+y^2<=1
  double pi;                      //holds approx value of pi
  int numthreads = 32;

#pragma omp parallel firstprivate(x, y, z, i) reduction(+:count) num_threads(numthreads)
  {
    srand48((int)time(NULL) ^ omp_get_thread_num());    //Give random() a seed value
    for (i=0; i<niter; ++i)                 //main loop
      {
        x = (double)drand48();              //gets a random x coordinate
        y = (double)drand48();              //gets a random y coordinate
        z = ((x*x)+(y*y));              //Checks to see if number is inside unit circle
        if (z<=1)
          {
            ++count;                //if it is, consider it a valid random point
          }
      }
  }

  pi = ((double)count/(double)(niter*numthreads))*4.0;
  printf("Pi (Parallel): %f\n", pi);
  return 0;
}

int monte_carlo_serial(){
  double x,y;                     //x,y value for the random coordinate
  int i;                          //loop counter
  int count=0;                //Count holds all the number of how many good coordinates
  double z;                       //Used to check if x^2+y^2<=1
  double pi;                      //holds approx value of pi

  srand48((int)time(NULL) ^ omp_get_thread_num());  //Give random() a seed value

  for (i=0; i<niter; ++i)                   //main loop
    {
      x = (double)drand48();                //gets a random x coordinate
      y = (double)drand48();                //gets a random y coordinate
      z = ((x*x)+(y*y));                //Checks to see if number is inside unit circle
      if (z<=1)
        {
          ++count;              //if it is, consider it a valid random point
        }
    }

  pi = ((double)count/(double)(niter))*4.0;
  printf("Pi (Serial): %f\n", pi);

  return 0;
}


void main(){
  clock_t start = clock(), diff;

  monte_carlo_serial();

  diff = clock() - start;
  int msec = diff * 1000 / CLOCKS_PER_SEC;
  printf("Time taken %d seconds %d milliseconds \n", msec/1000, msec%1000);



  start = clock(), diff;

  monte_carlo_parallel();

  diff = clock() - start;
  msec = diff * 1000 / CLOCKS_PER_SEC;
  printf("Time taken %d seconds %d milliseconds \n", msec/1000, msec%1000);

}

1 Answer

  • -3

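    The variable count is shared across all of your spawned threads. Each of them has to lock count in order to increment it. Furthermore, if the threads run on separate CPUs (and there is no possible gain if they do not), there is also the cost of sending the value of count from one core to another and back again.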

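    This is a textbook example of false sharing. In the serial version, count sits in a register and costs 1 cycle to increment. In the parallel version it will usually not be in cache: you have to tell the other cores to invalidate that cache line, fetch it (an L3 access alone costs around 66 cycles), increment it, and store it back. Every time count migrates from one CPU core to another, you pay a minimum of roughly 125 cycles, which is far worse than 1. The threads can never run in parallel, because they all depend on count.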

    Try modifying your code so that each thread has its own count, then add up the counts from all threads at the end, and you /might/ see a speedup.
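
    A minimal sketch of the per-thread-count idea, under stated assumptions. The function name estimate_pi_parallel and the variables hits, samples, and local_hits are hypothetical and do not appear in the question. In OpenMP, the question's reduction(+:count) clause already gives each thread a private copy of count, so this sketch additionally swaps drand48(), which updates one hidden global state shared by every thread, for erand48() with per-thread state. That swap is an assumption of this sketch, not something the answer proposes.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 600  /* makes erand48() visible under strict -std= modes */
#endif
#include <assert.h>
#include <stdlib.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper (not from the question): estimates pi using up to
   nthreads threads, each drawing iters_per_thread random points. */
double estimate_pi_parallel(long iters_per_thread, int nthreads)
{
    assert(iters_per_thread > 0 && nthreads > 0);

    long hits = 0;     /* points that fell inside the unit circle */
    long samples = 0;  /* points actually drawn, summed over all threads */
    unsigned long seed = (unsigned long)time(NULL);

#pragma omp parallel num_threads(nthreads) reduction(+:hits, samples)
    {
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        /* Per-thread generator state (an assumption of this sketch), so
           threads do not contend for drand48()'s hidden global state. */
        unsigned short state[3] = {
            (unsigned short)(seed & 0xFFFFu),
            (unsigned short)((seed >> 16) & 0xFFFFu),
            (unsigned short)(tid + 1)
        };
        long local_hits = 0;  /* this thread's own count */

        for (long i = 0; i < iters_per_thread; ++i) {
            double x = erand48(state);
            double y = erand48(state);
            if (x * x + y * y <= 1.0)
                ++local_hits;
        }

        /* One update per thread; the reduction merges the private copies. */
        hits += local_hits;
        samples += iters_per_thread;
    }

    /* Divide by the points actually drawn, in case the runtime provided
       fewer threads than requested. */
    return 4.0 * (double)hits / (double)samples;
}
```

    Calling estimate_pi_parallel(1000000, 32) mirrors the question's setup when built with gcc -fopenmp. When comparing timings, keep in mind that on POSIX systems clock() reports CPU time for the whole process, summed over all threads, so a wall-clock timer such as omp_get_wtime() is likely to give a fairer comparison.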
