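I have the following code, which uses OpenMP to parallelize a Monte Carlo method. My question is why the serial version of the code (monte_carlo_serial) runs so much faster than the parallel version (monte_carlo_parallel). I am running the code on a machine with 32 cores and get the following output on the console: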
-bash-4.1$ gcc -fopenmp hello.c;
-bash-4.1$ ./a.out
Pi (Serial): 3.140856
Time taken 0 seconds 50 milliseconds
Pi (Parallel): 3.132103
Time taken 127 seconds 990 milliseconds
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
#include <time.h>

int niter = 1000000; //number of iterations per FOR loop

int monte_carlo_parallel() {
    double x,y; //x,y value for the random coordinate
    int i; //loop counter
    int count=0; //Count holds all the number of how many good coordinates
    double z; //Used to check if x^2+y^2<=1
    double pi; //holds approx value of pi
    int numthreads = 32;

    #pragma omp parallel firstprivate(x, y, z, i) reduction(+:count) num_threads(numthreads)
    {
        srand48((int)time(NULL) ^ omp_get_thread_num()); //Give random() a seed value
        for (i=0; i<niter; ++i) //main loop
        {
            x = (double)drand48(); //gets a random x coordinate
            y = (double)drand48(); //gets a random y coordinate
            z = ((x*x)+(y*y)); //Checks to see if number is inside unit circle
            if (z<=1)
            {
                ++count; //if it is, consider it a valid random point
            }
        }
    }

    pi = ((double)count/(double)(niter*numthreads))*4.0;
    printf("Pi (Parallel): %f\n", pi);
    return 0;
}

int monte_carlo_serial(){
    double x,y; //x,y value for the random coordinate
    int i; //loop counter
    int count=0; //Count holds all the number of how many good coordinates
    double z; //Used to check if x^2+y^2<=1
    double pi; //holds approx value of pi

    srand48((int)time(NULL) ^ omp_get_thread_num()); //Give random() a seed value
    for (i=0; i<niter; ++i) //main loop
    {
        x = (double)drand48(); //gets a random x coordinate
        y = (double)drand48(); //gets a random y coordinate
        z = ((x*x)+(y*y)); //Checks to see if number is inside unit circle
        if (z<=1)
        {
            ++count; //if it is, consider it a valid random point
        }
    }

    pi = ((double)count/(double)(niter))*4.0;
    printf("Pi (Serial): %f\n", pi);
    return 0;
}

void main(){
    clock_t start = clock(), diff;

    monte_carlo_serial();
    diff = clock() - start;
    int msec = diff * 1000 / CLOCKS_PER_SEC;
    printf("Time taken %d seconds %d milliseconds \n", msec/1000, msec%1000);

    start = clock(), diff;
    monte_carlo_parallel();
    diff = clock() - start;
    msec = diff * 1000 / CLOCKS_PER_SEC;
    printf("Time taken %d seconds %d milliseconds \n", msec/1000, msec%1000);
}
1 Answer
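The variable count is shared across all of your spawned threads. Each of them has to lock count in order to increment it. In addition, if the threads run on separate CPUs (and there is no possible win if they don't), the value of count has to be sent from one core to another and back again.
This is a textbook example of false sharing. In your serial version, count will sit in a register and cost 1 cycle to increment. In the parallel version it will usually not be in cache: you have to tell the other cores to invalidate that cache line, fetch it (L3 takes 66 cycles at best), increment it, and store it back. Every time count migrates from one CPU core to another you pay a minimum of roughly 125 cycles, which is a lot worse than 1. The threads can never run in parallel because they all depend on count.
Try modifying your code so that each thread has its own count, then add up the count values from all threads at the end, and you /might/ see a speedup.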