i have following code uses omp parallelize monte carlo method. question why serial version of code (monte_carlo_serial) run lot faster parallel version (monte_carlo_parallel). running code on machine 32 cores , following result printed console:
-bash-4.1$ gcc -fopenmp hello.c ;
-bash-4.1$ ./a.out
pi (serial): 3.140856
time taken 0 seconds 50 milliseconds
pi (parallel): 3.132103
time taken 127 seconds 990 milliseconds
#include <stdio.h> #include <stdlib.h> #include <math.h> #include <omp.h> #include <time.h> int niter = 1000000; //number of iterations per loop int monte_carlo_parallel() { double x,y; //x,y value random coordinate int i; //loop counter int count=0; //count holds number of how many coordinates double z; //used check if x^2+y^2<=1 double pi; //holds approx value of pi int numthreads = 32; #pragma omp parallel firstprivate(x, y, z, i) reduction(+:count) num_threads(numthreads) { srand48((int)time(null) ^ omp_get_thread_num()); //give random() seed value (i=0; i<niter; ++i) //main loop { x = (double)drand48(); //gets random x coordinate y = (double)drand48(); //gets random y coordinate z = ((x*x)+(y*y)); //checks see if number inside unit circle if (z<=1) { ++count; //if is, consider valid random point } } } pi = ((double)count/(double)(niter*numthreads))*4.0; printf("pi (parallel): %f\n", pi); return 0; } int monte_carlo_serial(){ double x,y; //x,y value random coordinate int i; //loop counter int count=0; //count holds number of how many coordinates double z; //used check if x^2+y^2<=1 double pi; //holds approx value of pi srand48((int)time(null) ^ omp_get_thread_num()); //give random() seed value (i=0; i<niter; ++i) //main loop { x = (double)drand48(); //gets random x coordinate y = (double)drand48(); //gets random y coordinate z = ((x*x)+(y*y)); //checks see if number inside unit circle if (z<=1) { ++count; //if is, consider valid random point } } pi = ((double)count/(double)(niter))*4.0; printf("pi (serial): %f\n", pi); return 0; } void main(){ clock_t start = clock(), diff; monte_carlo_serial(); diff = clock() - start; int msec = diff * 1000 / clocks_per_sec; printf("time taken %d seconds %d milliseconds \n", msec/1000, msec%1000); start = clock(), diff; monte_carlo_parallel(); diff = clock() - start; msec = diff * 1000 / clocks_per_sec; printf("time taken %d seconds %d milliseconds \n", msec/1000, msec%1000); }
the variable
count
is shared across of spawned threads. each of them has lock count increment it. in addition if threads running on separate cpu's (and there's no possible win if they're not) have cost of sending value of count 1 core , again.
this textbook example of false sharing. accessing count in serial version in register , cost 1 cycle increment. in parallel version not in cache, have tell other cores invalidate cache line, fetch (l3 going take 66 cycles @ best) increment it, , store back. every time count migrates 1 cpu core have minimum ~125 cycle cost lot worse 1. threads never able run in parallel because depend on count.
try modify code each thread has own count, sum values of count threads @ end , /might/ see speedup.
Comments
Post a Comment