I'm just playing around with OpenMP. Look at these two code fragments:
#pragma omp parallel
{
for( i =0;i<n;i++)
{
doing something
}
}
and
for( i =0;i<n;i++)
{
#pragma omp parallel
{
doing something
}
}
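Why is the first one much slower (about 5 times) than the second? In theory I thought the first one must be faster, because the parallel region is created only once rather than n times as in the second. Can someone explain this to me?
The code I want to parallelize has the following structure: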
for(i=0;i<n;i++) // won't be parallelizable
{
for(j=i+1;j<n;j++) //will be parallelized
{
doing sth.
}
for(j=i+1;j<n;j++) //will be parallelized
for(k = i+1;k<n;k++)
{
doing sth.
}
}
I wrote a simple program to measure the time and reproduce my results.
#include <stdio.h>
#include <omp.h>
void test( int n)
{
int i ;
double t_a = 0.0, t_b = 0.0 ;
t_a = omp_get_wtime() ;
#pragma omp parallel
{
for(i=0;i<n;i++)
{
}
}
t_b = omp_get_wtime() ;
for(i=0;i<n;i++)
{
#pragma omp parallel
{
}
}
printf( "directive outside for-loop: %lf\n", 1000*(omp_get_wtime()-t_a)) ;
printf( "directive inside for-loop: %lf \n", 1000*(omp_get_wtime()-t_b)) ;
}
int main(void)
{
int i, n ;
double t_1 = 0.0, t_2 = 0.0 ;
printf( "n: " ) ;
scanf( "%d", &n ) ;
t_1 = omp_get_wtime() ;
#pragma omp parallel
{
for(i=0;i<n;i++)
{
}
}
t_2 = omp_get_wtime() ;
for(i=0;i<n;i++)
{
#pragma omp parallel
{
}
}
printf( "directive outside for-loop: %lf\n", 1000*(omp_get_wtime()-t_1)) ;
printf( "directive inside for-loop: %lf \n", 1000*(omp_get_wtime()-t_2)) ;
test(n) ;
return 0 ;
}
If I run it with different values of n, I always get different results.
n: 30000
directive outside for-loop: 0.881884
directive inside for-loop: 0.073054
directive outside for-loop: 0.049098
directive inside for-loop: 0.011663
n: 30000
directive outside for-loop: 0.402774
directive inside for-loop: 0.071588
directive outside for-loop: 0.049168
directive inside for-loop: 0.012013
n: 30000
directive outside for-loop: 2.198740
directive inside for-loop: 0.065301
directive outside for-loop: 0.047911
directive inside for-loop: 0.012152
n: 1000
directive outside for-loop: 0.355841
directive inside for-loop: 0.079480
directive outside for-loop: 0.013549
directive inside for-loop: 0.012362
n: 10000
directive outside for-loop: 0.926234
directive inside for-loop: 0.071098
directive outside for-loop: 0.023536
directive inside for-loop: 0.012222
n: 10000
directive outside for-loop: 0.354025
directive inside for-loop: 0.073542
directive outside for-loop: 0.023607
directive inside for-loop: 0.012292
How can you explain this difference to me?!
Results of your version:
Input n: 1000
[2] directive outside for-loop: 0.331396
[2] directive inside for-loop: 0.002864
[2] directive outside for-loop: 0.011663
[2] directive inside for-loop: 0.001188
[1] directive outside for-loop: 0.021092
[1] directive inside for-loop: 0.001327
[1] directive outside for-loop: 0.005238
[1] directive inside for-loop: 0.001048
[0] directive outside for-loop: 0.020812
[0] directive inside for-loop: 0.001188
[0] directive outside for-loop: 0.005029
[0] directive inside for-loop: 0.001257
Answer 0 (score: 4)
because the parallel region is created only once rather than n times as in the second?
Sort of. The construct
#pragma omp parallel
{
}
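also means distributing work items to the threads at '{' and returning the threads to the thread pool at '}'. This involves a lot of thread-to-thread communication. In addition, by default, waiting threads are put to sleep by the OS, and waking a thread up takes some time.
About your middle example: you can try to limit the parallelism of the outer for with: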
#pragma omp parallel private(i,k)
{
for(i=0;i<n;i++) // won't be parallelized
{
#pragma omp for
for(j=i+1;j<n;j++) //will be parallelized
{
doing sth.
}
#pragma omp for
for(j=i+1;j<n;j++) //will be parallelized
for(k = i+1;k<n;k++)
{
doing sth.
}
// Is there really nothing? - if no - use:
// won't be parallelized
#pragma omp single
{ //seq part of outer loop
printf("Progress... %i\n", i); fflush(stdout);
}
// here is the point: every thread runs the outer loop itself, but...
#pragma omp barrier
// all loop iterations are synchronized:
// thr0 thr1 thr2
// i 0 0 0
// ---- barrier ----
// i 1 1 1
// ---- barrier ----
// i 2 2 2
// and so on
}
}
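In general, placing the parallelism on the highest (outermost) possible for of a for nest is better than placing it on the inner loops. If some code has to run sequentially, use the advanced pragmas (such as omp barrier, omp master or omp single) or omp_locks for that code. Any of these will be faster than starting omp parallel many times.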
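As a concrete illustration of the advice above, here is a minimal sketch of one parallel region wrapped around a sequential outer loop, with the inner loop shared out by omp for and a sequential step guarded by omp single. The function name triangular_update and its workload (step i adds x[i] to every later element) are made up for this example and are not taken from the question.

```c
/* Hypothetical workload: step i adds x[i] to every later element,
   so step i+1 needs the results of step i (the outer loop is sequential).
   Returns the number of outer steps that were completed. */
int triangular_update(double *x, int n)
{
    int steps_done = 0;   /* shared; only written inside omp single */

    #pragma omp parallel
    {
        /* Every thread executes the outer loop with its own private i. */
        for (int i = 0; i < n; i++) {
            /* The iterations of the inner loop are divided among the threads. */
            #pragma omp for
            for (int j = i + 1; j < n; j++) {
                x[j] += x[i];
            }
            /* Implicit barrier here: all of step i is finished. */

            /* Sequential part of the step, run by exactly one thread. */
            #pragma omp single
            {
                steps_done++;
            }
            /* Implicit barrier at the end of single as well. */
        }
    }
    return steps_done;
}
```

The thread team is created once, and the implicit barriers after omp for and omp single keep all threads on the same value of i. Compiled without OpenMP support the pragmas are ignored and the function simply runs serially, which is a handy way to check the result.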
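Answer 1 (score: 2)
Your whole test is badly wrong. The first printf reports the time of both sections together and the second printf reports the time of the second section only; the time of the first section alone is never measured. In addition, the second printf line also includes the time taken by the first printf.
The first run is very slow because it includes thread startup time, memory initialization and cache effects. Also, the OpenMP runtime's heuristics may tune themselves after several parallel regions.
My version of your test: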
$ cat test.c
#include <stdio.h>
#include <omp.h>
void test( int n, int j)
{
int i ;
double t_a = 0.0, t_b = 0.0, t_c = 0.0 ;
t_a = omp_get_wtime() ;
#pragma omp parallel
{
for(i=0;i<n;i++) { }
}
t_b = omp_get_wtime() ;
for(i=0;i<n;i++) {
#pragma omp parallel
{ }
}
t_c = omp_get_wtime() ;
printf( "[%i] directive outside for-loop: %lf\n", j, 1000*(t_b-t_a)) ;
printf( "[%i] directive inside for-loop: %lf \n", j, 1000*(t_c-t_b)) ;
}
int main(void)
{
int i, n, j=3 ;
double t_1 = 0.0, t_2 = 0.0, t_3 = 0.0;
printf( "Input n: " ) ;
scanf( "%d", &n ) ;
while( j --> 0 ) {
t_1 = omp_get_wtime();
#pragma omp parallel
{
for(i=0;i<n;i++) { }
}
t_2 = omp_get_wtime();
for(i=0;i<n;i++) {
#pragma omp parallel
{ }
}
t_3 = omp_get_wtime();
printf( "[%i] directive outside for-loop: %lf\n", j, 1000*(t_2-t_1)) ;
printf( "[%i] directive inside for-loop: %lf \n", j, 1000*(t_3-t_2)) ;
test(n,j) ;
}
return 0 ;
}
I do 3 runs for each n inside the program itself.
Results:
$ ./test
Input n: 1000
[2] directive outside for-loop: 5.044824
[2] directive inside for-loop: 48.605116
[2] directive outside for-loop: 0.115031
[2] directive inside for-loop: 1.469195
[1] directive outside for-loop: 0.082415
[1] directive inside for-loop: 1.455855
[1] directive outside for-loop: 0.081297
[1] directive inside for-loop: 1.462352
[0] directive outside for-loop: 0.080528
[0] directive inside for-loop: 1.455786
[0] directive outside for-loop: 0.080807
[0] directive inside for-loop: 1.467101
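Only the very first run is affected. All subsequent results are the same for test() and main().
Better and more stable results come from a run like this (I used gcc-4.6.1 and a static build):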
$ OMP_WAIT_POLICY=active GOMP_CPU_AFFINITY=0-15 OMP_NUM_THREADS=2 ./test
Input n: 5000
[2] directive outside for-loop: 0.079412
[2] directive inside for-loop: 4.266087
[2] directive outside for-loop: 0.031708
[2] directive inside for-loop: 4.319727
[1] directive outside for-loop: 0.047563
[1] directive inside for-loop: 4.290812
[1] directive outside for-loop: 0.033733
[1] directive inside for-loop: 4.324406
[0] directive outside for-loop: 0.047004
[0] directive inside for-loop: 4.273143
[0] directive outside for-loop: 0.092331
[0] directive inside for-loop: 4.279219
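I set two OpenMP performance environment variables and limited the number of threads to 2.
Also, your "parallelized" loop is wrong. (I reproduced this error in my variant above.) The variable i is shared here: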
#pragma omp parallel
{
for(i=0;i<n;i++) { }
}
You should write it as
#pragma omp parallel
{
for(int local_i=0;local_i<n;local_i++) { }
}
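A small sketch of how the private counter can be checked: every thread in the region should execute exactly n iterations, so the total must equal n times the number of threads. The function count_iterations and its out-parameter are illustrative names, not part of the original test.

```c
/* Runs an n-iteration loop in every thread of one parallel region.
   Returns the total number of iterations executed by all threads;
   *threads_out receives the number of threads that took part. */
long count_iterations(int n, int *threads_out)
{
    long total = 0;
    int threads = 0;

    /* Each thread gets private copies of total and threads,
       which are summed when the region ends. */
    #pragma omp parallel reduction(+:total, threads)
    {
        threads += 1;
        /* local_i is declared inside the region, so it is private. */
        for (int local_i = 0; local_i < n; local_i++) {
            total += 1;
        }
    }
    *threads_out = threads;
    return total;
}
```

With a shared counter, as in the loop of the question, the threads race on i and the number of iterations each one performs is unpredictable. With the private counter the returned total is always n multiplied by the thread count.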
UPDATE7 Your results for n = 1000:
[2] directive inside for-loop: 0.001188
[1] directive outside for-loop: 0.021092
[1] directive inside for-loop: 0.001327
[1] directive outside for-loop: 0.005238
[1] directive inside for-loop: 0.001048
[0] directive outside for-loop: 0.020812
[0] directive inside for-loop: 0.001188
[0] directive outside for-loop: 0.005029
[0] directive inside for-loop: 0.001257
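The 0.001 or 0.02 printed by your code is ... seconds multiplied by 1000, so it is in milliseconds (ms). That is ... about 1 microsecond or 20 microseconds. The granularity of some system clocks (the user time or system time output fields of the time utility) is 1 ms, 3 ms or 10 ms. 1 microsecond is 2000-3000 CPU ticks (for a 2-3 GHz CPU). So you cannot measure such short intervals without a special setup. You should:
- measure time by reading the TSC (the rdtsc asm instruction)
- disable instruction reordering on out-of-order CPUs before and after rdtsc by adding a cpuid instruction (or another instruction that disables reordering); of the current generation, only Atom is not an OOO CPU
- rewrite the test so that it does not wait on scanf (pass n via argv[1])
UPDATE8: By statistics I mean: take several values, 7 or more. Discard the first value (or the first few, if you have a large number of measurements). Sort them. Discard ... 10-20% of the largest and smallest results. Compute the mean. Literally: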
double results[100], sum=0.0, mean = 0.0;
int count = 0, it;
// fill results[0]..results[99], then sort results[5]..results[99] here (the first 5 are discarded)
for(it=20; it< 85; it ++) {
count++; sum+= results[it];
}
mean = sum/count;
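The same procedure can be packaged as a function with the sort written out. This is a sketch under the assumption that the measurements are already stored in an array; the names trimmed_mean, warmup and trim are introduced here for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* qsort comparator for doubles, ascending order. */
static int compare_doubles(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Drops the first `warmup` samples, sorts the rest in place, then averages
   what remains after discarding `trim` samples from each end. */
double trimmed_mean(double *results, size_t count, size_t warmup, size_t trim)
{
    assert(count > warmup + 2 * trim);   /* something must be left to average */

    double *kept = results + warmup;
    size_t kept_count = count - warmup;
    qsort(kept, kept_count, sizeof kept[0], compare_doubles);

    double sum = 0.0;
    for (size_t it = trim; it < kept_count - trim; it++) {
        sum += kept[it];
    }
    return sum / (double)(kept_count - 2 * trim);
}
```

For 100 measurements, trimmed_mean(results, 100, 5, 15) averages the same index range (20 to 84) as the loop above: 5 warm-up values are skipped and 15 values are cut from each end of the sorted remainder.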