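I am writing code for n-ary search, i.e. a search that divides the search space into n parts. When I compare the parallel code with the same code without OpenMP directives (serial execution), the parallel code is many times slower than the serial code. After running both programs several times, I saw that the parallel code is sometimes faster, but not on every run. This may be due to the cache hierarchy. I am running the program on a quad-core processor with 4 GB of RAM.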
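According to the answer to "No speedup with OpenMP", memory-bound performance and load balancing do not apply to a small problem such as an array of SIZE 100. I am not using any synchronization. I also tried increasing the array size to 10000000, but the parallel code is still not consistently faster. Many times the serial code outperforms the parallel code.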
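According to http://pages.tacc.utexas.edu/~eijkhout/pcse/html/omp-loop.html, the implicit barrier at the end of a worksharing construct can be removed with the nowait clause. I tried adding the nowait clause, and I also tried schedule(dynamic) and schedule(auto) as described in https://software.intel.com/en-us/articles/openmp-loop-scheduling, but the problem remains.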
Code:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define SIZE 100
#define NUM_THREADS 4

int* a;
int num;

void nary(int num)
{
    int found = 0, low = 0, high = SIZE, step;
    int i = 0;
    while(!found && low <= high)
    {
        step = (high-low)/NUM_THREADS;
        printf("Low :- %d\tHigh :- %d\tStep :- %d\n", low,high,step);
        printf("\n");
        #pragma omp parallel for num_threads(NUM_THREADS) shared(low,high,step)
        for (i = 0; i < NUM_THREADS; ++i)
        {
            printf("First element :- %d by thread :- %d\n", a[low+step*i],omp_get_thread_num());
            if (a[low+step*i] == num)
            {
                found = 1;
            }
        }
        printf("\n");
        /* First block */
        if (a[low+step] > num)
        {
            high = low + step - 1;
            printf("First \nLow :- %d \nHigh :- %d\n\n",low,high);
        }
        /* Last block */
        else if (a[low+step*(NUM_THREADS-1)] < num)
        {
            low = low + step * (NUM_THREADS-1) + 1;
            printf("Last\nLow :- %d \nHigh :- %d\n\n",low,high);
        }
        /* Middle blocks */
        else{
            #pragma omp parallel for num_threads(NUM_THREADS) schedule(static) shared(low,high,step)
            for (i = 1; i < (NUM_THREADS-1); ++i)
            {
                if (a[low+step*i] < num && a[low+step*(i+1)] > num)
                {
                    low = low + step*i + 1;
                    high = low + step*(i+1) - 1;
                }
            }
            printf("middle\nLow :- %d \nHigh :- %d\n\n",low,high);
        }
    }
    if (found == 1)
    {
        printf("Element found\n");
    }
    else
    {
        printf("Element Not found\n");
    }
}

int main()
{
    int i = 0;
    /* double, not int: omp_get_wtime() returns wall-clock seconds as a double */
    double startTime = omp_get_wtime();
    /* Dynamically allocate memory using malloc() */
    a = (int*)malloc(sizeof(int) * SIZE);
    #pragma omp parallel for schedule(static)
    for (i = 0; i < SIZE; ++i)
    {
        a[i] = i;
    }
    printf("Enter the element to be searched :- \n");
    scanf("%d", &num);
    nary(num);
    printf("\nExecution time :- %f\n", omp_get_wtime()-startTime);
    return 0;
}
Output of the parallel execution:
Enter the element to be searched :-
20
Low :- 0 High :- 100 Step :- 25
First element :- 0 by thread :- 0
First element :- 50 by thread :- 2
First element :- 25 by thread :- 1
First element :- 75 by thread :- 3
First
Low :- 0
High :- 24
Low :- 0 High :- 24 Step :- 6
First element :- 6 by thread :- 1
First element :- 18 by thread :- 3
First element :- 0 by thread :- 0
First element :- 12 by thread :- 2
Last
Low :- 19
High :- 24
Low :- 19 High :- 24 Step :- 1
First element :- 20 by thread :- 1
First element :- 21 by thread :- 2
First element :- 19 by thread :- 0
First element :- 22 by thread :- 3
middle
Low :- 19
High :- 24
Element found
Execution time :- 26.824379
Output of the serial execution:
Enter the element to be searched :-
20
Low :- 0 High :- 100 Step :- 25
First element :- 0 by thread :- 0
First element :- 25 by thread :- 0
First element :- 50 by thread :- 0
First element :- 75 by thread :- 0
First
Low :- 0
High :- 24
Low :- 0 High :- 24 Step :- 6
First element :- 0 by thread :- 0
First element :- 6 by thread :- 0
First element :- 12 by thread :- 0
First element :- 18 by thread :- 0
Last
Low :- 19
High :- 24
Low :- 19 High :- 24 Step :- 1
First element :- 19 by thread :- 0
First element :- 20 by thread :- 0
First element :- 21 by thread :- 0
First element :- 22 by thread :- 0
middle
Low :- 19
High :- 24
Element found
Execution time :- 4.349347
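What is the reason behind this? Is it because the code contains many conditional statements, with for loops inside the conditional blocks?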
Answer 0 (score: 3)
There are a number of small issues with your approach.
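First of all, binary search is very fast. In the worst case it takes only about log2(n) iterations. Even with a trillion elements to search, that is only 40 iterations! Each iteration is extremely simple and involves essentially one memory access. So even for large data sets we are talking about a worst-case search time of a few microseconds. That is, of course, without polluting the measurement with printf.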
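To make the 40-iteration figure concrete, here is a minimal sketch that counts how many probes a binary search needs in the worst case. The helper name worst_case_steps is made up for this illustration and does not appear in the question.

```c
/* Hypothetical helper: worst-case number of probes a binary search needs
   on n sorted elements, which is floor(log2(n)) + 1 for n >= 1. */
unsigned worst_case_steps(unsigned long long n)
{
    unsigned steps = 0;
    while (n > 0) {
        n /= 2;   /* each probe leaves at most half of the range */
        ++steps;
    }
    return steps;
}
```

Calling worst_case_steps(1000000000000ULL) for a trillion elements returns 40, and the SIZE of 100 used in the question needs at most 7 probes.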
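On the other hand, according to some answers, spawning a thread takes about 10 microseconds. So even a perfectly scaling implementation that parallelizes a single search has no chance of an actual performance gain.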
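Looking at your specific code, you create two parallel regions per iteration. Each thread has only a tiny amount of work to do compared with the overhead of the parallel region and the omp for worksharing construct (an overhead that can vary considerably depending on the implementation and the OS).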
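I find the mixing of arity and NUM_THREADS problematic. Your update step contains two serial checks, and the remaining NUM_THREADS-2 intervals are checked by NUM_THREADS threads. So for NUM_THREADS=4, even with perfectly parallel execution, you only reduce the time from four interval checks to three, a speedup of the update step of merely 4/3, or about 1.3x.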
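Furthermore, your code contains severe race conditions: modifying low in the second parallel loop is a very bad idea, because other threads are checking their intervals based on low at the same time.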
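One possible way to avoid this race, shown only as a sketch: let the threads read low, high and step without writing them, merge their findings through OpenMP reductions, and update the interval serially afterwards. The names ARITY and nary_search are invented for this illustration, and the probing scheme differs slightly from the question's code. Given the overheads discussed above, this is not expected to beat a plain serial binary search.

```c
#define ARITY 4  /* hypothetical: number of probe points per step */

/* Index of key in sorted a[0..n-1], or -1 if it is absent. */
long nary_search(const int *a, long n, int key)
{
    long low = 0, high = n - 1;

    while (low <= high) {
        long len  = high - low + 1;
        long step = (len + ARITY - 1) / ARITY;  /* ceil(len / ARITY), >= 1 */
        long hit  = -1;  /* position of key if some probe matched */
        long less = 0;   /* number of probes smaller than key */
        int i;

        /* Threads only read low/high/step; results merge via reductions. */
        #pragma omp parallel for reduction(max:hit) reduction(+:less)
        for (i = 0; i < ARITY; ++i) {
            long p = low + i * step;
            if (p > high)
                continue;           /* probe would fall outside the range */
            if (a[p] == key)
                hit = p;
            else if (a[p] < key)
                ++less;
        }

        if (hit >= 0)
            return hit;
        if (less == 0)
            return -1;              /* key is smaller than a[low] */

        /* Serial update: key can only lie strictly between the last
           smaller probe and the next probe (or the old upper bound). */
        long new_high = low + less * step - 1;
        low = low + (less - 1) * step + 1;
        if (new_high < high)
            high = new_high;
    }
    return -1;
}
```

With GCC this would be compiled with -fopenmp. Without that flag the pragma is ignored and the function runs serially with the same result.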
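If you want to practically improve the performance of searching in sorted contiguous data, take a look at these slides. If you want to speed up your application with OpenMP / threads, you should probably do so at a coarser level.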
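As an illustration of what "coarser level" could mean, the sketch below assumes the application has many independent lookups to perform and distributes whole searches across threads, so that one parallel region is paid for once per batch rather than twice per search step. The function names binary_search and search_many and the batch interface are assumptions for this example, not something taken from the question.

```c
/* Plain serial binary search: index of key in sorted a[0..n-1], or -1. */
static long binary_search(const int *a, long n, int key)
{
    long low = 0, high = n - 1;
    while (low <= high) {
        long mid = low + (high - low) / 2;
        if (a[mid] == key)
            return mid;
        if (a[mid] < key)
            low = mid + 1;
        else
            high = mid - 1;
    }
    return -1;
}

/* Hypothetical batch interface: answers nq independent queries.
   Each iteration writes only its own out[q], so there is no race. */
void search_many(const int *a, long n, const int *keys, long *out, long nq)
{
    long q;
    #pragma omp parallel for schedule(static)
    for (q = 0; q < nq; ++q)
        out[q] = binary_search(a, n, keys[q]);
}
```

Whether this pays off depends on how many queries are batched together. For a handful of lookups the serial loop will most likely still win.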