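I have a problem with the #pragma omp parallel section of my code.
I have a program that should sort a given array of integers with quicksort, using multiple threads. To do this, in each step every thread is assigned a part of the array, partitions it, and returns the number of elements smaller than a given global pivot. The code runs without errors, but the more threads I tell OpenMP to use, the slower it gets. I added logging for the execution times, and it looks like a large part of the program's time is spent on OpenMP overhead. The overhead seems to be consistent, so the speed difference is proportional to the size of the array being sorted.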
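For readers who want to picture the per-thread step, here is a minimal sketch of a partition against a fixed pivot value. It is not the partition_fixed_pivot used in this program (that function is not shown); the name partition_by_value is hypothetical, and the sketch assumes that upper is exclusive and that the return value is the index of the first element not smaller than the pivot.

```c
#include <assert.h>

/* Hypothetical stand-in for partition_fixed_pivot (the real one is not shown).
 * Rearranges data[lower..upper) so that every element smaller than `pivot`
 * comes first, and returns the index of the first element that is not
 * smaller than `pivot`. `upper` is assumed to be exclusive. */
static int partition_by_value(int lower, int upper, int pivot, int *data) {
    assert(lower >= 0 && lower <= upper);
    int boundary = lower; /* next slot for an element smaller than pivot */
    for (int i = lower; i < upper; ++i) {
        if (data[i] < pivot) {
            int tmp = data[i];
            data[i] = data[boundary];
            data[boundary] = tmp;
            ++boundary;
        }
    }
    return boundary;
}
```

Subtracting lower from the returned index gives the number of elements in the block that are smaller than the pivot, which is the relative count each thread reports.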
Here is the code that is executed in parallel:
void create_count_elems_lower(int lower, int upper, int global_pivot_position, int *block_sizes, int *data,
                              int *count_elems_lower) {
    assert(lower >= 0);
    times_function_called++;
    int pivot = data[global_pivot_position];
    double start = omp_get_wtime();
    // Timing bookkeeping; both variables are shared by all threads in the loop below.
    double wait_start = 0;
    double wait_time = 0;
    // One iteration per block, one block per thread.
#pragma omp parallel for
    for (int i = 0; i < omp_get_max_threads(); ++i) {
        double start_thread = omp_get_wtime();
        // Start of block i; this offset is exact only if all preceding blocks have the same size.
        int lower_p = lower;
        lower_p += i == 0 ? 0 : block_sizes[i - 1] * i;
        count_elems_lower[i] = partition_fixed_pivot(lower_p, lower_p + block_sizes[i], pivot, data) -
                               lower_p; // - lower_p since it needs to be relative
        assert(count_elems_lower[i] >= 0);
        double end_thread = omp_get_wtime();
        double time_spent = end_thread - start_thread;
        // The statistics below are updated by several threads without synchronization,
        // so the reported values are only approximate.
        time_spent_per_thread_sum += time_spent;
        if (max_time_spent_per_thread[i] < time_spent) {
            max_time_spent_per_thread[i] = time_spent;
        }
        if (wait_start == 0) {
            wait_start = end_thread;
        } else {
            double time_waiting = end_thread - wait_start;
            if (time_waiting > wait_time) {
                wait_time = time_waiting;
            }
        }
    }
    double end = omp_get_wtime();
    time_spent_in_function += end - start;
    time_spent_idling += wait_time;
}
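The timing counters in the function above are written by several threads at once. One way to collect similar statistics without a data race is an OpenMP reduction. The following is only a sketch under assumptions: block_timing, now_seconds and count_lower_timed are hypothetical names, a plain counting loop stands in for partition_fixed_pivot, and each block's offset is the sum of the preceding block sizes (which matches block_sizes[i - 1] * i when all blocks are equal).

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#else
#include <time.h>
#endif

/* Wall-clock helper; falls back to clock() when built without -fopenmp. */
static double now_seconds(void) {
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* Hypothetical per-call statistics, returned instead of updating globals. */
typedef struct {
    double busy_sum; /* sum of the time spent in all iterations */
    double busy_max; /* longest single iteration */
} block_timing;

/* Counts, for each block, how many elements are smaller than `pivot`.
 * Counting stands in for the real partition step. */
static block_timing count_lower_timed(const int *data, const int *block_sizes,
                                      int num_blocks, int pivot, int *count_lower) {
    assert(num_blocks >= 0);
    double busy_sum = 0.0;
    double busy_max = 0.0;

    /* Each thread works on private copies of busy_sum and busy_max;
     * OpenMP combines them after the loop, so there is no data race. */
    #pragma omp parallel for reduction(+:busy_sum) reduction(max:busy_max)
    for (int b = 0; b < num_blocks; ++b) {
        double t0 = now_seconds();

        /* Offset of block b: sum of the sizes of all preceding blocks. */
        int offset = 0;
        for (int k = 0; k < b; ++k) {
            offset += block_sizes[k];
        }

        int lower_count = 0;
        for (int j = 0; j < block_sizes[b]; ++j) {
            if (data[offset + j] < pivot) {
                ++lower_count;
            }
        }
        count_lower[b] = lower_count; /* each iteration writes its own slot */

        double elapsed = now_seconds() - t0;
        busy_sum += elapsed;
        if (elapsed > busy_max) {
            busy_max = elapsed;
        }
    }

    block_timing result = { busy_sum, busy_max };
    return result;
}
```

The caller can add the returned values to its own totals after the parallel loop has finished, so the bookkeeping itself stays sequential. The reduction does not remove the cost of starting the parallel region; it only makes the measurements trustworthy.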
Here is the test function:
printf("Num threads: %d\n", num_threads);
double start = omp_get_wtime();
test_sort_big();
double end = omp_get_wtime();
printf("total: %f\n", end - start);
printf("times function called: %f\n", times_function_called);
printf("time spent in create_count_elems_lower: %f\n", time_spent_in_function);
printf("time spent per thread approx: %f\n", time_spent_per_thread_sum / num_threads);
printf("time spent idling: %f\n", time_spent_idling);
for (int i = 0; i < num_threads; ++i) {
    printf("max time spent by thread %d: %f \t", i, max_time_spent_per_thread[i]);
}
The program is compiled and linked with:
gcc -fopenmp -O3 -c -o tests/tests.o tests/tests.c
gcc -fopenmp -o build_test tests/tests.o array_utils.o datagenerator.o quicksort.o
The results are:
Num threads: 1
Testing sorting of 10000000 Elements
total: 9.204632
times function called: 10000000.000000
time spent in create_count_elems_lower: 5.914602
time spent per thread approx: 1.610363
time spent idling: 0.000000
max time spent by thread 0: 0.041889
Num threads: 4
Testing sorting of 10000000 Elements
total: 16.955334
times function called: 10000000.000000
time spent in create_count_elems_lower: 12.598185
time spent per thread approx: 0.874607
time spent idling: 2.130419
max time spent by thread 0: 0.016055 max time spent by thread 1: 0.013543 max time spent by thread 2: 0.013532 max time spent by thread 3: 0.018599
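As a rough check on the numbers above, the extra time per call can be estimated by dividing the difference between the two "time spent in create_count_elems_lower" values by the number of calls. The helper below is a hypothetical illustration of that arithmetic, not part of the program, and it assumes both runs made the same number of calls, as the log reports.

```c
#include <assert.h>

/* Extra wall time per call, in microseconds, of the multi-threaded run
 * compared with the single-threaded run. Hypothetical helper for
 * illustration; assumes both runs made the same number of calls. */
static double extra_microseconds_per_call(double seconds_multi,
                                          double seconds_single,
                                          double calls) {
    assert(calls > 0.0);
    return (seconds_multi - seconds_single) / calls * 1e6;
}
```

Calling extra_microseconds_per_call(12.598185, 5.914602, 10000000.0) gives roughly 0.67 microseconds of additional time per call. The single-threaded run itself averages about 0.59 microseconds per call, so most calls appear to handle very little data and a fixed per-call cost can easily dominate.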
I am running Fedora 27 (64-bit) on an Intel® Core™ i7-2760QM CPU @ 2.40GHz × 8.
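Edit: It turns out the overhead was indeed the problem, because the method was called many times with only one thread available. Changing the algorithm to do a simple sort when only one thread is available improved the runtime considerably.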
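The change described in the edit might look roughly like the sketch below. The function name sort_sequential_if_single_thread and its parameters are hypothetical, upper is assumed to be exclusive, and the standard library's qsort stands in for the "simple sort"; the actual code may differ.

```c
#include <assert.h>
#include <stdlib.h>

/* Comparison callback for qsort: ascending order of ints. */
static int compare_ints(const void *a, const void *b) {
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Hypothetical dispatcher. Sorts data[lower..upper) sequentially and
 * returns 1 when only one thread is available or the range is trivial.
 * Returns 0 otherwise, leaving the data untouched so the caller can
 * continue with the parallel partitioning step. */
static int sort_sequential_if_single_thread(int *data, int lower, int upper,
                                            int threads_available) {
    assert(lower >= 0 && lower <= upper);
    if (threads_available <= 1 || upper - lower < 2) {
        /* No parallel region is opened here, so no OpenMP overhead is paid. */
        qsort(data + lower, (size_t)(upper - lower), sizeof(int), compare_ints);
        return 1;
    }
    return 0;
}
```

The point of the check is that sub-ranges assigned to a single thread never enter create_count_elems_lower at all, which avoids paying the cost of a parallel region for work that cannot be shared anyway.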