I have to iterate over a std::map, and the amount of work that has to be done in each iteration varies. Looks like a perfect scenario for dynamic scheduling, doesn't it?
However, non-random-access iterators (such as those of std::map) are troublesome when it comes to parallelizing loops with OpenMP. The performance of this particular code will be crucial for me, so in search of the most efficient solution I created the following benchmark:
#include <omp.h>
#include <cstdio>    // printf
#include <iostream>
#include <iterator>  // std::advance
#include <map>
#include <vector>

#define COUNT 0x00006FFF
#define UNUSED(variable) (void)(variable)

using std::map;
using std::vector;
void test1(map<int, vector<int> >& m) {
    double time = omp_get_wtime();
    map<int, vector<int> >::iterator iterator = m.begin();
    #pragma omp parallel
    #pragma omp for schedule(dynamic, 1) nowait
    for (size_t i = 0; i < m.size(); ++i) {
        vector<int>* v;
        // Grab the element the shared iterator currently points to.
        #pragma omp critical
        v = &iterator->second;
        for (size_t j = 0; j < v->size(); ++j) {
            (*v)[j] = j;
        }
        // Advance the shared iterator for the next iteration.
        #pragma omp critical
        iterator++;
    }
    printf("Test #1: %f s\n", (omp_get_wtime() - time));
}
void test2(map<int, vector<int> >& m) {
    double time = omp_get_wtime();
    #pragma omp parallel
    {
        // Every thread walks the whole map; "single nowait" lets exactly one
        // thread execute each iteration's body while the others skip past it.
        for (map<int, vector<int> >::iterator i = m.begin(); i != m.end(); ++i) {
            #pragma omp single nowait
            {
                vector<int>& v = i->second;
                for (size_t j = 0; j < v.size(); ++j) {
                    v[j] = j;
                }
            }
        }
    }
    printf("Test #2: %f s\n", (omp_get_wtime() - time));
}
void test3(map<int, vector<int> >& m) {
    double time = omp_get_wtime();
    #pragma omp parallel
    {
        // Split the map into one contiguous chunk per thread.
        int thread_count = omp_get_num_threads();
        int thread_num = omp_get_thread_num();
        size_t chunk_size = m.size() / thread_count;
        map<int, vector<int> >::iterator begin = m.begin();
        std::advance(begin, thread_num * chunk_size);
        map<int, vector<int> >::iterator end = begin;
        if (thread_num == thread_count - 1)
            end = m.end();  // the last thread also takes the remainder
        else
            std::advance(end, chunk_size);
        for (map<int, vector<int> >::iterator i = begin; i != end; ++i) {
            vector<int>& v = i->second;
            for (size_t j = 0; j < v.size(); ++j) {
                v[j] = j;
            }
        }
    }
    printf("Test #3: %f s\n", (omp_get_wtime() - time));
}
int main(int argc, char** argv) {
    UNUSED(argc);
    UNUSED(argv);
    map<int, vector<int> > m;
    for (int i = 0; i < COUNT; ++i) {
        m[i] = vector<int>(i);  // work per element grows with the key
    }
    test1(m);
    test2(m);
    test3(m);
}
I could come up with three possible variants that mimic my task. The code is very simple and speaks for itself; please take a look. I have run the tests many times, and here are my results:
Test #1: 0.169000 s
Test #2: 0.203000 s
Test #3: 0.194000 s
Test #1: 0.167000 s
Test #2: 0.203000 s
Test #3: 0.191000 s
Test #1: 0.182000 s
Test #2: 0.202000 s
Test #3: 0.197000 s
Test #1: 0.167000 s
Test #2: 0.187000 s
Test #3: 0.211000 s
Test #1: 0.168000 s
Test #2: 0.195000 s
Test #3: 0.192000 s
Test #1: 0.166000 s
Test #2: 0.197000 s
Test #3: 0.199000 s
Test #1: 0.184000 s
Test #2: 0.198000 s
Test #3: 0.199000 s
Test #1: 0.167000 s
Test #2: 0.202000 s
Test #3: 0.207000 s
I am posting this question because I find these results peculiar and absolutely unexpected.
The questions are:
- Do you have any better idea about the parallelization here?

Answer (score: 2):
You could try to mimic schedule(static, 1) of an OpenMP loop, i.e. instead of processing a big contiguous block of iterations, each thread processes iterations with a stride of thread_count. Here is the code:
void test4(map<int, vector<int> >& m) {
    double time = omp_get_wtime();
    #pragma omp parallel
    {
        int thread_count = omp_get_num_threads();
        int thread_num = omp_get_thread_num();
        size_t map_size = m.size();
        // Start at this thread's own offset, then step by thread_count,
        // so the iterations are dealt out round-robin across the threads.
        map<int, vector<int> >::iterator it = m.begin();
        std::advance(it, thread_num);
        for (int i = thread_num; i < map_size; i += thread_count) {
            vector<int>& v = it->second;
            for (size_t j = 0; j < v.size(); ++j) {
                v[j] = j;
            }
            if (i + thread_count < map_size) std::advance(it, thread_count);
        }
    }
    printf("Test #4: %f s\n", (omp_get_wtime() - time));
}
schedule(static, 1) provides better load balancing than schedule(static) when the amount of work increases or decreases across the iteration space, which is the case for your test workload. If the amount of work per iteration is random, both strategies should give the same balance on average.
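For contrast, with a random-access container the two strategies can be expressed directly through the schedule clause. The sketch below is only illustrative and not part of the benchmark above; it assumes the same data were held in a std::vector instead of a std::map:

// Illustrative only: with a random-access container, the scheduling strategies
// discussed above map directly onto OpenMP's schedule clause.
#include <cstddef>
#include <vector>

void fill_static_1(std::vector<std::vector<int> >& data) {
    // Round-robin distribution: thread t gets iterations t, t+T, t+2T, ...
    // which balances a workload that grows with the iteration index.
    #pragma omp parallel for schedule(static, 1)
    for (size_t i = 0; i < data.size(); ++i) {
        std::vector<int>& v = data[i];
        for (size_t j = 0; j < v.size(); ++j) {
            v[j] = j;
        }
    }
}

void fill_static(std::vector<std::vector<int> >& data) {
    // Default static schedule: each thread gets one contiguous block, so the
    // thread owning the last block ends up with the largest vectors.
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < data.size(); ++i) {
        std::vector<int>& v = data[i];
        for (size_t j = 0; j < v.size(); ++j) {
            v[j] = j;
        }
    }
}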
Another variant is to mimic schedule(dynamic) with the help of an atomic counter. The code:
void test5(map<int, vector<int> >& m) {
    double time = omp_get_wtime();
    int count = 0;
    #pragma omp parallel shared(count)
    {
        int i;
        int i_old = 0;
        size_t map_size = m.size();
        map<int, vector<int> >::iterator it = m.begin();
        // Atomically fetch-and-increment the shared counter to claim an index.
        #pragma omp atomic capture
        i = count++;
        while (i < map_size) {
            // Advance the thread-local iterator by the distance from the
            // previously claimed index to the newly claimed one.
            std::advance(it, i - i_old);
            vector<int>& v = it->second;
            for (size_t j = 0; j < v.size(); ++j) {
                v[j] = j;
            }
            i_old = i;
            // Claim the next index.
            #pragma omp atomic capture
            i = count++;
        }
    }
    printf("Test #5: %f s\n", (omp_get_wtime() - time));
}
Within the loop, each thread decides how far it should advance its local iterator over the map. A thread first atomically increments the counter and obtains its previous value, which gives it the iteration index it has claimed, and then advances the iterator by the difference between the new index and the previous one. The loop repeats until the counter grows beyond the map size.
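To try these two variants alongside the original benchmark, one could simply append the calls in main() and build with OpenMP enabled; the file name and compiler flags below are only an example:

    // Appended to the existing main(), after test3(m):
    test4(m);
    test5(m);
    // Example build with OpenMP enabled (assuming GCC):
    //   g++ -O2 -fopenmp benchmark.cpp -o benchmark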