I have a piece of C++ code parallelized with OpenMP, shown below.
#include <vector>
#include <fstream>
using namespace std;

const int rowSize = 10000;
const int N = 200;
const int nrow = 100; // varied in the tests below

int main() {
    // initialise the results vector and a shared index vector
    vector<int> results((long)nrow * rowSize);
    vector<int> sharedArr(rowSize);
    for (int i = 0; i < rowSize; ++i) sharedArr[i] = rowSize - 1 - i;

    ifstream f("some_40GB_file", ios::binary);

    #pragma omp parallel for
    for (int i = 0; i < nrow; ++i) {
        // load one row from the file; the critical section is needed
        // because all threads share the single ifstream
        long filerow[rowSize];
        #pragma omp critical
        {
            f.seekg((long)i * rowSize * sizeof(long));
            f.read((char*)filerow, rowSize * sizeof(long));
        }
        // do some computations and store them in the results vector
        for (int j = 0; j < rowSize; ++j) {
            long res = filerow[sharedArr[j]];
            for (int k = 0; k < N; ++k) res += (j + k) * k * k;
            results[(long)i * rowSize + j] = res;
        }
    }
    f.close();
}
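For reference, the code is built as a single file with OpenMP enabled; assuming GCC, the compile command is along the lines of:

g++ -O2 -fopenmp main.cpp -o main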
The code does not seem to scale well with nrow, especially once nrow reaches 100000 or higher. Here are some timing results:
serial:
nrow: 100,   300,   1000,  3000,  10000,  30000,  100000,  200000,  300000
time: 0.196, 0.601, 1.872, 5.611, 19.796, 57.474, 192.787, 392.584, 580.323
parallel (8 cores):
nrow: 100,   300,   1000,  3000,  10000,  30000,  100000,  200000,  300000
time: 0.047, 0.134, 0.396, 1.164, 3.875,  11.574, 55.219,  108.113, 224.322
speed-up:
nrow:   100,  300,  1000, 3000, 10000, 30000, 100000, 200000, 300000
result: 4.17, 4.48, 4.73, 4.82, 5.11,  4.97,  3.49,   3.63,   2.59
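(The timings are wall-clock seconds, measured with something like the following harness; run_rows() is a hypothetical stand-in for the loop above:)

#include <omp.h>
#include <cstdio>

// hypothetical stand-in for the code in the question
static void run_rows() { /* ... the #pragma omp parallel for loop ... */ }

int main() {
    double t0 = omp_get_wtime();  // wall-clock time, not per-thread CPU time
    run_rows();
    double t1 = omp_get_wtime();
    printf("time: %.3f\n", t1 - t0);
}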
The runtime scales nicely with nrow for the serial program, but the parallel program stops scaling well beyond nrow = 100000. Why is that? I suspected it has something to do with the I/O access once the index exceeds the integer limit (i.e. 2^31), but I am not sure.
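For what it's worth, the (long) casts in the code should already keep both the element index and the seekg byte offset out of int range; here is a small sketch of the arithmetic at the largest test size (the check program is just illustrative):

#include <climits>
#include <cstdio>

int main() {
    const int rowSize = 10000;
    const int nrow = 300000;
    // 299999 * 10000 = 2,999,990,000 > INT_MAX (2,147,483,647), so i*rowSize
    // computed in plain int would overflow at this size; the (long) cast avoids that.
    long lastIndex  = (long)(nrow - 1) * rowSize;
    long lastOffset = lastIndex * (long)sizeof(long); // byte offset passed to seekg
    printf("INT_MAX     = %d\n", INT_MAX);
    printf("last index  = %ld\n", lastIndex);
    printf("last offset = %ld bytes (~24 GB)\n", lastOffset);
}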
Also, the best speed-up achieved seems to be only 5.11, even though 8 cores are available. Is there a way to make the parallel code more efficient?
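One idea I have considered but not benchmarked: give every thread its own ifstream, so the row reads no longer have to be funneled through the critical section. A sketch (same constants as above; error handling omitted):

#include <vector>
#include <fstream>
using namespace std;

const int rowSize = 10000;
const int N = 200;
const int nrow = 100;

int main() {
    vector<int> results((long)nrow * rowSize);
    vector<int> sharedArr(rowSize);
    for (int i = 0; i < rowSize; ++i) sharedArr[i] = rowSize - 1 - i;

    #pragma omp parallel
    {
        // each thread opens its own stream, so no critical section is needed
        ifstream f("some_40GB_file", ios::binary);
        vector<long> filerow(rowSize); // per-thread buffer, allocated once
        #pragma omp for
        for (int i = 0; i < nrow; ++i) {
            f.seekg((long)i * rowSize * sizeof(long));
            f.read((char*)filerow.data(), rowSize * sizeof(long));
            for (int j = 0; j < rowSize; ++j) {
                long res = filerow[sharedArr[j]];
                for (int k = 0; k < N; ++k) res += (j + k) * k * k;
                results[(long)i * rowSize + j] = res;
            }
        }
    }
}

Whether this actually helps presumably depends on how the underlying storage copes with concurrent reads.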