I need to compute the crossprod of a huge matrix X with a vector y, i.e. X'y. X has n = 2500 rows and p = 1,000,000 columns, and y is a 2500 x 1 vector. Since the X file is far too large to load into RAM, I split it into 10 chunks, each with 2500 rows and chunk_cols = 100,000 columns. I then read the chunks into memory one at a time and compute the crossprod column by column.
I should also mention that X was preprocessed into a binary file that stores the matrix column by column in one long vector of doubles. So in the binary file, the first (double) element is X[0,0], the second is X[1,0], and so on.
I tried to parallelize the crossprod part with OpenMP, hoping to speed up the computation. But it turned out that the parallelized version was worse: about 50 seconds slower.
4 cores:
Unit: seconds
                                                                expr      min       lq     mean   median       uq      max neval
 res.cpp <- crossprod_cpp(xfname, chunk_cols, y, n, p, useCores = 4) 122.5716 122.5716 122.5716 122.5716 122.5716 122.5716     1
1 core:
Unit: seconds
                                                                expr      min       lq     mean   median       uq      max neval
 res.cpp <- crossprod_cpp(xfname, chunk_cols, y, n, p, useCores = 1) 72.56355 72.56355 72.56355 72.56355 72.56355 72.56355     1
Below is my code.
#include <fstream>
#include <cstdlib>
#include <omp.h>
#include <Rcpp.h>
using namespace Rcpp;
using namespace std;

// [[Rcpp::export]]
NumericVector crossprod_cpp(SEXP filename, int chunk_cols, const std::vector<double>& y,
                            int n, int p, int useCores) {
  NumericVector result(p);
  unsigned long int chunk_size = chunk_cols * sizeof(double) * n;
  const char *xfname = CHAR(Rf_asChar(filename));
  ifstream xfile(xfname, ios::in | ios::binary);
  if (xfile.is_open()) {
    streampos size_x;
    char *memblock_x;
    int chunks = 0;
    int col_pos = 0;
    xfile.seekg(0, ios::end);
    size_x = xfile.tellg();
    xfile.seekg(0, ios::beg);
    chunks = size_x / chunk_size;
    double *X;
    omp_set_dynamic(0);
    omp_set_num_threads(useCores);
    memblock_x = (char *) calloc(chunk_size, sizeof(char));
    for (int chunk_i = 0; chunk_i < chunks; chunk_i++) {
      col_pos = chunk_i * chunk_cols;  // first column of this chunk
      xfile.seekg(chunk_i * chunk_size, ios::beg);
      xfile.read(memblock_x, chunk_size);
      size_t count = xfile.gcount();
      if (!count) {
        Rprintf("\n\t Error in reading chunk %d\n", chunk_i);
        break;
      }
      X = (double*) memblock_x;
      // Loop over the loaded columns and accumulate the crossprod.
      // The loop indices are declared inside the loops so that each thread
      // gets its own copies; with a function-scope `int i, j;` the inner
      // index i would be shared across threads, which is a data race.
      #pragma omp parallel for schedule(dynamic)
      for (int j = 0; j < chunk_cols; j++) {
        for (int i = 0; i < n; i++) {
          result[col_pos + j] += X[j*n + i] * y[i];
        }
      }
    }
    free(memblock_x);
    xfile.close();
  } else {
    Rprintf("Open file failed! filename = %s, chunk_size = %lu\n", xfname, chunk_size);
  }
  return result;
}
I am compiling with clang-omp++ on Mac OS X. The compilation commands are as follows.
/Library/Frameworks/R.framework/Resources/bin/R CMD SHLIB -o 'sourceCpp_3.so' --preclean 'test_XTy.cpp'
clang-omp++ -std=c++11 -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -I"/Library/Frameworks/R.framework/Versions/3.2/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.2/Resources/library/BH/include" -I"/Library/Frameworks/R.framework/Versions/3.2/Resources/library/bigmemory/include" -I"/Users/yazeng/GitHub" -fopenmp -std=c++11 -fPIC -Wall -mtune=core2 -g -O2 -c test_XTy.cpp -o test_XTy.o
clang-omp++ -std=c++11 -dynamiclib -Wl,-headerpad_max_install_names -undefined dynamic_lookup -single_module -multiply_defined suppress -L/Library/Frameworks/R.framework/Resources/lib -L/usr/local/lib -o sourceCpp_3.so test_XTy.o -fopenmp -lgomp -F/Library/Frameworks/R.framework/.. -framework R -Wl,-framework -Wl,CoreFoundation
The parallel section is quite simple, repeated below. I would really appreciate any suggestions. Many thanks in advance!
// loop over loaded columns and do crossprod
// (indices declared inside the loops so each thread gets private copies)
#pragma omp parallel for schedule(dynamic)
for (int j = 0; j < chunk_cols; j++) {
  for (int i = 0; i < n; i++) {
    result[col_pos + j] += X[j*n + i] * y[i];
  }
}
EDIT
I have managed to parallelize the crossprod part. However, it turns out that the time-consuming part is the file reading (not surprisingly). Below are the timings of the chunk reads and the crossprod computations using 1 core (serial code).
Chunk 0, read start: Now time: 2016-04-14 14:55:19.000
Chunk 0, read end: Now time: 2016-04-14 14:55:27.000
Crossprod start: Now time: 2016-04-14 14:55:27.000
Crossprod end: Now time: 2016-04-14 14:55:27.000
Chunk 1, read start: Now time: 2016-04-14 14:55:27.000
Chunk 1, read end: Now time: 2016-04-14 14:55:33.000
Crossprod start: Now time: 2016-04-14 14:55:33.000
Crossprod end: Now time: 2016-04-14 14:55:34.000
Chunk 2, read start: Now time: 2016-04-14 14:55:34.000
Chunk 2, read end: Now time: 2016-04-14 14:55:40.000
Crossprod start: Now time: 2016-04-14 14:55:40.000
Crossprod end: Now time: 2016-04-14 14:55:40.000
Chunk 3, read start: Now time: 2016-04-14 14:55:40.000
Chunk 3, read end: Now time: 2016-04-14 14:55:47.000
Crossprod start: Now time: 2016-04-14 14:55:47.000
Crossprod end: Now time: 2016-04-14 14:55:47.000
Chunk 4, read start: Now time: 2016-04-14 14:55:47.000
Chunk 4, read end: Now time: 2016-04-14 14:55:53.000
Crossprod start: Now time: 2016-04-14 14:55:53.000
Crossprod end: Now time: 2016-04-14 14:55:53.000
Chunk 5, read start: Now time: 2016-04-14 14:55:53.000
Chunk 5, read end: Now time: 2016-04-14 14:56:00.000
Crossprod start: Now time: 2016-04-14 14:56:00.000
Crossprod end: Now time: 2016-04-14 14:56:00.000
Chunk 6, read start: Now time: 2016-04-14 14:56:00.000
Chunk 6, read end: Now time: 2016-04-14 14:56:06.000
Crossprod start: Now time: 2016-04-14 14:56:06.000
Crossprod end: Now time: 2016-04-14 14:56:07.000
Chunk 7, read start: Now time: 2016-04-14 14:56:07.000
Chunk 7, read end: Now time: 2016-04-14 14:56:13.000
Crossprod start: Now time: 2016-04-14 14:56:13.000
Crossprod end: Now time: 2016-04-14 14:56:13.000
Chunk 8, read start: Now time: 2016-04-14 14:56:13.000
Chunk 8, read end: Now time: 2016-04-14 14:56:20.000
Crossprod start: Now time: 2016-04-14 14:56:20.000
Crossprod end: Now time: 2016-04-14 14:56:20.000
Chunk 9, read start: Now time: 2016-04-14 14:56:20.000
Chunk 9, read end: Now time: 2016-04-14 14:56:26.000
Crossprod start: Now time: 2016-04-14 14:56:26.000
Crossprod end: Now time: 2016-04-14 14:56:27.000
> print(bench.cpp)
Unit: seconds
                                                                 expr      min       lq     mean   median       uq      max neval
 res.cpp1 <- crossprod_cpp(xfname, chunk_cols, y, n, p, useCores = 1) 67.51534 67.51534 67.51534 67.51534 67.51534 67.51534     1
As you can see, each chunk read takes more than 6 seconds (each chunk is about 2 GB), while the computation takes only about 1 second even with 1 core. So parallelizing the crossprod part will not help no matter what.
So my question is: how can I parallelize the file reading?
I learned from here and here that file I/O is essentially sequential, since the kernel will have to bring the disk file in sequentially anyway. So I wonder whether there is any smart way to parallelize the file reading. Would MPI work in this situation? Or are there any other tips to speed up this function?
Many thanks!
END EDIT
EDIT 2: Would GPU parallel computing work here?
I am new to GPU computing, so I am just thinking out loud here. Would a carefully programmed GPU implementation bring any possible speedup? My guess is no, since the disk reading is the bottleneck, but I would really like to know whether it could help.
END EDIT 2