How to parallelize reading chunks from the same file?

Date: 2016-04-14 16:50:36

Tags: c++ parallel-processing openmp gpu-programming

I need to compute the cross product of a huge matrix X and a vector y. X has n = 2500 rows and p = 1,000,000 columns; y is a 2500 x 1 vector. Since the X file is huge and cannot be loaded into RAM, I split it into 10 chunks, each containing 2500 rows and chunk_cols = 100,000 columns. I then read the chunks into memory one at a time and compute the cross product column by column.

I should also mention that the X file has been preprocessed into a binary file that stores the columns one after another in one long vector of doubles. So in the binary file the first (double) element is X[0,0], the second is X[1,0], and so on.

I tried to parallelize the crossprod part with OpenMP, hoping to speed up the computation. But the parallel version turned out to be worse, about 50 seconds slower.

4 cores:

Unit: seconds
                                                                expr      min       lq
 res.cpp <- crossprod_cpp(xfname, chunk_cols, y, n, p, useCores = 4) 122.5716 122.5716
     mean   median       uq      max neval
 122.5716 122.5716 122.5716 122.5716     1

1 core:

Unit: seconds
                                                                expr      min       lq
 res.cpp <- crossprod_cpp(xfname, chunk_cols, y, n, p, useCores = 1) 72.56355 72.56355
     mean   median       uq      max neval
 72.56355 72.56355 72.56355 72.56355     1

Here is my code.

// [[Rcpp::export]]
NumericVector crossprod_cpp(SEXP filename, int chunk_cols, const std::vector<double>& y, 
                            int n, int p, int useCores) {
  NumericVector result(p);
  unsigned long int chunk_size = chunk_cols * sizeof(double) * n;
  const char *xfname = CHAR(Rf_asChar(filename));
  ifstream xfile(xfname, ios::in|ios::binary);

  int i, j;

  if (xfile.is_open()) {
    streampos size_x;
    char *memblock_x;
    int chunk_i = 0;
    int chunks = 0;
    int col_pos = 0;

    xfile.seekg (0, ios::end);
    size_x = xfile.tellg();
    xfile.seekg (0, ios::beg);
    chunks = size_x / chunk_size;
    double *X;

    omp_set_dynamic(0);
    omp_set_num_threads(useCores);

    memblock_x = (char *) calloc(chunk_size, sizeof(char));
    for(chunk_i = 0; chunk_i < chunks; chunk_i++) {
      col_pos = chunk_i * chunk_cols; // current column position;
      xfile.seekg (chunk_i * chunk_size, ios::beg);
      xfile.read (memblock_x, chunk_size);

      size_t count = xfile.gcount();
      if (!count) {
        Rprintf("\n\t Error in reading chunk %d\n", chunk_i);
        break;
      } 
      X = (double*) memblock_x;

      // loop over loaded columns and do crossprod.
      // NOTE: 'j' is privatized automatically as the loop variable of the
      // parallel for, but 'i' must be made private explicitly; sharing it
      // is a data race that corrupts the result and serializes the threads.
      // The iterations are uniform, so schedule(static) is the cheaper choice.
      #pragma omp parallel for private(i) schedule(static)
      for (j = 0; j < chunk_cols; j++) {
        double sum = 0.0; // thread-local accumulator
        for (i = 0; i < n; i++) {
          sum += X[j*n+i] * y[i];
        }
        result[col_pos + j] = sum; // each column is written exactly once
      }
    }
    free(memblock_x);
    xfile.close();

  } else {
    Rprintf("Open file failed! filename = %s, chunk_size = %lu\n", xfname, chunk_size);
  }

  return result;
}

I am using the clang-omp++ compiler on Mac OS X.

The compilation output is as follows.

/Library/Frameworks/R.framework/Resources/bin/R CMD SHLIB -o 'sourceCpp_3.so' --preclean  'test_XTy.cpp'  
clang-omp++ -std=c++11 -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG  -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include  -I"/Library/Frameworks/R.framework/Versions/3.2/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.2/Resources/library/BH/include" -I"/Library/Frameworks/R.framework/Versions/3.2/Resources/library/bigmemory/include" -I"/Users/yazeng/GitHub"   -fopenmp -std=c++11 -fPIC  -Wall -mtune=core2 -g -O2 -c test_XTy.cpp -o test_XTy.o
clang-omp++ -std=c++11 -dynamiclib -Wl,-headerpad_max_install_names -undefined dynamic_lookup -single_module -multiply_defined suppress -L/Library/Frameworks/R.framework/Resources/lib -L/usr/local/lib -o sourceCpp_3.so test_XTy.o -fopenmp -lgomp -F/Library/Frameworks/R.framework/.. -framework R -Wl,-framework -Wl,CoreFoundation

The parallel part is really simple, as shown below. I would greatly appreciate any suggestions. Many thanks in advance!!!

// loop over loaded columns and do crossprod
// ('i' is made private to avoid a data race; a local accumulator
// avoids repeated writes to the shared result vector)
#pragma omp parallel for private(i) schedule(static)
for (j = 0; j < chunk_cols; j++) {
   double sum = 0.0;
   for (i = 0; i < n; i++) {
      sum += X[j*n+i] * y[i];
   }
   result[col_pos + j] = sum;
}

EDIT

I have since figured out how to parallelize the crossprod part. However, it turns out that the time-consuming part is the file reading (not surprisingly). Below are the timings of the chunk reads and crossprod computations using 1 core (the serial code).

Chunk 0, read start: Now time: 2016-04-14 14:55:19.000
Chunk 0, read end: Now time: 2016-04-14 14:55:27.000
Crossprod start: Now time: 2016-04-14 14:55:27.000
Crossprod start: Now time: 2016-04-14 14:55:27.000
Chunk 1, read start: Now time: 2016-04-14 14:55:27.000
Chunk 1, read end: Now time: 2016-04-14 14:55:33.000
Crossprod start: Now time: 2016-04-14 14:55:33.000
Crossprod start: Now time: 2016-04-14 14:55:34.000
Chunk 2, read start: Now time: 2016-04-14 14:55:34.000
Chunk 2, read end: Now time: 2016-04-14 14:55:40.000
Crossprod start: Now time: 2016-04-14 14:55:40.000
Crossprod start: Now time: 2016-04-14 14:55:40.000
Chunk 3, read start: Now time: 2016-04-14 14:55:40.000
Chunk 3, read end: Now time: 2016-04-14 14:55:47.000
Crossprod start: Now time: 2016-04-14 14:55:47.000
Crossprod start: Now time: 2016-04-14 14:55:47.000
Chunk 4, read start: Now time: 2016-04-14 14:55:47.000
Chunk 4, read end: Now time: 2016-04-14 14:55:53.000
Crossprod start: Now time: 2016-04-14 14:55:53.000
Crossprod start: Now time: 2016-04-14 14:55:53.000
Chunk 5, read start: Now time: 2016-04-14 14:55:53.000
Chunk 5, read end: Now time: 2016-04-14 14:56:00.000
Crossprod start: Now time: 2016-04-14 14:56:00.000
Crossprod start: Now time: 2016-04-14 14:56:00.000
Chunk 6, read start: Now time: 2016-04-14 14:56:00.000
Chunk 6, read end: Now time: 2016-04-14 14:56:06.000
Crossprod start: Now time: 2016-04-14 14:56:06.000
Crossprod start: Now time: 2016-04-14 14:56:07.000
Chunk 7, read start: Now time: 2016-04-14 14:56:07.000
Chunk 7, read end: Now time: 2016-04-14 14:56:13.000
Crossprod start: Now time: 2016-04-14 14:56:13.000
Crossprod start: Now time: 2016-04-14 14:56:13.000
Chunk 8, read start: Now time: 2016-04-14 14:56:13.000
Chunk 8, read end: Now time: 2016-04-14 14:56:20.000
Crossprod start: Now time: 2016-04-14 14:56:20.000
Crossprod start: Now time: 2016-04-14 14:56:20.000
Chunk 9, read start: Now time: 2016-04-14 14:56:20.000
Chunk 9, read end: Now time: 2016-04-14 14:56:26.000
Crossprod start: Now time: 2016-04-14 14:56:26.000
Crossprod start: Now time: 2016-04-14 14:56:27.000
> print(bench.cpp)
Unit: seconds
                                                                 expr      min       lq
 res.cpp1 <- crossprod_cpp(xfname, chunk_cols, y, n, p, useCores = 1) 67.51534 67.51534
     mean   median       uq      max neval
 67.51534 67.51534 67.51534 67.51534     1

As can be seen, each chunk takes more than 6 seconds to read (each chunk is about 2 GB), while the computation takes only about 1 second even with 1 core. So no matter how I parallelize the crossprod part, it will not speed things up.

So my question is: how can I parallelize the file reading?

I learned from here and here that file I/O is basically sequential, since "the kernel will have to bring the disk file in sequentially anyway". So I am wondering whether there is any smart way to parallelize the file reading. Would MPI work in this case? Or are there any tips to speed up this function?

Thanks a lot!

END EDIT

EDIT 2: Would GPU parallel computing work here?

I am new to GPU computing, but thinking out loud here: I am wondering whether carefully programmed GPU computation could bring any speedup. My guess is no, since the disk read is the bottleneck, but I would really like to know whether it could help.

END EDIT 2

0 Answers:

No answers