Why does my program produce random results when I nest the loops?

Date: 2017-01-07 20:48:50

Tags: c gcc parallel-processing nested openmp

I made this parallel matrix multiplication program using nested for loops in OpenMP. When I run the program, the answers are displayed in a (mostly) random order, with the indices of the resulting matrix coming out differently on each run. Here is a snippet of the code:

#pragma omp parallel for
for(i=0;i<N;i++){
    #pragma omp parallel for
    for(j=0;j<N;j++){
        C[i][j]=0;
        #pragma omp parallel for
        for(m=0;m<N;m++){
            C[i][j]=A[i][m]*B[m][j]+C[i][j];
        }
        printf("C:i=%d j=%d %f \n",i,j,C[i][j]);
    }
}

1 Answer:

Answer 0 (score: 1)

These are the symptoms of what is known as a "race condition", as the commenters have already pointed out.

The threads OpenMP uses are independent of each other, but the results of the individual loops of the matrix multiplication are not, so one thread may be at a different position than another at any given moment, and you are suddenly in trouble if you depend on the order of the results. On top of that, if nested parallelism is enabled, the innermost loop has several threads updating the same C[i][j] at the same time, which corrupts the values themselves.
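As a minimal sketch of how the posted snippet could be rewritten (not the complete program below): parallelize only the outermost loop and move the printf out of the parallel region so the output order is fixed. N, A, B and C are assumed to be declared and filled exactly as in the question; declaring j and m inside the loops simply makes it explicit that every thread works on its own copies.

#pragma omp parallel for
for (int i = 0; i < N; i++) {          // only the outer loop is parallel
    for (int j = 0; j < N; j++) {      // j and m are per-thread variables
        C[i][j] = 0;
        for (int m = 0; m < N; m++) {
            C[i][j] = A[i][m] * B[m][j] + C[i][j];
        }
    }
}
// print afterwards, outside the parallel region, in a deterministic order
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        printf("C:i=%d j=%d %f \n", i, j, C[i][j]);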

You should parallelize only the outermost loop. A complete example:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

int main(int argc, char **argv)
{
  int n;
  double **A, **B, **C, **D, t;
  int i, j, k;
  struct timeval start, stop;

  if (argc != 2) {
    fprintf(stderr, "Usage: %s a positive integer >= 2 and < 1 mio\n", argv[0]);
    exit(EXIT_FAILURE);
  }

  n = atoi(argv[1]);
  if (n < 2 || n >= 1000000) {
    fprintf(stderr, "Usage: %s a positive integer >= 2 and < 1 mio\n", argv[0]);
    exit(EXIT_FAILURE);
  }
  // make it repeatable
  srand(0xdeadbeef);

  // allocate memory for and initialize A
  A = malloc(sizeof(*A) * n);
  for (i = 0; i < n; i++) {
    A[i] = malloc(sizeof(**A) * n);
    for (j = 0; j < n; j++) {
      A[i][j] = (double) ((rand() % 100) / 99.);
    }
  }
  // do the same for B
  B = malloc(sizeof(*B) * n);
  for (i = 0; i < n; i++) {
    B[i] = malloc(sizeof(**B) * n);
    for (j = 0; j < n; j++) {
      B[i][j] = (double) ((rand() % 100) / 99.);
    }
  }

  // and C but initialize with zero
  C = malloc(sizeof(*C) * n);
  for (i = 0; i < n; i++) {
    C[i] = malloc(sizeof(**C) * n);
    for (j = 0; j < n; j++) {
      C[i][j] = 0.0;
    }
  }

  // ditto with D
  D = malloc(sizeof(*D) * n);
  for (i = 0; i < n; i++) {
    D[i] = malloc(sizeof(**D) * n);
    for (j = 0; j < n; j++) {
      D[i][j] = 0.0;
    }
  }

  // some coarse timing
  gettimeofday(&start, NULL);
  // naive matrix multiplication
  for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
      for (k = 0; k < n; k++) {
        C[i][j] = C[i][j] + A[i][k] * B[k][j];
      }
    }
  }
  gettimeofday(&stop, NULL);
  t = ((stop.tv_sec - start.tv_sec) * 1000000u +
       stop.tv_usec - start.tv_usec) / 1.e6;
  printf("Timing for naive run    = %.10g\n", t);

  gettimeofday(&start, NULL);
  // parallelize only the outermost loop; D is written by the threads,
  // so it is listed as shared along with the inputs
#pragma omp parallel shared(A, B, C, D) private(i, j, k)
#pragma omp for
  for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
      for (k = 0; k < n; k++) {
        D[i][j] = D[i][j] + A[i][k] * B[k][j];
      }
    }
  }
  gettimeofday(&stop, NULL);
  t = ((stop.tv_sec - start.tv_sec) * 1000000u +
       stop.tv_usec - start.tv_usec) / 1.e6;
  printf("Timing for parallel run = %.10g\n", t);

  // check result
  for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
      if (D[i][j] != C[i][j]) {
        printf("Cell %d,%d differs with delta(D_ij-C_ij) = %.20g\n", i, j,
               D[i][j] - C[i][j]);
      }
    }
  }

  // clean up
  for (i = 0; i < n; i++) {
    free(A[i]);
    free(B[i]);
    free(C[i]);
    free(D[i]);
  }
  free(A);
  free(B);
  free(C);
  free(D);

  puts("All ok? Bye");

  exit(EXIT_SUCCESS);
}

(n > 2000 may require some patience before you get a result.)

But that is not the whole story. You can (but should not) try to parallelize the innermost loop with something like
sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (k = 0; k < n; k++) {
    sum += A[i][k] * B[k][j];
}
D[i][j] = sum;
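For context, this is how that fragment would be embedded if only the innermost loop is parallelized; the outer i and j loops stay serial and sum is an ordinary local double (this embedding is my assumption, the fragment above shows only the inner part):

// serial outer loops; only the innermost loop runs in parallel,
// so a parallel region is started n*n times
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (k = 0; k < n; k++) {
            sum += A[i][k] * B[k][j];
        }
        D[i][j] = sum;
    }
}

Starting a parallel region n*n times carries a lot of overhead, which is one reason this variant does not pay off in the timings below.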

It does not seem to be any faster, and for small n it is even slower. With the original code and n = 2500 (one run only):

Timing for naive run    = 124.466307
Timing for parallel run = 44.154538

The same with the reduction:

Timing for naive run    = 119.586365
Timing for parallel run = 43.288371

With a smaller n = 500:

Timing for naive run    = 0.444061
Timing for parallel run = 0.150842

At this size the reduction version is already slower:

Timing for naive run    = 0.447894
Timing for parallel run = 0.245481

A very large n might win in the end, but I lack the necessary patience. Nevertheless, one last run with n = 4000 (OpenMP part only):

Normal:

Timing for parallel run = 174.647404

Reduction:

Timing for parallel run = 179.062463

That difference is still well within the error bars.

A better way to multiply large matrices (from roughly n > 100 on) is Strassen's algorithm; the similarly named Schönhage–Strassen algorithm is the analogous trick for multiplying large integers.
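Purely as an illustration of the idea (not a drop-in replacement for the program above): a compact sketch of Strassen's algorithm for square matrices whose size is a power of two, stored as flat row-major arrays rather than the double ** layout used above. The cutoff of 64 and the helper names are arbitrary choices of mine, and error checking is omitted for brevity.

#include <stdlib.h>

/* elementwise helpers on contiguous n x n blocks */
static void madd(const double *X, const double *Y, double *Z, int n) {
  for (int i = 0; i < n * n; i++) Z[i] = X[i] + Y[i];
}
static void msub(const double *X, const double *Y, double *Z, int n) {
  for (int i = 0; i < n * n; i++) Z[i] = X[i] - Y[i];
}

/* C = A * B for row-major n x n matrices, n a power of two */
static void strassen(const double *A, const double *B, double *C, int n) {
  if (n <= 64) {                      /* below the cutoff: naive product */
    for (int i = 0; i < n; i++)
      for (int j = 0; j < n; j++) {
        double s = 0.0;
        for (int k = 0; k < n; k++) s += A[i * n + k] * B[k * n + j];
        C[i * n + j] = s;
      }
    return;
  }
  int h = n / 2;
  /* 17 scratch blocks of size h x h: 8 sub-blocks, 7 products, 2 temporaries */
  double *buf = malloc(sizeof(double) * (size_t)h * h * 17);
  double *A11 = buf,         *A12 = A11 + h * h, *A21 = A12 + h * h, *A22 = A21 + h * h;
  double *B11 = A22 + h * h, *B12 = B11 + h * h, *B21 = B12 + h * h, *B22 = B21 + h * h;
  double *M1  = B22 + h * h, *M2  = M1 + h * h,  *M3  = M2 + h * h,  *M4  = M3 + h * h;
  double *M5  = M4 + h * h,  *M6  = M5 + h * h,  *M7  = M6 + h * h;
  double *T1  = M7 + h * h,  *T2  = T1 + h * h;

  /* split A and B into their four quadrants */
  for (int i = 0; i < h; i++)
    for (int j = 0; j < h; j++) {
      A11[i * h + j] = A[i * n + j];           A12[i * h + j] = A[i * n + j + h];
      A21[i * h + j] = A[(i + h) * n + j];     A22[i * h + j] = A[(i + h) * n + j + h];
      B11[i * h + j] = B[i * n + j];           B12[i * h + j] = B[i * n + j + h];
      B21[i * h + j] = B[(i + h) * n + j];     B22[i * h + j] = B[(i + h) * n + j + h];
    }

  /* the seven Strassen products */
  madd(A11, A22, T1, h); madd(B11, B22, T2, h); strassen(T1, T2, M1, h);
  madd(A21, A22, T1, h);                        strassen(T1, B11, M2, h);
  msub(B12, B22, T2, h);                        strassen(A11, T2, M3, h);
  msub(B21, B11, T2, h);                        strassen(A22, T2, M4, h);
  madd(A11, A12, T1, h);                        strassen(T1, B22, M5, h);
  msub(A21, A11, T1, h); madd(B11, B12, T2, h); strassen(T1, T2, M6, h);
  msub(A12, A22, T1, h); madd(B21, B22, T2, h); strassen(T1, T2, M7, h);

  /* recombine the quadrants of C */
  for (int i = 0; i < h; i++)
    for (int j = 0; j < h; j++) {
      C[i * n + j]           = M1[i * h + j] + M4[i * h + j] - M5[i * h + j] + M7[i * h + j];
      C[i * n + j + h]       = M3[i * h + j] + M5[i * h + j];
      C[(i + h) * n + j]     = M2[i * h + j] + M4[i * h + j];
      C[(i + h) * n + j + h] = M1[i * h + j] - M2[i * h + j] + M3[i * h + j] + M6[i * h + j];
    }
  free(buf);
}

Strassen replaces the eight block products of the naive divide-and-conquer scheme with seven, which brings the complexity down from O(n^3) to roughly O(n^2.81); whether it actually pays off at n around 100 depends heavily on the implementation and the hardware.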

Oh, and I used square matrices only for convenience, not because they have to be of that shape! But if you have rectangular matrices with very different side lengths, you might try changing the way the loops run; whether you walk along rows first or along columns first can make a significant difference here.
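One concrete illustration of what "changing the way the loops run" can mean, for the square case above (same variables as in the program; this sketch is my addition, not part of the original answer): reordering the loops to i-k-j lets the innermost loop stream through a row of B and a row of D instead of a column of B, which suits row-major C arrays much better.

// i-k-j order: the innermost loop walks B[k][...] and D[i][...] row by row,
// so consecutive iterations touch consecutive memory addresses
#pragma omp parallel for private(j, k)
for (i = 0; i < n; i++) {
  for (k = 0; k < n; k++) {
    double a = A[i][k];          // loaded once, reused across the inner loop
    for (j = 0; j < n; j++) {
      D[i][j] += a * B[k][j];
    }
  }
}

(D has to start out as all zeros, which it already does in the program above.)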