I made this parallel matrix multiplication program using nested for loops in OpenMP. When I run the program, the answers it prints are (mostly) random, and the resulting matrix comes out different on different runs. Here is the code snippet:
#pragma omp parallel for
for (i = 0; i < N; i++) {
    #pragma omp parallel for
    for (j = 0; j < N; j++) {
        C[i][j] = 0;
        #pragma omp parallel for
        for (m = 0; m < N; m++) {
            C[i][j] = A[i][m] * B[m][j] + C[i][j];
        }
        printf("C:i=%d j=%d %f \n", i, j, C[i][j]);
    }
}
Answer (score: 1):
These are the symptoms of a so-called race condition, as the commenters have already pointed out.
The threads OpenMP uses are independent of each other, but the partial results of the individual loops of the matrix multiplication are not, so one thread may be at a different point than another, and if you depend on the order in which those results arrive you suddenly get into trouble.
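Applied directly to your snippet, a minimal sketch of the fix (assuming i, j and m are declared outside the loops, as in the question) is to keep only the outermost pragma and declare the inner counters private, because with just the outer pragma they would otherwise be shared between the threads:

/* Minimal sketch: only the outermost loop is parallelized.
 * i is made private automatically as the parallelized loop's counter;
 * j and m must be declared private so each thread gets its own copy.
 * The printf is left out here: printing from inside a parallel region
 * interleaves the output of the threads. */
#pragma omp parallel for private(j, m)
for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        C[i][j] = 0;
        for (m = 0; m < N; m++) {
            C[i][j] = A[i][m] * B[m][j] + C[i][j];
        }
    }
}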
In other words, you can only parallelize the outermost loop. Here is a complete, self-contained example:
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

int main(int argc, char **argv)
{
    int n;
    double **A, **B, **C, **D, t;
    int i, j, k;
    struct timeval start, stop;

    if (argc != 2) {
        fprintf(stderr, "Usage: %s a positive integer >= 2 and < 1 mio\n", argv[0]);
        exit(EXIT_FAILURE);
    }
    n = atoi(argv[1]);
    if (n <= 2 || n >= 1000000) {
        fprintf(stderr, "Usage: %s a positive integer >= 2 and < 1 mio\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    // make it repeatable
    srand(0xdeadbeef);

    // allocate memory for and initialize A
    A = malloc(sizeof(*A) * n);
    for (i = 0; i < n; i++) {
        A[i] = malloc(sizeof(**A) * n);
        for (j = 0; j < n; j++) {
            A[i][j] = (double) ((rand() % 100) / 99.);
        }
    }

    // do the same for B
    B = malloc(sizeof(*B) * n);
    for (i = 0; i < n; i++) {
        B[i] = malloc(sizeof(**B) * n);
        for (j = 0; j < n; j++) {
            B[i][j] = (double) ((rand() % 100) / 99.);
        }
    }

    // and C but initialize with zero
    C = malloc(sizeof(*C) * n);
    for (i = 0; i < n; i++) {
        C[i] = malloc(sizeof(**C) * n);
        for (j = 0; j < n; j++) {
            C[i][j] = 0.0;
        }
    }

    // ditto with D
    D = malloc(sizeof(*D) * n);
    for (i = 0; i < n; i++) {
        D[i] = malloc(sizeof(**D) * n);
        for (j = 0; j < n; j++) {
            D[i][j] = 0.0;
        }
    }

    // some coarse timing
    gettimeofday(&start, NULL);

    // naive matrix multiplication
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            for (k = 0; k < n; k++) {
                C[i][j] = C[i][j] + A[i][k] * B[k][j];
            }
        }
    }
    gettimeofday(&stop, NULL);
    t = ((stop.tv_sec - start.tv_sec) * 1000000u +
         stop.tv_usec - start.tv_usec) / 1.e6;
    printf("Timing for naive run = %.10g\n", t);

    gettimeofday(&start, NULL);
#pragma omp parallel shared(A, B, C) private(i, j, k)
#pragma omp for
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            for (k = 0; k < n; k++) {
                D[i][j] = D[i][j] + A[i][k] * B[k][j];
            }
        }
    }
    gettimeofday(&stop, NULL);
    t = ((stop.tv_sec - start.tv_sec) * 1000000u +
         stop.tv_usec - start.tv_usec) / 1.e6;
    printf("Timing for parallel run = %.10g\n", t);

    // check result
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            if (D[i][j] != C[i][j]) {
                printf("Cell %d,%d differs with delta(D_ij-C_ij) = %.20g\n", i, j,
                       D[i][j] - C[i][j]);
            }
        }
    }

    // clean up
    for (i = 0; i < n; i++) {
        free(A[i]);
        free(B[i]);
        free(C[i]);
        free(D[i]);
    }
    free(A);
    free(B);
    free(C);
    free(D);

    puts("All ok? Bye");
    exit(EXIT_SUCCESS);
}
(n > 2000 may require some patience before you get a result.)
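For reference (not from the original answer): with GCC the program needs to be built with OpenMP enabled, otherwise the pragmas are simply ignored and the "parallel" run is just a second serial run. Assuming the file is saved as matmul.c (a name chosen here only for illustration):

gcc -O2 -fopenmp matmul.c -o matmul
./matmul 2500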
Well, that is not entirely true. You can (but should not) try the innermost loop with something like

sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (k = 0; k < n; k++) {
    sum += A[i][k] * B[k][j];
}
D[i][j] = sum;

It does not seem to be any faster, and for small n it is even slower.
With the original code and n = 2500 (single run only):

Timing for naive run = 124.466307
Timing for parallel run = 44.154538

The same with the reduction:

Timing for naive run = 119.586365
Timing for parallel run = 43.288371

With a smaller n = 500:

Timing for naive run = 0.444061
Timing for parallel run = 0.150842

At that size the reduction version is already slower:

Timing for naive run = 0.447894
Timing for parallel run = 0.245481

A very large n might come out ahead, but I lack the necessary patience. Nevertheless, one last run with n = 4000 (OpenMP part only):

Normal:

Timing for parallel run = 174.647404

Reduction:

Timing for parallel run = 179.062463

That difference is still well within the margin of error.
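Another variant one could try (not benchmarked in the original answer, so this is only a sketch) is to leave the innermost loop alone and instead collapse the two outer loops into one parallel iteration space; it requires OpenMP 3.0 or newer and is meant as a drop-in replacement for the parallel loop in the example above:

/* Sketch: collapse the i and j loops into a single parallel iteration space.
 * i and j are made private automatically as collapsed loop counters;
 * k must be declared private. The k loop stays serial inside each thread,
 * so no reduction is needed. D is assumed to be zero-initialized. */
#pragma omp parallel for collapse(2) private(k)
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        for (k = 0; k < n; k++) {
            D[i][j] = D[i][j] + A[i][k] * B[k][j];
        }
    }
}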
A better way to multiply large matrices (starting at roughly n > 100) is the Strassen algorithm.
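For reference (not part of the original answer): Strassen's trick is to split each matrix into four n/2 x n/2 blocks and get by with 7 block multiplications instead of 8, which gives O(n^log2(7)) ~ O(n^2.81) instead of O(n^3). The seven products and the result blocks are

M1 = (A11 + A22) (B11 + B22)
M2 = (A21 + A22) B11
M3 = A11 (B12 - B22)
M4 = A22 (B21 - B11)
M5 = (A11 + A12) B22
M6 = (A21 - A11) (B11 + B12)
M7 = (A12 - A22) (B21 + B22)

C11 = M1 + M4 - M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 - M2 + M3 + M6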
Oh, and I only used square matrices for convenience, not because they have to be of that form! But if you have rectangular matrices with a large aspect ratio, you might try changing the way the loops run; iterating column-first or row-first can make a significant difference here.
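As an illustration of that remark (again only a sketch, not taken from the original answer), swapping the j and k loops turns the innermost accesses into row-wise sweeps over B and D, which is usually much friendlier to the cache than the column-wise strides of the i-j-k order; the floating-point rounding order changes slightly, but the mathematical result is the same:

/* Sketch: i-k-j loop order. A[i][k] is loaded once and reused across
 * the whole j sweep, and B[k][j], D[i][j] are walked row by row.
 * D is assumed to be zero-initialized; j and k are declared outside. */
#pragma omp parallel for private(j, k)
for (i = 0; i < n; i++) {
    for (k = 0; k < n; k++) {
        double a = A[i][k];    /* reused for every j */
        for (j = 0; j < n; j++) {
            D[i][j] += a * B[k][j];
        }
    }
}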