I am having problems with some bad caching: compared to the unparallelized version, I only obtain a small speedup with the following code.
matrix1 and matrix2 are sparse matrices stored as arrays of structs in (row, col, val) format.
#include <stdlib.h>

/* (row, col, val) format, as described above */
struct SparseRow {
    int row;
    int col;
    float val;
};

void pMultiply(struct SparseRow *matrix1, struct SparseRow *matrix2,
               int m1Rows, int m2Rows, struct SparseRow **result) {
    *result = malloc(1 * sizeof(struct SparseRow));
    int resultNonZeroEntries = 0;
    #pragma omp parallel for atomic
    for (int i = 0; i < m1Rows; i++) {
        int curM1Row = matrix1[i].row;
        int curM1Col = matrix1[i].col;
        float curM1Value = matrix1[i].val;
        for (int j = 0; j < m2Rows; j++) {
            int curM2Row = matrix2[j].row;
            int curM2Col = matrix2[j].col;
            float curM2Value = matrix2[j].val;
            if (curM1Col == curM2Row) {
                *result = realloc(*result,
                                  sizeof(struct SparseRow) * (resultNonZeroEntries + 1));
                (*result)[resultNonZeroEntries].row = curM1Row;
                (*result)[resultNonZeroEntries].col = curM2Col;
                (*result)[resultNonZeroEntries].val += curM1Value * curM2Value;
                resultNonZeroEntries++;
                break;
            }
        }
    }
}
Answer 0 (score: 0)
There are several problems there:

- The #pragma omp atomic directive must be placed directly in front of the single statement that needs protection from a race condition; it is not a valid clause of parallel for as written above. A short sketch follows this list.
- Reallocating memory at every step is likely to kill performance: if the block cannot be extended in place and has to be copied somewhere else, the realloc is slow.
- The realloc is also a source of errors, because the value of the pointer result is modified. While the reallocation takes place, the other threads keep running and may try to access memory at the "old" address, or several threads may try to reallocate result at the same time. Placing the whole realloc + addition part in a critical section would be safer, but it would essentially serialize the function, apart from the tests on row/column index equality, at the cost of significant overhead.
- Threads should instead fill local buffers and merge their results afterwards, and reallocation should happen in sufficiently large chunks.
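For the first point, here is a minimal sketch of valid atomic usage (illustrative only, not taken from the code above; the counting loop is a placeholder): the directive protects exactly the one memory update that follows it.

int nonZeroCount = 0;
#pragma omp parallel for
for (int i = 0; i < m1Rows; i++) {
    if (matrix1[i].val != 0.0f) {
        #pragma omp atomic
        nonZeroCount++;   // only this single update is protected
    }
}

The local-buffer approach could look like the following: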
// Make sure this will compile even without OpenMP; include string.h for memcpy
#include <stdlib.h>
#include <string.h>
#ifdef _OPENMP
#include <omp.h>
#define thisThread omp_get_thread_num()
#define nThreads omp_get_num_threads()
#else
#define thisThread 0
#define nThreads 1
#endif

// shared variables
int *copyIndex, *threadNonZero;
struct SparseRow *result;
#pragma omp parallel
{
    // each thread initializes a local buffer and local variables
    int localNonZero = 0;
    int allocatedSize = 1024;
    struct SparseRow *localResult = malloc(allocatedSize * sizeof(*localResult));
    // one thread initializes the shared arrays
    #pragma omp single
    {
        threadNonZero = malloc(nThreads * sizeof(int));
        copyIndex = malloc((nThreads + 1) * sizeof(int));
    }
    #pragma omp for
    for (int i = 0; i < m1Rows; i++) {
        /*
         * same as the initial code, but:
         * realloc an extra 1024 entries whenever localNonZero reaches allocatedSize
         * fill the local buffer and increment the localNonZero counter
         * this is safe, no need for critical / atomic clauses
         */
        for (int j = 0; j < m2Rows; j++) {
            if (matrix1[i].col == matrix2[j].row) {
                if (localNonZero == allocatedSize) {
                    allocatedSize += 1024;   // grow by large chunks
                    localResult = realloc(localResult,
                                          allocatedSize * sizeof(*localResult));
                }
                localResult[localNonZero].row = matrix1[i].row;
                localResult[localNonZero].col = matrix2[j].col;
                localResult[localNonZero].val = matrix1[i].val * matrix2[j].val;
                localNonZero++;
            }
        }
    }
    threadNonZero[thisThread] = localNonZero; // publish this thread's count
    #pragma omp barrier
    // Wrap-up: compute from the per-thread counts where each thread will copy
    // its local buffer, then allocate the output
    #pragma omp single
    {
        copyIndex[0] = 0;
        for (int i = 0; i < nThreads; i++)
            copyIndex[i + 1] = copyIndex[i] + threadNonZero[i];
        result = malloc(copyIndex[nThreads] * sizeof(*result));
    }
    // Copy the results from the local buffers to the global result
    memcpy(&result[copyIndex[thisThread]], localResult,
           localNonZero * sizeof(*localResult));
    // Free memory
    free(localResult);
    #pragma omp barrier   // every thread must be done reading copyIndex
    #pragma omp single
    {
        free(copyIndex);
        free(threadNonZero);
    }
} // end parallel
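Note that copyIndex holds an exclusive prefix sum of the per-thread counts, so after the single block each thread knows at which offset to memcpy its local buffer into result without any further locking.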
Note that this algorithm will generate duplicates: for example, if the first matrix contains values at positions (1,10) and (1,20), and the second matrix contains values at (10,5) and (20,5), there will be two (1,5) entries in the result. At some point, a compaction function that merges the duplicate entries will be needed.
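A possible shape for that compaction step (a sketch only; the function name compactEntries, the comparator, and the in-place merge strategy are choices made here, not part of the answer above): sort the entries by (row, col) with qsort, then sum consecutive duplicates.

#include <stdlib.h>

static int cmpRowCol(const void *a, const void *b) {
    const struct SparseRow *x = a, *y = b;
    if (x->row != y->row) return (x->row > y->row) - (x->row < y->row);
    return (x->col > y->col) - (x->col < y->col);
}

// Sorts entries by (row, col) and merges duplicates in place;
// returns the new number of non-zero entries.
int compactEntries(struct SparseRow *entries, int nnz) {
    if (nnz == 0) return 0;
    qsort(entries, nnz, sizeof(*entries), cmpRowCol);
    int out = 0;
    for (int i = 1; i < nnz; i++) {
        if (entries[i].row == entries[out].row &&
            entries[i].col == entries[out].col) {
            entries[out].val += entries[i].val;   // merge duplicate (row, col)
        } else {
            entries[++out] = entries[i];          // keep as a new entry
        }
    }
    return out + 1;
}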