OpenMP seg fault with a private malloc

Time: 2011-12-06 01:58:25

Tags: segmentation-fault malloc openmp

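I'm running into an interesting seg fault when I try to move a malloc call inside an OpenMP for loop. Each thread has to compute its own copy of the distances vector for the classification to be correct, so the vector must be private... but as soon as I run it with more than 1 thread, it seg faults. This does not happen if the p_distances vector is declared shared, although that gives inaccurate distance calculations because the threads overwrite each other. I'm probably breaking some very obvious rule here... Also, I know there are other poor coding practices in my code; I'm always open to suggestions about style, but please help me focus on what is actually causing the problem.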

int *labels_train;
float *data_train;
int *labels_test;
float *data_test;
float *s_distances;
int *s_results, *p_results;
int i, j, k, h;
int N, D, K, M, thread_count;

void sort(float *_distances, int *_labels_train,  int _N);

void computeParallelKNN()
{
// this is the target loop for multi-point parallelization
// seg fault here whenever p_distances malloc is moved inside parallel for loop and declared private
#pragma omp parallel for num_threads(thread_count) private(h, j, i)
for (i = 0; i < M; i++)
{
    float *p_distances = (float*)malloc(N * sizeof(float));
    k = 0;

    // This is the target loop for single point parallelization
    // No dependencies on outer loop (each thread can calculate distance for current point with some
    // different training point)
    for (h = 0; h < N*D; h+=D)
    {
        float dTmp = 0;
        // Reduction operation..no dependencies here either (I don't think?)
        // dTmp is critical variable for parallel operations
        for (j = 0; j < D; j++)
        {
            dTmp += pow(data_test[i*D+j] - data_train[h+j],2);
        }
        p_distances[k] = (float)sqrt((double)dTmp);
        k++;
    }

    // Make a copy of labels (since sort will invalidate original data/labels correlation)
    int *temp_labels;
    temp_labels = (int*)malloc(N * sizeof(int));
    for (h = 0; h < N; h++)
        temp_labels[h] = labels_train[h];

    // Sort distances/labels_train vector
    sort(p_distances, temp_labels, N);

    // Calculate/print KNN classification
    int neg = 0;
    int pos = 0;
    for (h = 0; h < K; h++)
    {
        if(temp_labels[h] == -1) neg++;
        else pos++;
    }
    if (pos > neg) p_results[i] = 1;
    else p_results[i] = -1;

    free(p_distances);
}
}

// Insertion sort algorithm modified to sort labels according to distance data
void sort(float *_distances, int *_labels_train,  int _N) 
{
  int k;
  for (k = 1; k < _N; ++k) 
  {
    float dist_key = _distances[k];
    int label_key = _labels_train[k];
    int i = k - 1;
    while ((i >= 0) && (dist_key < _distances[i])) 
    {
        _distances[i + 1] = _distances[i];
        _labels_train[i + 1] = _labels_train[i];
         --i;
    }
    _distances[i + 1] = dist_key;
    _labels_train[i + 1] = label_key;
  }
}

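I can post the full code, but this is definitely the region where the fault occurs. Thanks in advance; hopefully this is just a silly mistake on my part.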

1 Answer:

Answer 0: (score: 2)

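First, k is shared by all threads: it is not listed in the private clause, and nothing puts a critical section around it or makes its updates atomic. With several threads resetting and incrementing the same k at once, it can run past N, so the write to p_distances[k] can land outside the allocated buffer, which would explain the seg fault.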
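To illustrate the point about k, here is a minimal sketch of the distance loop with the counter declared inside the parallel loop body, so each iteration gets its own copy. The function name squared_distances, its parameters, and the m-by-n output layout are assumptions made for illustration and are not from the original post. It also stores squared distances rather than calling sqrt, a simplification that leaves the nearest-neighbor ordering unchanged.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the original post): writes the squared
   Euclidean distance between test point i and training point k into
   out[i * n + k]. All state is passed in; nothing is global. */
void squared_distances(const float *train, const float *test, float *out,
                       int n, int d, int m)
{
    #pragma omp parallel for
    for (int i = 0; i < m; i++) {
        int k = 0;  /* declared inside the loop body, so it is private */
        for (int h = 0; h < n * d; h += d) {
            float sum = 0.0f;
            for (int j = 0; j < d; j++) {
                float diff = test[i * d + j] - train[h + j];
                sum += diff * diff;
            }
            assert(k < n);  /* the index can no longer run past the row */
            out[(size_t)i * n + k] = sum;
            k++;
        }
    }
}
```

Because k now lives in the loop body, no private clause is needed for it, and the same holds for h and j. The code also compiles and gives the same result without OpenMP enabled, since the pragma is then simply ignored.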

Rewrite the code in a cleaner way and avoid global variables wherever possible. You can define variables right where you enter a new scope.

For example,

int i;
void foo() {
#pragma omp parallel private(i)
{
  // ...
}
}

is the same as:
void foo() {
#pragma omp parallel
{
  int i;
  // ...
}
}
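Applied to the loop from the question, the same idea might look like the sketch below. This is one possible rewrite under stated assumptions, not the asker's final code: the names knn_classify and sort_by_distance and the parameter list are hypothetical, n is assumed to be at least 1 with k_neighbors <= n, and squared distances replace sqrt because the ranking of neighbors is the same either way.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Insertion sort on dist[], applying the same moves to labels[]. */
static void sort_by_distance(float *dist, int *labels, int n)
{
    for (int k = 1; k < n; ++k) {
        float dist_key = dist[k];
        int label_key = labels[k];
        int i = k - 1;
        while (i >= 0 && dist_key < dist[i]) {
            dist[i + 1] = dist[i];
            labels[i + 1] = labels[i];
            --i;
        }
        dist[i + 1] = dist_key;
        labels[i + 1] = label_key;
    }
}

/* Hypothetical signature: classifies m test points against n training
   points of dimension d. Writes +1 or -1 to results[i], or 0 if an
   allocation failed for that point. */
void knn_classify(const float *data_train, const int *labels_train,
                  const float *data_test, int *results,
                  int n, int d, int k_neighbors, int m)
{
    assert(k_neighbors <= n);
    #pragma omp parallel for
    for (int i = 0; i < m; i++) {
        /* Every variable below is declared inside the loop body, so each
           iteration (and therefore each thread) has its own copy. */
        float *dist = malloc((size_t)n * sizeof *dist);
        int *labels = malloc((size_t)n * sizeof *labels);
        if (dist == NULL || labels == NULL) {
            free(dist);
            free(labels);
            results[i] = 0;
            continue;
        }
        for (int t = 0; t < n; t++) {
            float sum = 0.0f;
            for (int j = 0; j < d; j++) {
                float diff = data_test[(size_t)i * d + j]
                           - data_train[(size_t)t * d + j];
                sum += diff * diff;
            }
            dist[t] = sum;  /* indexed by t directly, no shared counter */
        }
        /* Sorting reorders the labels, so work on a private copy. */
        memcpy(labels, labels_train, (size_t)n * sizeof *labels);
        sort_by_distance(dist, labels, n);

        int pos = 0, neg = 0;
        for (int t = 0; t < k_neighbors; t++) {
            if (labels[t] == -1) neg++;
            else pos++;
        }
        results[i] = (pos > neg) ? 1 : -1;
        free(dist);
        free(labels);
    }
}
```

With everything scoped inside the loop there is nothing left to list in a private clause, and each iteration writes only to its own results[i], so no critical section is needed. Both scratch buffers are freed before the iteration ends.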