Question

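The code below is a C-Python extension. It takes an input buffer of contiguous raw bytes (for my application, "blocks" of raw data, where 1 block = 128 bytes) and processes those bytes into 2-byte "samples", storing the results in items. The structure returned is simply that buffer converted to Python integers.

These are the 2 main steps:

unpack_block(items, items_offset, buffer, buffer_offset, samples_per_block, sample_bits);

followed by a loop over every sample in items that converts each one to a Python int:

PyList_SET_ITEM(result, index, PyInt_FromLong(items[index]));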

    unsigned int num_blocks_per_thread, num_samples_per_thread, num_bytes_per_thread;
    unsigned int thread_id, p;
    unsigned int n_threads, start_index_bytes, start_index_blocks, start_index_samples;

    items = malloc(num_samples*sizeof(unsigned long));
    assert(items);

    #pragma omp parallel\
    default(none)\
    private(num_blocks_per_thread, num_samples_per_thread, num_bytes_per_thread, d, j, thread_id, n_threads, start_index_bytes, start_index_blocks, start_index_samples)\
    shared(samples_per_block, num_blocks, buffer, bytes_per_block, sample_bits, result, num_samples, items)
      {

        /* Split the work evenly; any remainder blocks (num_blocks % n_threads) are not handled here. */
        n_threads = omp_get_num_threads();
        num_blocks_per_thread = num_blocks/n_threads;
        num_samples_per_thread = num_samples/n_threads;
        num_bytes_per_thread = num_blocks_per_thread*bytes_per_block;

        /* Starting offsets of this thread's slice, in bytes, blocks and samples. */
        thread_id = omp_get_thread_num();
        start_index_bytes = num_bytes_per_thread*thread_id;
        start_index_blocks = num_blocks_per_thread*thread_id;
        start_index_samples = num_samples_per_thread*thread_id;

        /* The buffer offset is a byte offset, so it starts from start_index_bytes. */
        for (d=0; d<num_blocks_per_thread; d++) {
          unpack_block(items, start_index_samples+d*samples_per_block, buffer, start_index_bytes + d*bytes_per_block, samples_per_block, sample_bits);
        }

      }

     result = PyList_New(num_samples);
     assert(result);

     //*THIS WOULD ALSO SEEM RIPE FOR MULTITHREADING*
     for (p=0; p<num_samples; p++) {
        PyList_SET_ITEM(result, p, PyInt_FromLong( items[p] ));
      }

    free(items);
    free(buffer);

  return result;
}

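The speedup is simply abysmal, far below what I expected from multithreading. I may be running into a false-sharing problem, with threads writing to different chunks of the items array, even though each thread only handles its own mutually exclusive chunk of that array.

For me, the basic question is: how do I correctly multithread the processing of every element of a single array and write each element's result into a second "result" array? I do this twice, with the two steps above.

Any ideas, solutions or optimizations would be great. Thanks!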

Answer 1

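You already mentioned false sharing. To avoid it, you have to allocate the memory accordingly (using posix_memalign or another aligned allocation function) and choose the chunk size so that the data of one chunk is an exact multiple of the cache-line size.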
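As a minimal sketch of that advice, the helpers below allocate the items array on a cache-line boundary and round a per-thread chunk up to whole cache lines. The names alloc_items_aligned, round_chunk_to_cache_line and CACHE_LINE_BYTES are hypothetical, and the 64-byte line size is an assumption that should be checked for the target CPU.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* makes posix_memalign visible under strict -std= modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed cache-line size; 64 bytes is typical on x86-64 but not universal. */
#define CACHE_LINE_BYTES 64

/* Allocate num_samples unsigned longs starting on a cache-line boundary.
   Returns NULL on failure; release the memory with free(). */
static unsigned long *alloc_items_aligned(size_t num_samples)
{
    void *p = NULL;
    if (posix_memalign(&p, CACHE_LINE_BYTES,
                       num_samples * sizeof(unsigned long)) != 0)
        return NULL;
    return p;
}

/* Round a per-thread sample count up so that each thread's slice of the
   items array covers a whole number of cache lines. */
static size_t round_chunk_to_cache_line(size_t samples_per_thread)
{
    const size_t per_line = CACHE_LINE_BYTES / sizeof(unsigned long);
    return (samples_per_thread + per_line - 1) / per_line * per_line;
}
```

If a 128-byte block really holds 64 two-byte samples and unsigned long is 8 bytes, one unpacked block already fills 512 bytes, a multiple of 64. In that case, handing out whole blocks from an aligned array should keep two threads from ever writing to the same cache line.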

In general, measure the execution time with $N$ threads and compute the speedup. Could you share the speedup curve with us?
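For reference, the speedup is usually taken as $S(N) = T(1)/T(N)$ and the parallel efficiency as $E(N) = S(N)/N$. The small helpers below are only a sketch with hypothetical names; the wall-clock times would come from your own measurements, for example by calling omp_get_wtime() around the parallel region.

```c
#include <assert.h>
#include <stddef.h>

/* Speedup S(N) = T(1) / T(N): how many times faster the N-thread run is
   compared with the single-thread run. */
static double speedup(double t_one_thread, double t_n_threads)
{
    return t_one_thread / t_n_threads;
}

/* Parallel efficiency E(N) = S(N) / N; a value of 1.0 means perfect scaling. */
static double efficiency(double t_one_thread, double t_n_threads, int n_threads)
{
    return speedup(t_one_thread, t_n_threads) / (double)n_threads;
}
```

Plotting speedup against the thread count gives the speedup curve; a curve that flattens early points to overhead or memory bandwidth rather than to the computation itself.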

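Regarding the comment "this would also seem ripe for multithreading": expectations are often too high (this is just a word of caution, to avoid disappointment). Consider how many threads you use, how many elements each thread handles, and how much work each thread has (that is, how much computation each item needs). The workload may be so small that the OpenMP overhead dominates. Also, how many instructions are executed per memory load? Many instructions per memory load usually indicate a reasonable candidate for parallelization; a low ratio indicates that the program is memory-bound.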
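One way to act on the overhead concern is to let OpenMP distribute the blocks itself and to skip the parallel region when there is too little work. The sketch below uses `parallel for` with `schedule(static)` and an `if` clause. unpack_block_sketch is a stand-in for the question's unpack_block (it assumes little-endian 2-byte samples), and the MIN_BLOCKS_FOR_PARALLEL cutoff is a hypothetical value to be tuned by measurement.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cutoff: below this many blocks the loop runs serially. */
#define MIN_BLOCKS_FOR_PARALLEL 1024

/* Stand-in for the question's unpack_block(): decodes little-endian
   2-byte samples from one block into the items array. */
static void unpack_block_sketch(unsigned long *items, size_t items_offset,
                                const unsigned char *buffer, size_t buffer_offset,
                                size_t samples_per_block)
{
    for (size_t s = 0; s < samples_per_block; s++) {
        const unsigned char *src = buffer + buffer_offset + 2 * s;
        items[items_offset + s] =
            (unsigned long)src[0] | ((unsigned long)src[1] << 8);
    }
}

/* Unpack every block. Each iteration writes only its own slice of items,
   so no synchronization is needed between iterations. */
static void unpack_all(unsigned long *items, const unsigned char *buffer,
                       size_t num_blocks, size_t samples_per_block)
{
    const size_t bytes_per_block = 2 * samples_per_block;
    long d;  /* signed loop variable, accepted by every OpenMP version */

    #pragma omp parallel for schedule(static) if(num_blocks >= MIN_BLOCKS_FOR_PARALLEL)
    for (d = 0; d < (long)num_blocks; d++) {
        unpack_block_sketch(items, (size_t)d * samples_per_block,
                            buffer, (size_t)d * bytes_per_block,
                            samples_per_block);
    }
}
```

With a static schedule each thread receives one contiguous range of blocks, and the worksharing loop also covers the case where num_blocks is not divisible by the number of threads. Compiled without OpenMP support, the pragma is ignored and the loop simply runs serially.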

Speaking of memory access, are you on a multi-socket system with separate NUMA domains? If so, you will have to deal with affinity issues.

Multithreaded array processing, then writing to a result array, for a C-Python extension
