Question

I'm trying to use CUB's segmented-reduction sum primitive, and I'm stuck on it.

Here is my code:

int main() {


     const int N = 7;
     const int num_segments  = 3;
     int d_offsets[]= {0,3,3,7};


    int *h_data       = (int *)malloc(N * sizeof(int));
    int *h_result = (int *)malloc(num_segments * sizeof(int));


    for (int i=0; i<N; i++) {
        h_data[i] = 3;

    }


    int *d_data;
    cudaMalloc((int**)&d_data, N * sizeof(int));
    cudaMemcpy(d_data, h_data, N * sizeof(int), cudaMemcpyHostToDevice);


    int           *d_result;
    cudaMalloc((int**)&d_result, num_segments * sizeof(int));

    void            *d_temp_storage = NULL;
    size_t          temp_storage_bytes = 0;


    cudaMalloc((void**)&d_temp_storage, temp_storage_bytes);


    cub::DeviceSegmentedReduce::Sum(d_temp_storage, temp_storage_bytes, d_data, d_result,
        num_segments, d_offsets, d_offsets + 1);


    cudaMemcpy(h_result, d_result, num_segments*sizeof(int), cudaMemcpyDeviceToHost);




    printf("Results:\n");

   for (int i=0; i<num_segments; i++) {
        printf("CUB: %d\n", h_result[i]);

    }


}

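But this is the output I get:

Results:
CUB: 0
CUB: 0
CUB: 0

I can't figure out what exactly the problem is. In my real use case I have a very large array with 400 segments. Can I optimize the code so that I don't need to declare and allocate memory for d_offsets?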

Answer 1

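You haven't really made a serious attempt to debug your code:

You were missing the memory allocation for d_results (which you've since fixed).
You are passing a host memory address in d_offsets where a device memory address is required. Of course this causes a CUDA runtime error, but
you are not checking for runtime errors.
You only call the CUB function once, although you have to run it twice for it to actually do anything: once with nullptr as the temporary storage, to obtain the required temporary storage size, and then again with actual scratch space to work in. It's an annoying API, but that's how it works.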
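To make the last point concrete, here is a minimal host-only sketch of that calling convention. Note that mock_segmented_sum and run_two_phase are hypothetical names made up for illustration; they are not part of CUB and run entirely on the CPU. The only thing the mock imitates is the rule that a null scratch pointer means "report the size and do nothing else", which is presumably what happened in the question, where d_temp_storage was still NULL at the time of the single call.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a CUB device-wide call; NOT part of CUB.
// It only imitates the convention: null scratch pointer => size query only.
void mock_segmented_sum(void* d_temp_storage, std::size_t& temp_storage_bytes,
                        const int* in, int* out, int num_segments,
                        const int* begin_offsets, const int* end_offsets) {
    if (d_temp_storage == nullptr) {
        // Phase 1: report how much scratch space is wanted; `out` is untouched.
        // The size itself is an arbitrary choice for this mock.
        temp_storage_bytes = static_cast<std::size_t>(num_segments) * sizeof(int);
        return;
    }
    // Phase 2: scratch space was supplied, so do the actual reduction.
    for (int s = 0; s < num_segments; ++s) {
        int sum = 0;
        for (int i = begin_offsets[s]; i < end_offsets[s]; ++i) sum += in[i];
        out[s] = sum;
    }
}

// Runs both phases in order and returns the per-segment sums.
std::vector<int> run_two_phase(const std::vector<int>& data,
                               const std::vector<int>& offsets) {
    assert(!offsets.empty());
    const int num_segments = static_cast<int>(offsets.size()) - 1;
    std::vector<int> out(num_segments, -1);  // -1 marks "not written yet"

    std::size_t bytes = 0;
    mock_segmented_sum(nullptr, bytes, data.data(), out.data(), num_segments,
                       offsets.data(), offsets.data() + 1);
    // After phase 1 only `bytes` has changed; `out` still holds -1 everywhere.

    std::vector<char> scratch(bytes);
    mock_segmented_sum(scratch.data(), bytes, data.data(), out.data(),
                       num_segments, offsets.data(), offsets.data() + 1);
    return out;
}
```

Stopping after the first call leaves the output buffer exactly as it was, so whatever gets printed afterwards is just the buffer's previous contents.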

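It's not appropriate to take up the SO community's time when you haven't taken the time to debug your code yourself.

That said, there is something you can do to avoid having to check for errors yourself: use some kind of library that does it for you (e.g. by throwing an exception on error). If you do that, for example with my CUDA Runtime API wrappers (sorry for the self-plug), and properly allocate memory for everything you need, you end up with something like this: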

#include <cub/cub.cuh>
#include <cuda/api_wrappers.h>
#include <algorithm> // std::fill
#include <iostream>  // std::cout
#include <vector>
#include <cstdlib>

int main() {

    const int N = 7;
    const int num_segments  = 3;
    auto h_offsets = std::vector<int> {0,3,3,7};

    auto h_data = std::vector<int>(N);
    auto h_results = std::vector<int>(num_segments);

    std::fill(h_data.begin(), h_data.end(), 3);

    auto current_device = cuda::device::current::get();
    auto d_offsets = cuda::memory::device::make_unique<int[]>(
        current_device, h_offsets.size());
    auto d_data = cuda::memory::device::make_unique<int[]>(
        current_device, N);
    cuda::memory::copy(
        d_offsets.get(), &h_offsets[0], h_offsets.size() * sizeof(int));
    cuda::memory::copy(
        d_data.get(),  &h_data[0], h_data.size() * sizeof(int));
    auto d_results = cuda::memory::device::make_unique<int[]>(
        current_device, num_segments);

    auto d_start_offsets = d_offsets.get();
    auto d_end_offsets = d_start_offsets + 1; // aliasing, see CUB documentation

    size_t temp_storage_bytes = 0;

    // This call merely obtains a value for temp_storage_bytes, passed here
    // as a non-const reference; other arguments are unused
    cub::DeviceSegmentedReduce::Sum(
        nullptr, temp_storage_bytes, d_data.get(), d_results.get(),
        num_segments, d_start_offsets, d_end_offsets);

    auto d_temp_storage = cuda::memory::device::make_unique<char[]>(
        current_device, temp_storage_bytes);

    cub::DeviceSegmentedReduce::Sum(
        d_temp_storage.get(), temp_storage_bytes, d_data.get(), 
        d_results.get(), num_segments, d_start_offsets, d_end_offsets);

    cuda::memory::copy(
        &h_results[0], d_results.get(), num_segments * sizeof(int));

    std::cout << "Results:\n";

    for (int i=0; i<num_segments; i++) {
        std::cout << "Segment " << i << " data sums up to " << h_results[i] << "\n";
    }

    return EXIT_SUCCESS;
}

This works:

Results:
Segment 0 data sums up to 9
Segment 1 data sums up to 0
Segment 2 data sums up to 12
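If the zero for segment 1 looks suspicious, it helps to work out the segment lengths by hand. With the aliased offsets, segment s covers the half-open range [offsets[s], offsets[s+1]). The helper below is a small host-only sketch for checking this; segment_lengths is a hypothetical name, not a CUB function, and the sketch assumes non-decreasing offsets.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Length of each segment when the begin offsets are offsets[0..n-1] and the
// end offsets are the same array shifted by one (the aliasing used above).
// Assumption of this sketch: offsets are non-decreasing.
std::vector<int> segment_lengths(const std::vector<int>& offsets) {
    assert(!offsets.empty());
    std::vector<int> lengths;
    for (std::size_t s = 0; s + 1 < offsets.size(); ++s) {
        assert(offsets[s] <= offsets[s + 1]);
        lengths.push_back(offsets[s + 1] - offsets[s]);
    }
    return lengths;
}
```

For the offsets {0,3,3,7} this gives lengths 3, 0 and 4. Since every input element is 3, the expected sums are 3*3 = 9, 0 and 4*3 = 12, which matches the output above; segment 1 is simply empty.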

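Other tips:

Always investigate compiler warnings.
Use cuda-memcheck to catch memory leaks, initialization on the wrong (device/host) side, and so on.
If you use the CUDA Runtime API directly, you must check every call for errors.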

CUB segmented reduction not producing results
