Question

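I know how to intersect two sorted lists on the CPU with an O(n + m) algorithm, where n and m are the lengths of the two lists. But is there a good algorithm for intersecting two lists on the GPU that avoids write conflicts? I am worried that, during the intersection, two threads may try to write to the same location in the output buffer and cause a conflict. I am not looking for a library. I would like to know the basic idea, plus some code if possible.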

Answer 1

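I know you may not want to tie your code to a library. However, I think Thrust has an algorithm that does exactly what you are asking for, provided you handle your lists as plain C arrays.

Take a look at thrust::merge here: http://wiki.thrust.googlecode.com/hg/html/group__merging.html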

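- Edit -

Having thought about your question some more, intersecting two lists directly on the GPU seems hard to write in CUDA.

The following snippet is another solution (a merge of my previous solution and @jmsu's). It is an example of how to intersect two lists of integers stored in decreasing order. The lists are stored in device memory, but the computation cannot be performed inside a kernel. So, if possible, you need to use it between kernel calls: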

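#include <thrust/device_ptr.h>
#include <thrust/functional.h>
#include <thrust/set_operations.h>
#include <algorithm>
#include <iostream>
#include <iterator>

int main() {

    int A_host[] = {11, 9, 5, 3};
    int B_host[] = {14, 12, 10, 5, 1};
    const int sizeA = 4;
    const int sizeB = 5;
    // the intersection can never hold more elements than the shorter list
    const int sizeC = (sizeA < sizeB) ? sizeA : sizeB;

    int C_host[sizeC];

    int* A_device;
    int* B_device;
    int* C_device;

    cudaMalloc( (void**) &A_device, sizeof(int) * sizeA);
    cudaMalloc( (void**) &B_device, sizeof(int) * sizeB);
    cudaMalloc( (void**) &C_device, sizeof(int) * sizeC);

    cudaMemcpy( A_device, A_host, sizeof(int) * sizeA, cudaMemcpyHostToDevice);
    cudaMemcpy( B_device, B_host, sizeof(int) * sizeB, cudaMemcpyHostToDevice);
    cudaMemset( C_device, 0, sizeof(int) * sizeC);

    // add an alias to thrust::device_ptr<int> to be more readable
    typedef thrust::device_ptr<int> ptrI;

    // the lists are in decreasing order, so pass thrust::greater as the comparator;
    // set_intersection returns the end of the range it actually wrote
    ptrI C_end = thrust::set_intersection(ptrI(A_device), ptrI(A_device + sizeA), ptrI(B_device), ptrI(B_device + sizeB), ptrI(C_device), thrust::greater<int>() );
    int numC = C_end - ptrI(C_device);

    // copy back and print only the elements that belong to the intersection
    cudaMemcpy( C_host, C_device, sizeof(int) * numC, cudaMemcpyDeviceToHost);
    std::copy(C_host, C_host + numC, std::ostream_iterator<int> (std::cout, " ") );
    std::cout << std::endl;

    cudaFree(A_device);
    cudaFree(B_device);
    cudaFree(C_device);
}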

Answer 2

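How about doing something like this: if the value A[i] exists in array B, store it in C[i]; otherwise set C[i] := DUMMY.

Then perform a parallel array compaction? There are tools that do this; for example, check here: a library and a paper describing the algorithm used.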
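
As a rough illustration of those two steps, here is a minimal sketch in plain C++ that emulates what each GPU thread would do. It is not taken from the library or paper mentioned above. It assumes both lists are sorted in ascending order and that A contains no duplicates, and the function names (flag_matches, exclusive_prefix_sum, intersect_sorted) are made up for this example. It uses a flag array plus a prefix sum in place of the DUMMY marker, which is one common way to do the compaction. The loops run sequentially here; on a GPU the first and last loops would become kernels with i as the thread index, and the prefix sum would be a parallel scan.

```cpp
#include <cstddef>
#include <vector>

// Step 1: one logical thread per element of a. Thread i binary-searches b
// for a[i] and writes only flags[i], so no two threads share an output slot.
// Assumption: a and b are sorted ascending.
std::vector<int> flag_matches(const std::vector<int>& a, const std::vector<int>& b) {
    std::vector<int> flags(a.size(), 0);
    for (std::size_t i = 0; i < a.size(); ++i) {  // on a GPU: i = global thread index
        std::size_t lo = 0, hi = b.size();
        while (lo < hi) {                         // lower-bound binary search
            std::size_t mid = lo + (hi - lo) / 2;
            if (b[mid] < a[i]) lo = mid + 1; else hi = mid;
        }
        flags[i] = (lo < b.size() && b[lo] == a[i]) ? 1 : 0;
    }
    return flags;
}

// Step 2: exclusive prefix sum of the flags. pos[i] is the output index of
// a[i] if it survives. Sequential here; a GPU version would use a parallel scan.
std::vector<std::size_t> exclusive_prefix_sum(const std::vector<int>& flags) {
    std::vector<std::size_t> pos(flags.size(), 0);
    std::size_t running = 0;
    for (std::size_t i = 0; i < flags.size(); ++i) {
        pos[i] = running;
        running += static_cast<std::size_t>(flags[i]);
    }
    return pos;
}

// Step 3: scatter. Each flagged element has a distinct pos[i], so the
// writes to out never collide.
std::vector<int> intersect_sorted(const std::vector<int>& a, const std::vector<int>& b) {
    std::vector<int> flags = flag_matches(a, b);
    std::vector<std::size_t> pos = exclusive_prefix_sum(flags);
    std::size_t total = flags.empty() ? 0 : pos.back() + static_cast<std::size_t>(flags.back());
    std::vector<int> out(total);
    for (std::size_t i = 0; i < a.size(); ++i) {  // on a GPU: i = global thread index
        if (flags[i]) out[pos[i]] = a[i];
    }
    return out;
}
```

Every write goes to a slot indexed either by the thread's own i or by its own scan result, which is why no two threads can write to the same location. The price is O(n log m) work for the searches instead of O(n + m), so it is probably worth running the searches over the shorter list.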

Answer 3

Following the Thrust idea: as @jHackTheRipper said, you don't need the whole of Thrust; you can use only the parts you need.

To do this with Thrust you can use thrust::set_intersection

http://wiki.thrust.googlecode.com/hg/html/group_set_operations.html#ga17277fec1491c8a916c9908a5ae40807

Sample code from the Thrust documentation:

#include <thrust/set_operations.h>
...
int A1[6] = {1, 3, 5, 7, 9, 11};
int A2[7] = {1, 1, 2, 3, 5,  8, 13};

int result[7];

int *result_end = thrust::set_intersection(A1, A1 + 6, A2, A2 + 7, result);
// result is now {1, 3, 5}

To run it on the GPU you can copy the arrays to device memory and use thrust::device_ptr, or, better, use thrust::device_vector. These are compatible with STL vectors.

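#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/set_operations.h>
#include <algorithm>
...
thrust::host_vector<int> h_list1;
thrust::host_vector<int> h_list2;
// insert code to populate the lists...

thrust::device_vector<int> d_list1 = h_list1; // copy list1 from host to device
thrust::device_vector<int> d_list2 = h_list2; // copy list2 from host to device

// the output must be allocated up front; the intersection is at most as long as the shorter list
thrust::device_vector<int> d_result(std::min(d_list1.size(), d_list2.size()));

// set_intersection returns the end of the range it actually wrote
thrust::device_vector<int>::iterator d_result_end =
    thrust::set_intersection(d_list1.begin(), d_list1.end(), d_list2.begin(), d_list2.end(), d_result.begin());
d_result.resize(d_result_end - d_result.begin()); // drop the unused tail

thrust::host_vector<int> h_result = d_result; // copy result from device to host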

I have not tested the code, but it should be close to this. The Thrust website has good examples to help you get started.

Sorted list intersection on the GPU

3 answers: