Question

My program has a lot of 4-byte strings, such as "aaaa", "bbbb", "cccc", ... I need to collect the particular strings that pass a CRC check.

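Because there is only a small chance that a string passes the CRC check, I don't want to use a very large buffer to hold all the results. I would prefer the results to be concatenated one after another, just like the input. For example, if the input is "aaaabbbbcccc" and "bbbb" does not pass the CRC check, the output string should be "aaaacccc" and output_count should be 2.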

The code looks like this:

__device__
bool is_crc_correct(char* str, int len) {
    return true; // for simplicity, just return 'true'
}

// arguments:
// input: a sequence of 4-byte strings, e.g. aaaabbbbccccdddd....
// output: buffer that receives the strings that pass the check
// output_count: number of strings written to output so far
__global__
void func(char* input, char* output, int* output_count) {
    unsigned int index = blockDim.x*blockIdx.x + threadIdx.x;

    if(is_crc_correct(input + 4*index, 4)) {
        // copy the string (not thread-safe: several threads can read the same count)
        memcpy(output + (*output_count)*4,
               input + 4*index,
               4);
        // increase the counter (not thread-safe either)
        (*output_count)++;
    }
}

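Obviously the memory copy is not thread-safe. I know the atomicAdd function can be used for the ++ operation, but how do I make both output and output_count thread-safe?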

Answer 1

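What you are looking for is a lock-free linear allocator. The usual way to do this is to have an atomically incremented accumulator that is used to index into the buffer. For example, in your case, the following should work: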

__device__
char* allocate(char* buffer, int* elements) {
    // Here, the size of the allocated segment is always 4.
    // In a more general use case you would atomicAdd the requested size.
    // atomicAdd returns the value before the increment, so each caller
    // gets a distinct slot.
    return buffer + atomicAdd(elements, 1) * 4;
}

It can then be used like this:

__global__
void func(char* input, char* output, int* output_count) {
    unsigned int index = blockDim.x*blockIdx.x + threadIdx.x;

    if(is_crc_correct(input + 4*index, 4)) {
        // Reserve a 4-byte slot in the output buffer.
        char* dst = allocate(output, output_count);
        memcpy(dst, input + 4 * index, 4);
    }
}

While this is fully thread-safe, it is not guaranteed to preserve the input order. For example, "ccccaaaa" would be a valid output.
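To make the approach above concrete, here is a minimal end-to-end sketch including the host side. Several things in it are assumptions for illustration and not part of the original code: `passes_check` is a stand-in for the real CRC check (it simply rejects strings that start with 'b'), the names `filter_atomic` and `run_filter_atomic` are hypothetical, the kernel takes an extra element count `n` so that surplus threads exit early, and the output buffer is sized for the worst case to keep the example short. Error checking of the CUDA calls is omitted.

```cuda
#include <cuda_runtime.h>
#include <cstring>

// Stand-in for the real CRC check (assumption): reject strings starting with 'b'.
__device__ bool passes_check(const char* str, int len) {
    return len == 4 && str[0] != 'b';
}

// One thread per 4-byte string; passing strings are appended to output.
__global__ void filter_atomic(const char* input, int n, char* output, int* output_count) {
    int index = blockDim.x * blockIdx.x + threadIdx.x;
    if (index >= n) return;  // threads past the end of the input do nothing

    if (passes_check(input + 4 * index, 4)) {
        // atomicAdd returns the previous value, which is this thread's private slot.
        int slot = atomicAdd(output_count, 1);
        memcpy(output + 4 * slot, input + 4 * index, 4);
    }
}

// Hypothetical host helper: filters n strings from h_in into h_out and
// returns how many passed. h_out must have room for 4 * n bytes.
int run_filter_atomic(const char* h_in, int n, char* h_out) {
    if (n <= 0) return 0;

    char* d_in = nullptr;
    char* d_out = nullptr;
    int* d_count = nullptr;
    int count = 0;

    cudaMalloc(&d_in, 4 * n);
    cudaMalloc(&d_out, 4 * n);  // worst case: every string passes
    cudaMalloc(&d_count, sizeof(int));
    cudaMemcpy(d_in, h_in, 4 * n, cudaMemcpyHostToDevice);
    cudaMemset(d_count, 0, sizeof(int));  // the counter must start at zero

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    filter_atomic<<<blocks, threads>>>(d_in, n, d_out, d_count);

    // These copies run after the kernel on the default stream.
    cudaMemcpy(&count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_out, d_out, 4 * count, cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    cudaFree(d_count);
    return count;
}
```

Calling `run_filter_atomic("aaaabbbbcccc", 3, out)` should return 2, with `out` holding the two surviving strings in either order.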

As Drop mentioned in their comment, what you are trying to do is effectively stream compaction (and Thrust most likely already provides what you need).
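As an illustration of the Thrust route, the sketch below uses `thrust::copy_if`, which is a stable stream compaction and therefore keeps the input order. It is only a sketch under assumptions: each 4-byte string is packed into a `std::uint32_t`, the `PassesCheck` functor is a stand-in for the real CRC check (it rejects strings whose first character is 'b', relying on little-endian byte order), and `run_filter_thrust` is a hypothetical helper name.

```cuda
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <cstdint>
#include <cstring>
#include <vector>

// Stand-in for the real CRC check (assumption): reject strings starting with 'b'.
struct PassesCheck {
    __host__ __device__ bool operator()(std::uint32_t word) const {
        // On a little-endian machine the first character is the lowest byte.
        return (word & 0xFFu) != static_cast<std::uint32_t>('b');
    }
};

// Hypothetical host helper: filters n strings from h_in into h_out and
// returns how many passed. h_out must have room for 4 * n bytes.
int run_filter_thrust(const char* h_in, int n, char* h_out) {
    if (n <= 0) return 0;

    // Pack every 4-byte string into one 32-bit word.
    std::vector<std::uint32_t> words(n);
    std::memcpy(words.data(), h_in, 4 * static_cast<std::size_t>(n));

    thrust::device_vector<std::uint32_t> d_in(words.begin(), words.end());
    thrust::device_vector<std::uint32_t> d_out(n);

    // copy_if keeps the elements for which the predicate is true, in input order.
    auto end = thrust::copy_if(d_in.begin(), d_in.end(), d_out.begin(), PassesCheck());
    int count = static_cast<int>(end - d_out.begin());

    // Copy the survivors back to the host and unpack them.
    thrust::copy(d_out.begin(), d_out.begin() + count, words.begin());
    std::memcpy(h_out, words.data(), 4 * static_cast<std::size_t>(count));
    return count;
}
```

Unlike the atomic counter, this keeps the strings in input order, so "aaaabbbbcccc" becomes "aaaacccc".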

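The atomic-counter kernel I posted above could be optimized further by first aggregating the output strings per warp instead of allocating directly in the global buffer. This would reduce contention on the global atomic and would likely lead to better performance. For an explanation of how to do this, I invite you to read the following article: CUDA Pro Tip: Optimized Filtering with Warp-Aggregated Atomics.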
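A rough sketch of the warp-aggregation idea, written with cooperative groups: the threads of a warp that pass the check elect one leader, the leader performs a single `atomicAdd` for the whole group, and every thread derives its own slot from the leader's result. The names `atomic_agg_inc`, `filter_warp_agg` and `run_filter_warp_agg` are hypothetical, `passes_check` is again a stand-in for the real CRC check, and no performance numbers are claimed here.

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>
#include <cstring>

namespace cg = cooperative_groups;

// Stand-in for the real CRC check (assumption): reject strings starting with 'b'.
__device__ bool passes_check(const char* str, int len) {
    return len == 4 && str[0] != 'b';
}

// Reserves one slot per calling thread using a single atomicAdd per group
// of threads that are active at this point.
__device__ int atomic_agg_inc(int* counter) {
    cg::coalesced_group active = cg::coalesced_threads();
    int base = 0;
    if (active.thread_rank() == 0) {
        // The leader reserves slots for the whole group at once.
        base = atomicAdd(counter, static_cast<int>(active.size()));
    }
    // Broadcast the leader's base index, then offset by the rank in the group.
    return active.shfl(base, 0) + static_cast<int>(active.thread_rank());
}

__global__ void filter_warp_agg(const char* input, int n, char* output, int* output_count) {
    int index = blockDim.x * blockIdx.x + threadIdx.x;
    if (index >= n) return;

    if (passes_check(input + 4 * index, 4)) {
        int slot = atomic_agg_inc(output_count);
        memcpy(output + 4 * slot, input + 4 * index, 4);
    }
}

// Hypothetical host helper; h_out must have room for 4 * n bytes.
int run_filter_warp_agg(const char* h_in, int n, char* h_out) {
    if (n <= 0) return 0;

    char* d_in = nullptr;
    char* d_out = nullptr;
    int* d_count = nullptr;
    int count = 0;

    cudaMalloc(&d_in, 4 * n);
    cudaMalloc(&d_out, 4 * n);
    cudaMalloc(&d_count, sizeof(int));
    cudaMemcpy(d_in, h_in, 4 * n, cudaMemcpyHostToDevice);
    cudaMemset(d_count, 0, sizeof(int));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    filter_warp_agg<<<blocks, threads>>>(d_in, n, d_out, d_count);

    cudaMemcpy(&count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_out, d_out, 4 * count, cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    cudaFree(d_count);
    return count;
}
```

Order is preserved within one group of active threads, but different warps can still reach the counter in any order, so the overall output order remains unspecified.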

Answer 2

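I am somewhat hesitant to suggest this, but how about dynamically allocating memory inside the kernel? See this question/answer for an example: CUDA allocate memory in __device__ function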

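You would then pass an array shared by all threads into the kernel (it has to live in global memory to remain visible to a later kernel). After the kernel has run, each element of the array either points to a piece of dynamically allocated memory or is NULL. So once your thread blocks have finished, you run a final cleanup kernel on a single thread to build the final string.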
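A minimal sketch of how this could look, using the in-kernel `malloc` and `free` that CUDA provides on the device heap. The kernel names `filter_malloc` and `gather`, the helper `run_filter_malloc`, and the stand-in predicate `passes_check` are all assumptions for illustration. Note that the pointer array needs one pointer per input string (8 bytes on a 64-bit platform), which is larger than the 4-byte strings themselves, and that the device heap has a limited default size.

```cuda
#include <cuda_runtime.h>
#include <cstring>

// Stand-in for the real CRC check (assumption): reject strings starting with 'b'.
__device__ bool passes_check(const char* str, int len) {
    return len == 4 && str[0] != 'b';
}

// Each passing thread allocates its own chunk on the device heap.
// slots[index] ends up pointing to that chunk, or is NULL otherwise.
__global__ void filter_malloc(const char* input, int n, char** slots) {
    int index = blockDim.x * blockIdx.x + threadIdx.x;
    if (index >= n) return;

    char* chunk = nullptr;
    if (passes_check(input + 4 * index, 4)) {
        chunk = static_cast<char*>(malloc(4));
        // If the device heap is exhausted, malloc returns NULL and the
        // string is silently dropped in this sketch.
        if (chunk != nullptr) memcpy(chunk, input + 4 * index, 4);
    }
    slots[index] = chunk;
}

// Cleanup kernel, meant to be launched with a single thread: walks the
// slots in order, concatenates the chunks and frees them.
__global__ void gather(char** slots, int n, char* output, int* output_count) {
    int count = 0;
    for (int i = 0; i < n; ++i) {
        if (slots[i] != nullptr) {
            memcpy(output + 4 * count, slots[i], 4);
            free(slots[i]);
            ++count;
        }
    }
    *output_count = count;
}

// Hypothetical host helper; h_out must have room for 4 * n bytes.
int run_filter_malloc(const char* h_in, int n, char* h_out) {
    if (n <= 0) return 0;

    char* d_in = nullptr;
    char* d_out = nullptr;
    char** d_slots = nullptr;
    int* d_count = nullptr;
    int count = 0;

    cudaMalloc(&d_in, 4 * n);
    cudaMalloc(&d_out, 4 * n);
    cudaMalloc(&d_slots, n * sizeof(char*));
    cudaMalloc(&d_count, sizeof(int));
    cudaMemcpy(d_in, h_in, 4 * n, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    filter_malloc<<<blocks, threads>>>(d_in, n, d_slots);
    gather<<<1, 1>>>(d_slots, n, d_out, d_count);  // runs after filter_malloc

    cudaMemcpy(&count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_out, d_out, 4 * count, cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    cudaFree(d_slots);
    cudaFree(d_count);
    return count;
}
```

Because the cleanup kernel walks the slots sequentially, the input order is preserved, at the cost of a serial pass over all n slots.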

Synchronizing multiple variables in CUDA
