Question

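I have been working on a program that builds a hash table in global memory. The code is fully functional (albeit slow) on a GTS250, which is a Compute 1.1 device. However, on a Compute 2.0 device (C2050 or C2070) the hash table is corrupted: the data is incorrect and the pointers are sometimes wrong.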

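Basically, the code works fine on both devices when only one block is used. When 2 or more blocks are used, however, it works only on the GTS250 and not on any Fermi device.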

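I understand that warp scheduling and the memory architecture differ between the two platforms, and I took that into account when developing the code. From my understanding, using __threadfence() should ensure that any global writes are committed and visible to other blocks. Judging from the corrupted hash table, however, it appears that they are not.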

I have also posted this problem on the NVIDIA CUDA developer forum; it can be found here.

Relevant code below:

__device__ void lock(int *mutex) {
    while(atomicCAS(mutex, 0, 1) != 0);
}

__device__ void unlock(int *mutex) {
    atomicExch(mutex, 0);
}

__device__ void add_to_global_hash_table(unsigned int key, unsigned int count, unsigned int sum, unsigned int sumSquared, Table table, int *globalHashLocks, int *globalFreeLock, int *globalFirstFree)
{
    // Find entry if it exists
    unsigned int hashValue = hash(key, table.count);

    lock(&globalHashLocks[hashValue]);

    int bucketHead = table.entries[hashValue];
    int currentLocation = bucketHead;

    bool found = false;
    Entry currentEntry;

    while (currentLocation != -1 && !found) {
        currentEntry = table.pool[currentLocation];
        if (currentEntry.data.x == key) {
            found = true;
        } else {
            currentLocation = currentEntry.next;
        }
    }

    if (currentLocation == -1) {
        // If entry does not exist, create entry
        lock(globalFreeLock);
        int newLocation = (*globalFirstFree)++;
        __threadfence();
        unlock(globalFreeLock);

        Entry newEntry;
        newEntry.data.x = key;
        newEntry.data.y = count;
        newEntry.data.z = sum;
        newEntry.data.w = sumSquared;
        newEntry.next = bucketHead;

        // Add entry to table
        table.pool[newLocation] = newEntry;
        table.entries[hashValue] = newLocation;
    } else {
        currentEntry.data.y += count;
        currentEntry.data.z += sum;
        currentEntry.data.w += sumSquared;
        table.pool[currentLocation] = currentEntry;
    }

    __threadfence();
    unlock(&globalHashLocks[hashValue]);
}

Answer 1

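As LSChien pointed out in his post, the issue is L1 cache coherency. Using __threadfence() guarantees that shared and global memory writes become visible to other threads, but because it is not atomic, thread x in block 1 may still read a cached memory value even though thread y in block 0 has already executed the threadfence instruction. LSChien suggested a workaround in his post: use atomicCAS() to force the thread to read from global memory instead of a cached value. The proper way to do this is to declare the memory as volatile, which requires every write to that memory to be visible to all other threads in the grid immediately.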
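The volatile approach can be illustrated with a reduced case. This is only a minimal sketch, not the asker's hash table: a single counter in global memory stands in for the table, and the names locked_increment and run_locked_increment are hypothetical. Only thread 0 of each block takes the lock, which is an assumption made here so that threads of one warp never spin on the same lock. The lock()/unlock() helpers are the ones from the question.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Spin lock helpers, as in the question.
__device__ void lock(int *mutex) {
    while (atomicCAS(mutex, 0, 1) != 0) { }
}

__device__ void unlock(int *mutex) {
    atomicExch(mutex, 0);
}

// Hypothetical kernel: thread 0 of every block adds 1 to a shared counter.
// The counter is accessed through a volatile pointer, so each load and
// store is issued to memory instead of reusing a previously read value.
__global__ void locked_increment(volatile unsigned int *counter, int *mutex) {
    if (threadIdx.x == 0) {
        lock(mutex);
        unsigned int value = *counter;  // volatile load
        *counter = value + 1;           // volatile store
        __threadfence();                // order the store before the unlock
        unlock(mutex);
    }
}

// Hypothetical host helper: launches `blocks` blocks and returns the counter.
unsigned int run_locked_increment(int blocks) {
    unsigned int *d_counter = nullptr;
    int *d_mutex = nullptr;
    cudaMalloc(&d_counter, sizeof(unsigned int));
    cudaMalloc(&d_mutex, sizeof(int));
    cudaMemset(d_counter, 0, sizeof(unsigned int));
    cudaMemset(d_mutex, 0, sizeof(int));

    locked_increment<<<blocks, 32>>>(d_counter, d_mutex);

    unsigned int h_counter = 0;
    // cudaMemcpy waits for the kernel and reports any earlier launch error.
    cudaError_t err = cudaMemcpy(&h_counter, d_counter, sizeof(unsigned int),
                                 cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);

    cudaFree(d_counter);
    cudaFree(d_mutex);
    return h_counter;
}
```

Calling run_locked_increment(64) from a host main() should return one increment per block. In the question's code the equivalent change would be to read and write table.pool, table.entries and *globalFirstFree through volatile-qualified pointers; note that copying a whole struct out of a volatile pointer may need member-by-member reads.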

Answer 2

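__threadfence() guarantees that, before it returns, writes to global memory are visible to other threads in the current block. That is not the same as "the write operation on global memory has completed"! Think of the caching on each multiprocessor.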
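To make the distinction between ordering and caching concrete, here is a minimal sketch of a fence-then-flag handoff between two blocks. It relies on the behavior described in the CUDA programming guide, where __threadfence() orders the calling thread's writes as observed by all threads on the device and __threadfence_block() is the block-scoped variant. The names handoff and run_handoff and the buffer layout are hypothetical, and the sketch assumes both blocks are resident at the same time, which holds for a two-block launch.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical kernel: block 0 publishes a value, block 1 waits for it.
// Both pointers are volatile so the reader re-reads memory on every access
// rather than reusing a stale value.
__global__ void handoff(volatile int *payload, volatile int *ready, int *result) {
    if (threadIdx.x != 0) return;

    if (blockIdx.x == 0) {
        *payload = 42;      // 1. write the data
        __threadfence();    // 2. order the data write before the flag write
        *ready = 1;         // 3. publish the flag
    } else {
        while (*ready == 0) { }  // spin until the flag is observed
        __threadfence();         // conservative: keep the payload read after the flag read
        *result = *payload;
    }
}

// Hypothetical host helper: runs the handoff and returns what block 1 read.
int run_handoff() {
    int *d_buf = nullptr;  // layout: [payload, ready, result]
    cudaMalloc(&d_buf, 3 * sizeof(int));
    cudaMemset(d_buf, 0, 3 * sizeof(int));

    handoff<<<2, 32>>>(d_buf, d_buf + 1, d_buf + 2);

    int h_result = 0;
    cudaError_t err = cudaMemcpy(&h_result, d_buf + 2, sizeof(int),
                                 cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);

    cudaFree(d_buf);
    return h_result;
}
```

The fence alone only orders the writer's stores; the reader also has to avoid a cached copy, which is what the volatile qualifier is used for here.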

CUDA block synchronization differences between GTS 250 and Fermi devices
