CUDA: what is the difference between using __syncthreads() and __threadfence()?

Time: 2011-03-09 04:49:42

Tags: cuda

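I copied the code below from the NVIDIA manual, where it is given as an example for __threadfence(). Why did they use __threadfence() in this code? I would have thought that using __syncthreads() instead of __threadfence() gives the same result.

Can someone explain the difference between a call to __syncthreads() and a call to __threadfence()?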

__device__ unsigned int count = 0;
__shared__ bool isLastBlockDone;

__global__ void sum(const float* array, unsigned int N,float* result)
{
    // Each block sums a subset of the input array
    float partialSum = calculatePartialSum(array, N);

    if (threadIdx.x == 0) {
        // Thread 0 of each block stores the partial sum
        // to global memory
        result[blockIdx.x] = partialSum;

        // Thread 0 makes sure its result is visible to
        // all other threads
        __threadfence();

        // Thread 0 of each block signals that it is done
        unsigned int value = atomicInc(&count, gridDim.x);

        // Thread 0 of each block determines if its block is
        // the last block to be done
        isLastBlockDone = (value == (gridDim.x - 1));
    }

    // Synchronize to make sure that each thread reads
    // the correct value of isLastBlockDone
    __syncthreads();

    if (isLastBlockDone) 
    {
        // The last block sums the partial sums
        // stored in result[0 .. gridDim.x-1]
        float totalSum = calculateTotalSum(result);

        if (threadIdx.x == 0)
        {
            // Thread 0 of last block stores total sum
            // to global memory and resets count so that
            // next kernel call works properly
            result[0] = totalSum;
            count = 0;
        }
    }
}

1 answer:

Answer 0 (score: 15)

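As far as shared memory is concerned, __syncthreads() is simply stronger than __threadfence(). As far as global memory is concerned, they are two different things.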

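__threadfence_block() stalls the current thread until all of its writes to shared memory are visible to the other threads of the same block. It also stops the compiler from optimising by caching shared memory writes in registers. It does not synchronise the threads, and not all threads have to actually reach this instruction.
__threadfence() stalls the current thread until all of its writes to shared and global memory are visible to all other threads.
__syncthreads() must be reached by all threads of the block (so, for example, it must not sit inside a divergent if statement), and it ensures that the code preceding the instruction is executed before the instructions following it, for all threads in the block.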
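To make the barrier behaviour of __syncthreads() concrete, here is a minimal sketch of my own (it is not from the NVIDIA manual). The names reverseTile, runReverseTile and kBlockSize are hypothetical. Every thread of a single block writes one element to shared memory, all threads meet at the barrier, and only then does each thread read an element written by a different thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kBlockSize = 256;  // hypothetical block size for this sketch

// Reverses one block-sized tile in place through shared memory.
__global__ void reverseTile(int* data)
{
    __shared__ int tile[kBlockSize];
    const int t = threadIdx.x;

    tile[t] = data[t];

    // Barrier: every thread of the block reaches this line (it is not inside
    // a divergent branch), so all writes to tile[] above are complete and
    // visible before any thread performs the read below.
    __syncthreads();

    data[t] = tile[kBlockSize - 1 - t];
}

// Host helper (hypothetical): runs the kernel on 0..kBlockSize-1 and
// returns true if the device returned the reversed sequence.
bool runReverseTile()
{
    int host[kBlockSize];
    for (int i = 0; i < kBlockSize; ++i) host[i] = i;

    int* dev = nullptr;
    if (cudaMalloc(&dev, sizeof(host)) != cudaSuccess) return false;
    cudaMemcpy(dev, host, sizeof(host), cudaMemcpyHostToDevice);

    reverseTile<<<1, kBlockSize>>>(dev);

    // This copy waits for the kernel and also reports launch errors.
    cudaError_t err = cudaMemcpy(host, dev, sizeof(host), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < kBlockSize; ++i)
        if (host[i] != kBlockSize - 1 - i) return false;
    return true;
}

int main()
{
    printf("%s\n", runReverseTile() ? "ok" : "mismatch");
    return 0;
}
```

Without the barrier, a thread could read tile[kBlockSize - 1 - t] before the thread that owns that slot has written it. Note that the barrier only covers the threads of one block; it says nothing about other blocks.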

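In your particular case, the __threadfence() instruction is used to make sure that the writes to the global array result are visible to everyone. __syncthreads() would only synchronise the threads of the current block, without enforcing that the global memory writes become visible to other blocks. What is more, at that point in your code you are inside an if branch and only one thread is executing that code; using __syncthreads() there would result in undefined behaviour on the GPU, most likely leaving the kernel completely out of sync.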
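For reference, here is a compilable sketch of the pattern from the question. The manual leaves calculatePartialSum and calculateTotalSum undefined, so I have replaced them with a hypothetical helper, blockReduce, plus two simple loops; kThreads and sumOnDevice are also my own names. I have additionally declared result as volatile so that the last block does not read stale cached values; that is an assumption on my part and not something the code in the question does.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr unsigned int kThreads = 256;  // hypothetical; blockDim.x must equal this

__device__ unsigned int count = 0;

// Hypothetical helper: shared-memory tree reduction of one value per thread.
// Must be called by every thread of the block; returns the block-wide sum.
__device__ float blockReduce(float v)
{
    __shared__ float scratch[kThreads];
    scratch[threadIdx.x] = v;
    __syncthreads();
    for (unsigned int s = kThreads / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) scratch[threadIdx.x] += scratch[threadIdx.x + s];
        __syncthreads();
    }
    float total = scratch[0];
    __syncthreads();  // scratch may be reused by a later call
    return total;
}

__global__ void sum(const float* array, unsigned int N, volatile float* result)
{
    __shared__ bool isLastBlockDone;

    // Each block sums a subset of the input (grid-stride loop).
    float v = 0.0f;
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < N;
         i += gridDim.x * blockDim.x)
        v += array[i];
    float partialSum = blockReduce(v);

    if (threadIdx.x == 0) {
        result[blockIdx.x] = partialSum;
        // Fence, not barrier: orders the write above before the atomic below
        // as seen by threads in OTHER blocks. Only thread 0 is here, so
        // __syncthreads() would be illegal at this point.
        __threadfence();
        unsigned int value = atomicInc(&count, gridDim.x);
        isLastBlockDone = (value == gridDim.x - 1);
    }

    // Barrier: all threads of this block see the flag set by thread 0.
    __syncthreads();

    if (isLastBlockDone) {  // uniform across the block, so barriers inside are safe
        float w = 0.0f;
        for (unsigned int i = threadIdx.x; i < gridDim.x; i += blockDim.x)
            w += result[i];
        float totalSum = blockReduce(w);
        if (threadIdx.x == 0) {
            result[0] = totalSum;
            count = 0;  // reset so the next launch works
        }
    }
}

// Host helper (hypothetical): sums n floats using `blocks` thread blocks.
float sumOnDevice(const float* hostData, unsigned int n, unsigned int blocks)
{
    float *dArray = nullptr, *dResult = nullptr;
    cudaMalloc(&dArray, n * sizeof(float));
    cudaMalloc(&dResult, blocks * sizeof(float));
    cudaMemcpy(dArray, hostData, n * sizeof(float), cudaMemcpyHostToDevice);
    sum<<<blocks, kThreads>>>(dArray, n, dResult);
    float total = 0.0f;
    cudaMemcpy(&total, dResult, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dArray);
    cudaFree(dResult);
    return total;
}

int main()
{
    std::vector<float> data(1u << 16, 1.0f);
    printf("%f\n", sumOnDevice(data.data(), 1u << 16, 32));
    return 0;
}
```

The two primitives do different jobs here: __threadfence() makes sure that a block which observes the final counter value can also observe every partial sum, while __syncthreads() only distributes isLastBlockDone within one block.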

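Have a look at the following sections of the CUDA C Programming Guide:

3.2.2 "Shared Memory" - the matrix multiplication example

5.4.3 "Synchronization Instruction"

B.2.5 "volatile"

B.5 "Memory Fence Functions"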
