Question

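My understanding (see e.g. How can I enforce CUDA global memory coherence without declaring pointer as volatile?, CUDA block synchronization differences between GTS 250 and Fermi devices and this post in the nvidia Developer Zone) is that __threadfence() guarantees that global writes will be visible to other threads before the calling thread continues. However, even after __threadfence() has returned, another thread could still read a stale value from its L1 cache.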

That is:

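Thread A writes some data to global memory, then calls __threadfence(). Then, at some time after __threadfence() has returned and the writes are visible to all other threads, thread B is asked to read from this memory location. It finds that it has the data in L1, so it loads that. Unfortunately for the developer, the data in thread B's L1 is stale (i.e. it is as it was before thread A updated this data).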

First of all: is this correct?

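Supposing it is, then it seems to me that __threadfence() is only useful if either one can be certain that the data will not be in L1 (somewhat unlikely?) or if the read always bypasses L1 (e.g. volatile or atomics). Is this correct?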

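I ask because I have a relatively simple use case, propagating data up a binary tree using atomically set flags and __threadfence(): the first thread to reach a node exits, and the second writes data to it based on its two children (e.g. the minimum of their data). This works for most nodes, but usually fails for at least one. Declaring the data volatile gives consistently correct results, but induces a performance hit for the 99%+ of cases where a stale value is not grabbed from L1. I want to be sure this is the only solution for this algorithm. A simplified example is given below. Note that the node array is ordered breadth-first, with the leaves beginning at index start and already populated with data.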

__global__ void propagate_data(volatile Node *nodes,
                               const unsigned int n_nodes,
                               const unsigned int start,
                               unsigned int* flags)
{
    int tid, index, left, right;
    float data;
    bool first_arrival;

    tid = start + threadIdx.x + blockIdx.x*blockDim.x;

    while (tid < n_nodes)
    {
        // We start at a node with a full data section; modify its flag
        // accordingly.
        flags[tid] = 2;

        // Immediately move up the tree.
        index = nodes[tid].parent;
        first_arrival = (atomicAdd(&flags[index], 1) == 0);

        // If we are the second thread to reach this node then process it.
        while (!first_arrival)
        {
            left = nodes[index].left;
            right = nodes[index].right;

            // If Node* nodes is not declared volatile, this occasionally
            // reads a stale value from L1.
            data = min(nodes[left].data, nodes[right].data);

            nodes[index].data = data;

            if (index == 0) {
                // Root node processed, so all nodes processed.
                return;
            }

            // Ensure above global write is visible to all device threads
            // before setting flag for the parent.
            __threadfence();

            index = nodes[index].parent;
            first_arrival = (atomicAdd(&flags[index], 1) == 0);
        }
        tid += blockDim.x*gridDim.x;
    }
    return;
}

Answer 1

First of all: is this correct?

Yes, __threadfence() pushes the data into L2 and out to global memory. It has no effect on the L1 caches in other SMs.

Is this correct?

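Yes, if you combine __threadfence() with volatile for global memory accesses, you should have confidence that values will eventually be visible to other thread blocks. Note, however, that synchronization between thread blocks is not a well-defined concept in CUDA. There is no explicit mechanism for it and no guarantee about the order in which thread blocks execute. So just because your code has a __threadfence() somewhere operating on a volatile item, there is still no real guarantee about what data another thread block may pick up. That also depends on the order of execution.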
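One common way to combine these pieces without relying on block execution order is the "last block finishes the job" pattern: each block writes its result, fences, and then takes a ticket from an atomic counter, so whichever block happens to run last does the final read. The following is only a minimal sketch of that idea, not the asker's code; the names `blocks_done`, `sum_partials` and `run_sum_partials` are made up for illustration, and only thread 0 of each block does any work to keep it short.

```cuda
#include <cuda_runtime.h>

// Hypothetical device-wide counter of blocks that have published a result.
__device__ unsigned int blocks_done = 0;

// Each block sums its own slice; the last block to finish sums the partials.
__global__ void sum_partials(const float *in, volatile float *partial,
                             float *total, int n_per_block)
{
    if (threadIdx.x != 0) return;   // one worker thread per block in this sketch

    float s = 0.0f;
    for (int i = 0; i < n_per_block; ++i)
        s += in[blockIdx.x * n_per_block + i];
    partial[blockIdx.x] = s;

    // Make the partial result visible device-wide before taking a ticket.
    __threadfence();

    // atomicInc returns the previous value; the block that receives
    // gridDim.x - 1 knows every other block has already published.
    unsigned int ticket = atomicInc(&blocks_done, gridDim.x);

    if (ticket == gridDim.x - 1) {
        float t = 0.0f;
        for (unsigned int b = 0; b < gridDim.x; ++b)
            t += partial[b];        // volatile reads of the other blocks' results
        *total = t;
        blocks_done = 0;            // reset so the kernel can be launched again
    }
}

// Host-side driver for the sketch: 4 blocks, 2 values each (1..8).
float run_sum_partials()
{
    const int n_blocks = 4, n_per_block = 2;
    float h_in[n_blocks * n_per_block];
    for (int i = 0; i < n_blocks * n_per_block; ++i) h_in[i] = float(i + 1);

    float *d_in = nullptr, *d_partial = nullptr, *d_total = nullptr;
    float h_total = 0.0f;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_partial, n_blocks * sizeof(float));
    cudaMalloc(&d_total, sizeof(float));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    sum_partials<<<n_blocks, 32>>>(d_in, d_partial, d_total, n_per_block);

    cudaMemcpy(&h_total, d_total, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);
    cudaFree(d_total);
    return h_total;
}
```

The result does not depend on which block runs last: the atomic ticket decides who performs the final read, and the fence before the ticket is what orders each block's write ahead of it.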

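If you use volatile, the L1 (if enabled; current Kepler devices don't really have L1 enabled for general global access) should be bypassed. If you don't use volatile, then the L1 of the SM that is executing the __threadfence() operation should be consistent/coherent with L2 (and global memory) when the __threadfence() operation completes.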
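If the cost of declaring the whole array volatile is the concern, one option that may be worth trying is to apply volatile only at the reads that must not be stale. The sketch below assumes a hypothetical helper named `load_bypass_l1` (it is not part of CUDA) and a toy kernel; whether it recovers any performance in the tree code would need to be measured.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper (not a CUDA API): read one float through a volatile
// pointer, so the access is treated like any other volatile global read.
__device__ __forceinline__ float load_bypass_l1(const float *p)
{
    const volatile float *vp = p;   // adding the qualifier is an implicit conversion
    return *vp;
}

// Toy stand-in for "take the minimum of the two children".
__global__ void min_of_children(const float *data, int left, int right, float *out)
{
    float l = load_bypass_l1(&data[left]);
    float r = load_bypass_l1(&data[right]);
    *out = fminf(l, r);
}

// Host-side driver for the sketch; returns the value the kernel computed.
float run_min_of_children()
{
    const float h_data[2] = {3.0f, 1.5f};
    float *d_data = nullptr, *d_out = nullptr;
    float h_out = 0.0f;
    cudaMalloc(&d_data, sizeof(h_data));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_data, h_data, sizeof(h_data), cudaMemcpyHostToDevice);

    min_of_children<<<1, 1>>>(d_data, 0, 1, d_out);

    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_out);
    return h_out;
}
```

All other accesses through the plain pointer remain ordinary loads and stores, so only the two child reads pay for the volatile access.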

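Note that the L2 cache is unified across the device and is therefore always "coherent". For your use case, at least from the device code perspective, there is no difference between L2 and global memory, regardless of which SM you are on.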

And, as you indicate, (global) atomics always operate on L2/global memory.
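Because of that, an atomic can also be used as a read. A minimal sketch, assuming a hypothetical helper name `load_via_atomic` and a device that supports atomicAdd on float (compute capability 2.0 or later): adding zero leaves the stored value unchanged and returns what is currently in memory. An atomic is a read-modify-write, so it is likely to cost more than a plain or volatile load.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper (not a CUDA API): fetch the current value with an
// atomic add of zero, which is performed at L2/global memory.
__device__ __forceinline__ float load_via_atomic(float *p)
{
    return atomicAdd(p, 0.0f);
}

__global__ void copy_via_atomic(float *src, float *dst)
{
    *dst = load_via_atomic(src);
}

// Host-side driver for the sketch; returns the value read by the kernel.
float run_copy_via_atomic()
{
    float h_in = 2.25f, h_out = 0.0f;
    float *d_src = nullptr, *d_dst = nullptr;
    cudaMalloc(&d_src, sizeof(float));
    cudaMalloc(&d_dst, sizeof(float));
    cudaMemcpy(d_src, &h_in, sizeof(float), cudaMemcpyHostToDevice);

    copy_via_atomic<<<1, 1>>>(d_src, d_dst);

    cudaMemcpy(&h_out, d_dst, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_src);
    cudaFree(d_dst);
    return h_out;
}
```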

__threadfence() and L1 cache coherence
