In CUDA, how can I create a barrier at which all threads in a kernel wait until the CPU sends the barrier a signal that it is safe/useful to proceed?
I want to avoid the overhead of launching CUDA kernels. There are two kinds of overhead to avoid: (1) the cost of simply launching a kernel across X blocks and Y threads, and (2) the time spent re-initializing my shared memory, which will hold the same contents from one invocation to the next.
We recycle/reuse threads all the time in CPU workloads, and CUDA even provides event synchronization primitives. Perhaps offering a more traditional signaling object would come at minimal hardware cost.
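For reference, the event primitive I mean is the host-side pattern below, a minimal sketch of the standard cudaEvent_t API (stream_a and stream_b are placeholder names). Note that these calls coordinate streams from the host, which is exactly why they don't give me a barrier that threads inside a running kernel can wait at:
cudaEvent_t ev;
cudaEventCreateWithFlags(&ev, cudaEventBlockingSync);
cudaEventRecord(ev, stream_a);        // signal: "everything enqueued so far in stream_a is done"
cudaStreamWaitEvent(stream_b, ev, 0); // work enqueued in stream_b after this waits for ev
cudaEventSynchronize(ev);             // the CPU thread blocks (yielding) until ev has fired
cudaEventDestroy(ev);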
Here is some code that provides a skeleton for the concept I'm after. Readers will probably want to search for QUESTION IS HERE. Building it in Nsight required setting the device linker mode to separate compilation (at least, I found that necessary; since the kernel calls CUDA runtime functions from device code, it has to be built with relocatable device code and linked against the device runtime, e.g. nvcc -rdc=true ... -lcudadevrt).
#include <iostream>
#include <numeric>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <cuda_runtime_api.h>
#include <cuda.h>

static void CheckCudaErrorAux (const char *, unsigned, const char *, cudaError_t);
#define CUDA_CHECK_RETURN(value) CheckCudaErrorAux(__FILE__,__LINE__, #value, value)

const int COUNT_DOWN_ITERATIONS = 1000;
const int KERNEL_MAXIMUM_LOOPS = 5; // IRL, we'd set this large enough to prevent hitting this value, unless the kernel is externally terminated
const int SIGNALS_TO_SEND_COUNT = 3;
const int BLOCK_COUNT = 1;
const int THREADS_PER_BLOCK = 2;

__device__ void count_down(int * shared_location_to_ensure_side_effect) {
    int x = *shared_location_to_ensure_side_effect;
    for (int i = 0; i < COUNT_DOWN_ITERATIONS; ++i) {
        x += i;
    }
    *shared_location_to_ensure_side_effect = x;
}

/**
 * CUDA kernel waits for events and then counts down upon receiving them.
 */
__global__ void kernel(cudaStream_t stream, cudaEvent_t go_event, cudaEvent_t done_event, int ** cuda_malloc_managed_int_address) {

    __shared__ int local_copy_of_cuda_malloc_managed_int_address; // we always start at 0

    printf("Block %i, Thread %i: entered kernel\n", blockIdx.x, threadIdx.x);

    for (int i = 0; i < KERNEL_MAXIMUM_LOOPS; ++i) {
        printf("Block %i, Thread %i: entered loop; waitin 4 go_event\n", blockIdx.x, threadIdx.x);

        // QUESTION IS HERE: I want this to block on receiving a signal from the
        // CPU, indicating that work is ready to be done
        cudaStreamWaitEvent(stream, go_event, cudaEventBlockingSync);
        printf("Block %i, Thread %i: in loop; received go_event\n", blockIdx.x, threadIdx.x);

        if (i == 0) { // we have received the signal and data is ready to be interpreted
            local_copy_of_cuda_malloc_managed_int_address = cuda_malloc_managed_int_address[blockIdx.x][threadIdx.x];
        }

        count_down(&local_copy_of_cuda_malloc_managed_int_address);
        printf("Block %i, Thread %i: finished counting\n", blockIdx.x, threadIdx.x);

        cudaEventRecord(done_event, stream);
        printf("Block %i, Thread %i: recorded event; may loop back\n", blockIdx.x, threadIdx.x);
    }

    printf("Block %i, Thread %i: copying result %i back to managed memory\n", blockIdx.x, threadIdx.x, local_copy_of_cuda_malloc_managed_int_address);
    cuda_malloc_managed_int_address[blockIdx.x][threadIdx.x] = local_copy_of_cuda_malloc_managed_int_address;
    printf("Block %i, Thread %i: exiting kernel\n", blockIdx.x, threadIdx.x);
}

int main(void)
{
    int ** data;
    cudaMallocManaged(&data, BLOCK_COUNT * sizeof(int *));
    for (int b = 0; b < BLOCK_COUNT; ++b)
        cudaMallocManaged(&(data[b]), THREADS_PER_BLOCK * sizeof(int));

    cudaEvent_t go_event;
    cudaEventCreateWithFlags(&go_event, cudaEventBlockingSync);

    cudaEvent_t done_event;
    cudaEventCreateWithFlags(&done_event, cudaEventBlockingSync);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    CUDA_CHECK_RETURN(cudaDeviceSynchronize()); // probably unnecessary

    printf("CPU: spawning kernel\n");
    kernel<<<BLOCK_COUNT, THREADS_PER_BLOCK, sizeof(int), stream>>>(stream, go_event, done_event, data);

    for (int i = 0; i < SIGNALS_TO_SEND_COUNT; ++i) {
        usleep(4 * 1000 * 1000); // accepts time in microseconds

        // Simulate the sending of the "next" piece of work
        data[0][0] = i;     // unrolled, because it's easier to read
        data[0][1] = i + 1; // unrolled, because it's easier to read

        printf("CPU: sending go_event\n");
        cudaEventRecord(go_event, stream);
        cudaStreamWaitEvent(stream, done_event, cudaEventBlockingSync); // doesn't block even though I wish it would
    }
    CUDA_CHECK_RETURN(cudaDeviceSynchronize());

    for (int b = 0; b < BLOCK_COUNT; ++b) {
        for (int t = 0; t < THREADS_PER_BLOCK; ++t) {
            printf("Result for Block %i and Thread %i: %i\n", b, t, data[b][t]);
        }
    }

    for (int b = 0; b < BLOCK_COUNT; ++b)
        cudaFree(data[b]);
    cudaFree(data);
    cudaEventDestroy(done_event);
    cudaEventDestroy(go_event);
    cudaStreamDestroy(stream);

    printf("CPU: exiting program");
    return 0;
}

/**
 * Check the return value of the CUDA runtime API call and exit
 * the application if the call has failed.
 */
static void CheckCudaErrorAux (const char *file, unsigned line, const char *statement, cudaError_t err)
{
    if (err == cudaSuccess)
        return;
    std::cerr << statement << " returned " << cudaGetErrorString(err) << "(" << err << ") at " << file << ":" << line << std::endl;
    exit(1);
}
Here is the output from running it. Note that the results are "wrong" only because they get overwritten by the loop whose signals were supposed to act as the blocking mechanism for the GPU threads.
CPU: spawning kernel
Block 0, Thread 0: entered kernel
Block 0, Thread 1: entered kernel
Block 0, Thread 0: entered loop; waitin 4 go_event
Block 0, Thread 1: entered loop; waitin 4 go_event
Block 0, Thread 0: in loop; received go_event
Block 0, Thread 1: in loop; received go_event
Block 0, Thread 0: finished counting
Block 0, Thread 1: finished counting
Block 0, Thread 0: recorded event; may loop back
Block 0, Thread 1: recorded event; may loop back
Block 0, Thread 0: entered loop; waitin 4 go_event
Block 0, Thread 1: entered loop; waitin 4 go_event
Block 0, Thread 0: in loop; received go_event
Block 0, Thread 1: in loop; received go_event
Block 0, Thread 0: finished counting
Block 0, Thread 1: finished counting
Block 0, Thread 0: recorded event; may loop back
Block 0, Thread 1: recorded event; may loop back
Block 0, Thread 0: entered loop; waitin 4 go_event
Block 0, Thread 1: entered loop; waitin 4 go_event
Block 0, Thread 0: in loop; received go_event
Block 0, Thread 1: in loop; received go_event
Block 0, Thread 0: finished counting
Block 0, Thread 1: finished counting
Block 0, Thread 0: recorded event; may loop back
Block 0, Thread 1: recorded event; may loop back
Block 0, Thread 0: entered loop; waitin 4 go_event
Block 0, Thread 1: entered loop; waitin 4 go_event
Block 0, Thread 0: in loop; received go_event
Block 0, Thread 1: in loop; received go_event
Block 0, Thread 0: finished counting
Block 0, Thread 1: finished counting
Block 0, Thread 0: recorded event; may loop back
Block 0, Thread 1: recorded event; may loop back
Block 0, Thread 0: entered loop; waitin 4 go_event
Block 0, Thread 1: entered loop; waitin 4 go_event
Block 0, Thread 0: in loop; received go_event
Block 0, Thread 1: in loop; received go_event
Block 0, Thread 0: finished counting
Block 0, Thread 1: finished counting
Block 0, Thread 0: recorded event; may loop back
Block 0, Thread 1: recorded event; may loop back
Block 0, Thread 0: copying result 2497500 back to managed memory
Block 0, Thread 1: copying result 2497500 back to managed memory
Block 0, Thread 0: exiting kernel
Block 0, Thread 1: exiting kernel
CPU: sending go_event
CPU: sending go_event
CPU: sending go_event
Result for Block 0 and Thread 0: 2
Result for Block 0 and Thread 1: 3
CPU: exiting program
Answer 0 (score: 0)
Read the other answer first. This one is here only for reference; I will either downvote it or delete it.
One possible implementation is to keep a set of flags or integers in device memory. The CUDA threads would block (perhaps by spinning with calls to clock64()) until the flag/integer reaches a certain value, indicating that there is more work for the CUDA threads to process. This is probably slower than using a first-class CUDA-provided synchronization primitive, but faster than re-initializing __shared__ memory on every kernel invocation. It also involves some kind of busy-wait/sleep mechanism, which I'm not thrilled about.
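Distilled to its essentials, the device side of that idea would look something like this (a minimal sketch; worker and go_flag are placeholder names, and the full program below has the real version):
__global__ void worker(volatile int * go_flag /* lives in cudaMallocManaged memory */) {
    for (int i = 0; i < KERNEL_MAXIMUM_LOOPS; ++i) {
        while (*go_flag <= i) {
            // spin (optionally backing off with clock64()-based delays) until the CPU bumps the flag
        }
        // ... process work item i here ...
    }
}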
Follow-up: it seems to work, some of the time (the printf calls seem to help). I'm guessing there is some undefined behavior in managed memory that happens to work in my favor. Here is the code:
#include <iostream>
#include <numeric>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <cuda_runtime_api.h>
#include <cuda.h>

static void CheckCudaErrorAux (const char *, unsigned, const char *, cudaError_t);
#define CUDA_CHECK_RETURN(value) CheckCudaErrorAux(__FILE__,__LINE__, #value, value)

const int COUNT_DOWN_ITERATIONS = 1000;
const int KERNEL_MAXIMUM_LOOPS = 5; // IRL, we'd set this large enough to prevent hitting this value, unless the kernel is externally terminated
const int SIGNALS_TO_SEND_COUNT = 3;
const int BLOCK_COUNT = 1;
const int THREADS_PER_BLOCK = 2;

__device__ void count_down(int * shared_location_to_ensure_side_effect) {
    int x = *shared_location_to_ensure_side_effect;
    for (int i = 0; i < COUNT_DOWN_ITERATIONS; ++i) {
        x += i;
    }
    *shared_location_to_ensure_side_effect = x;
}

__device__ void clock_block(clock_t clock_count)
{
    //printf("time used so far: %lu\n", clock64());
    clock_t start_clock = clock64();
    while (clock64() - start_clock < clock_count);
}

/**
 * CUDA kernel waits for flag to increment and then counts down.
 */
__global__ void kernel_block_via_flag(cudaStream_t stream, cudaEvent_t go_event, cudaEvent_t done_event, int ** cuda_malloc_managed_int_address, int * cuda_malloc_managed_synchronization_flag) {

    __shared__ int local_copy_of_cuda_malloc_managed_int_address; // we always start at 0

    printf("Block %i, Thread %i: entered kernel\n", blockIdx.x, threadIdx.x);

    for (int i = 0; i < KERNEL_MAXIMUM_LOOPS; ++i) {
        printf("Block %i, Thread %i: entered loop; waitin 4 go_event\n", blockIdx.x, threadIdx.x);

        while (*cuda_malloc_managed_synchronization_flag <= i)
            //printf("%lu\n", *cuda_malloc_managed_synchronization_flag);
            clock_block(1000000000); // in cycles, not seconds!

        cudaStreamWaitEvent(stream, go_event, cudaEventBlockingSync);
        printf("Block %i, Thread %i: in loop; received go_event\n", blockIdx.x, threadIdx.x);

        if (i == 0) { // we have received the signal and data is ready to be interpreted
            local_copy_of_cuda_malloc_managed_int_address = cuda_malloc_managed_int_address[blockIdx.x][threadIdx.x];
        }

        count_down(&local_copy_of_cuda_malloc_managed_int_address);
        printf("Block %i, Thread %i: finished counting\n", blockIdx.x, threadIdx.x);

        cudaEventRecord(done_event, stream);
        printf("Block %i, Thread %i: recorded event; may loop back\n", blockIdx.x, threadIdx.x);
    }

    printf("Block %i, Thread %i: copying result %i back to managed memory\n", blockIdx.x, threadIdx.x, local_copy_of_cuda_malloc_managed_int_address);
    cuda_malloc_managed_int_address[blockIdx.x][threadIdx.x] = local_copy_of_cuda_malloc_managed_int_address;
    printf("Block %i, Thread %i: exiting kernel\n", blockIdx.x, threadIdx.x);
}

int main(void)
{
    int ** data;
    cudaMallocManaged(&data, BLOCK_COUNT * sizeof(int *));
    for (int b = 0; b < BLOCK_COUNT; ++b)
        cudaMallocManaged(&(data[b]), THREADS_PER_BLOCK * sizeof(int));

    cudaEvent_t go_event;
    cudaEventCreateWithFlags(&go_event, cudaEventBlockingSync);

    cudaEvent_t done_event;
    cudaEventCreateWithFlags(&done_event, cudaEventBlockingSync);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    int * synchronization_flag;
    cudaMallocManaged(&synchronization_flag, sizeof(int));
    //cudaMalloc(&synchronization_flag, sizeof(int));
    //int my_copy_of_synchronization_flag = 0;

    CUDA_CHECK_RETURN(cudaDeviceSynchronize()); // probably unnecessary

    printf("CPU: spawning kernel\n");
    kernel_block_via_flag<<<BLOCK_COUNT, THREADS_PER_BLOCK, sizeof(int), stream>>>(stream, go_event, done_event, data, synchronization_flag);
    CUDA_CHECK_RETURN(cudaMemAdvise(synchronization_flag, sizeof(int), cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId));

    for (int i = 0; i < SIGNALS_TO_SEND_COUNT; ++i) {
        usleep(4 * 1000 * 1000); // accepts time in microseconds

        // Simulate the sending of the "next" piece of work
        data[0][0] = i;     // unrolled, because it's easier to read
        data[0][1] = i + 1; // unrolled, because it's easier to read

        printf("CPU: sending go_event\n");
        //++my_copy_of_synchronization_flag;
        //CUDA_CHECK_RETURN(cudaMemcpyAsync(synchronization_flag, &my_copy_of_synchronization_flag, sizeof(int), cudaMemcpyHostToDevice));
        *synchronization_flag = *synchronization_flag + 1; // since it's monotonically increasing, and only written to by the CPU code, this is fine
    }
    CUDA_CHECK_RETURN(cudaDeviceSynchronize());

    for (int b = 0; b < BLOCK_COUNT; ++b) {
        for (int t = 0; t < THREADS_PER_BLOCK; ++t) {
            printf("Result for Block %i and Thread %i: %i\n", b, t, data[b][t]);
        }
    }

    for (int b = 0; b < BLOCK_COUNT; ++b)
        cudaFree(data[b]);
    cudaFree(data);
    cudaFree(synchronization_flag);
    cudaEventDestroy(done_event);
    cudaEventDestroy(go_event);
    cudaStreamDestroy(stream);

    printf("CPU: exiting program");
    return 0;
}

/**
 * Check the return value of the CUDA runtime API call and exit
 * the application if the call has failed.
 */
static void CheckCudaErrorAux (const char *file, unsigned line, const char *statement, cudaError_t err)
{
    if (err == cudaSuccess)
        return;
    std::cerr << statement << " returned " << cudaGetErrorString(err) << "(" << err << ") at " << file << ":" << line << std::endl;
    exit(1);
}
Output:
CPU: spawning kernel
Block 0, Thread 0: entered kernel
Block 0, Thread 1: entered kernel
Block 0, Thread 0: entered loop; waitin 4 go_event
Block 0, Thread 1: entered loop; waitin 4 go_event
CPU: sending go_event
Block 0, Thread 0: in loop; received go_event
Block 0, Thread 1: in loop; received go_event
Block 0, Thread 0: finished counting
Block 0, Thread 1: finished counting
Block 0, Thread 0: recorded event; may loop back
Block 0, Thread 1: recorded event; may loop back
Block 0, Thread 0: entered loop; waitin 4 go_event
Block 0, Thread 1: entered loop; waitin 4 go_event
CPU: sending go_event
Block 0, Thread 0: in loop; received go_event
Block 0, Thread 1: in loop; received go_event
Block 0, Thread 0: finished counting
Block 0, Thread 1: finished counting
Block 0, Thread 0: recorded event; may loop back
Block 0, Thread 1: recorded event; may loop back
Block 0, Thread 0: entered loop; waitin 4 go_event
Block 0, Thread 1: entered loop; waitin 4 go_event
CPU: sending go_event
Block 0, Thread 0: in loop; received go_event
Block 0, Thread 1: in loop; received go_event
Block 0, Thread 0: finished counting
Block 0, Thread 1: finished counting
Block 0, Thread 0: recorded event; may loop back
Block 0, Thread 1: recorded event; may loop back
Block 0, Thread 0: entered loop; waitin 4 go_event
Block 0, Thread 1: entered loop; waitin 4 go_event
This is still a poor solution. I hope to accept someone else's answer instead.
Answer 1 (score: 0)
Read this answer. I plan to delete the first one once consensus is reached here, since I expect its only value to be historical.
One possible implementation is to keep a set of flags or integers in device memory. The CUDA threads would block (for example, by spinning with calls to clock64()) until the flag/integer reaches a certain value, indicating that there is more work for the CUDA threads to process. This is probably slower than using a first-class CUDA-provided synchronization primitive, but faster than re-initializing shared memory on every kernel invocation. It also involves some kind of busy-wait/sleep mechanism, which I'm not thrilled about.
Here is an implementation that seems to work. However, I worry that I'm relying on some undefined behavior of managed memory that happens to favor my program's execution. Here is the code:
#include <iostream>
#include <numeric>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <cuda_runtime_api.h>
#include <cuda.h>
#include <chrono>
#include <thread>

static void CheckCudaErrorAux (const char *, unsigned, const char *, cudaError_t);
#define CUDA_CHECK_RETURN(value) CheckCudaErrorAux(__FILE__,__LINE__, #value, value)

const int COUNT_DOWN_ITERATIONS = 1000;
const int KERNEL_MAXIMUM_LOOPS = 1000; // IRL, we'd set this large enough to prevent hitting this value, unless the kernel is externally terminated
const int SIGNALS_TO_SEND_COUNT = 1000;
const int BLOCK_COUNT = 1;
const int THREADS_PER_BLOCK = 2;

__device__ void count_down(int * shared_location_to_ensure_side_effect) {
    int x = *shared_location_to_ensure_side_effect;
    for (int i = 0; i < COUNT_DOWN_ITERATIONS; ++i) {
        x += i;
    }
    *shared_location_to_ensure_side_effect = x;
}

__device__ void clock_block(clock_t clock_count)
{
    clock_t start_clock = clock64();
    while (clock64() - start_clock < clock_count);
}

/**
 * CUDA kernel waits for flag to increment and then counts down.
 */
__global__ void spawn_worker_threads(int ** cuda_malloc_managed_int_address, int * cuda_malloc_managed_go_flag, int * cuda_malloc_managed_done_flag) {

    __shared__ int local_copy_of_cuda_malloc_managed_int_address; // we always start at 0

    volatile int * my_go_flag = cuda_malloc_managed_go_flag;
    volatile int * volatile_done_flag = cuda_malloc_managed_done_flag;

    printf("Block %i, Thread %i: entered kernel\n", blockIdx.x, threadIdx.x);

    for (int i = 0; i < KERNEL_MAXIMUM_LOOPS; ++i) {
        while (*my_go_flag <= i) {
            clock_block(10000); // in cycles, not seconds!
        }

        if (i == 0) { // we have received the signal and data is ready to be interpreted
            local_copy_of_cuda_malloc_managed_int_address = cuda_malloc_managed_int_address[blockIdx.x][threadIdx.x];
        }

        count_down(&local_copy_of_cuda_malloc_managed_int_address);

        // Wait for all worker threads to finish and then signal readiness for new work
        __syncthreads(); // TODO: sync with other blocks too
        if (blockIdx.x == 0 && threadIdx.x == 0)
            *volatile_done_flag = *volatile_done_flag + 1;
        //__threadfence_system(); // based on the documentation, it's not clear that this should actually help
    }

    printf("Block %i, Thread %i: copying result %i back to managed memory\n", blockIdx.x, threadIdx.x, local_copy_of_cuda_malloc_managed_int_address);
    cuda_malloc_managed_int_address[blockIdx.x][threadIdx.x] = local_copy_of_cuda_malloc_managed_int_address;
    printf("Block %i, Thread %i: exiting kernel\n", blockIdx.x, threadIdx.x);
}

int main(void)
{
    int ** data;
    cudaMallocManaged(&data, BLOCK_COUNT * sizeof(int *));
    for (int b = 0; b < BLOCK_COUNT; ++b)
        cudaMallocManaged(&(data[b]), THREADS_PER_BLOCK * sizeof(int));

    int * go_flag;
    int * done_flag;
    cudaMallocManaged(&go_flag, sizeof(int));
    cudaMallocManaged(&done_flag, sizeof(int));
    volatile int * my_volatile_done_flag = done_flag;

    printf("CPU: spawning kernel\n");
    spawn_worker_threads<<<BLOCK_COUNT, THREADS_PER_BLOCK>>>(data, go_flag, done_flag);

    // The cudaMemAdvise calls seem to be unnecessary, but they make it ~13% faster
    CUDA_CHECK_RETURN(cudaMemAdvise(go_flag, sizeof(int), cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId));
    CUDA_CHECK_RETURN(cudaMemAdvise(done_flag, sizeof(int), cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId));

    for (int i = 0; i < SIGNALS_TO_SEND_COUNT; ++i) {
        if (i % 50 == 0) printf("============== CPU: On iteration %i ============\n", i);

        // Simulate the writing of the "next" piece of work
        data[0][0] = i;     // unrolled, because it's easier to read this way
        data[0][1] = i + 1; // unrolled, because it's easier to read

        *go_flag = *go_flag + 1; // since it's monotonically increasing, and only written to by the CPU code, this is fine

        while (*my_volatile_done_flag < i)
            std::this_thread::sleep_for(std::chrono::microseconds(50));
    }
    CUDA_CHECK_RETURN(cudaDeviceSynchronize());

    for (int b = 0; b < BLOCK_COUNT; ++b)
        for (int t = 0; t < THREADS_PER_BLOCK; ++t)
            printf("Result for Block %i and Thread %i: %i\n", b, t, data[b][t]);

    for (int b = 0; b < BLOCK_COUNT; ++b)
        cudaFree(data[b]);
    cudaFree(data);
    cudaFree(go_flag);
    cudaFree(done_flag);

    printf("CPU: exiting program");
    return 0;
}

/**
 * Check the return value of the CUDA runtime API call and exit
 * the application if the call has failed.
 */
static void CheckCudaErrorAux (const char *file, unsigned line, const char *statement, cudaError_t err)
{
    if (err == cudaSuccess)
        return;
    std::cerr << statement << " returned " << cudaGetErrorString(err) << "(" << err << ") at " << file << ":" << line << std::endl;
    exit(1);
}
Here is the output. It shows that spawning the kernel takes about 50 ms, and that each "recycle" takes about 50 microseconds. That is entirely within tolerance for my actual application. (The timer line at the top comes from harness code not shown in the listing; a sketch of it follows the output.)
Starting timer for Synchronization timer
CPU: spawning kernel
============== CPU: On iteration 0 ============
============== CPU: On iteration 50 ============
============== CPU: On iteration 100 ============
============== CPU: On iteration 150 ============
============== CPU: On iteration 200 ============
============== CPU: On iteration 250 ============
============== CPU: On iteration 300 ============
============== CPU: On iteration 350 ============
============== CPU: On iteration 400 ============
============== CPU: On iteration 450 ============
============== CPU: On iteration 500 ============
============== CPU: On iteration 550 ============
============== CPU: On iteration 600 ============
============== CPU: On iteration 650 ============
============== CPU: On iteration 700 ============
============== CPU: On iteration 750 ============
============== CPU: On iteration 800 ============
============== CPU: On iteration 850 ============
============== CPU: On iteration 900 ============
============== CPU: On iteration 950 ============
Block 0, Thread 0: entered kernel
Block 0, Thread 1: entered kernel
Block 0, Thread 0: copying result 499500001 back to managed memory
Block 0, Thread 1: copying result 499500001 back to managed memory
Block 0, Thread 0: exiting kernel
Block 0, Thread 1: exiting kernel
Result for Block 0 and Thread 0: 499500001
Result for Block 0 and Thread 1: 499500001
CPU: exiting program
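A hypothetical sketch of the kind of std::chrono timing harness that would produce that "Starting timer" line and the numbers above (the label text and placement are placeholders, not the code I actually ran):
auto start = std::chrono::steady_clock::now();
// ... kernel launch plus the signaling loop from main() ...
CUDA_CHECK_RETURN(cudaDeviceSynchronize());
auto stop = std::chrono::steady_clock::now();
auto us = std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
std::cout << "Synchronization timer: " << us << " us for " << SIGNALS_TO_SEND_COUNT << " signals" << std::endl;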
Thanks to @einpoklum and @robertcrovella for suggesting the use of volatile. It seems to work, but I have little experience with volatile. Based on what I've read, this is a valid and correct usage that should result in defined behavior. Would you mind confirming or correcting that conclusion?
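For completeness: one way to sidestep the volatile question entirely (an untested sketch, assuming a toolkit recent enough to ship libcu++, roughly CUDA 10.2+, and an sm_70+ GPU for system-scope atomics; none of this is from the code above) would be to make the flags cuda::atomic values in managed memory, which have defined CPU/GPU visibility semantics:
#include <cuda/atomic>

using SystemFlag = cuda::atomic<int, cuda::thread_scope_system>;

__global__ void worker(SystemFlag * go_flag, SystemFlag * done_flag) {
    for (int i = 0; i < 1000; ++i) {
        // acquire-load replaces the volatile read; no reliance on unspecified behavior
        while (go_flag->load(cuda::std::memory_order_acquire) <= i) { /* spin */ }
        // ... do the work for item i ...
        if (blockIdx.x == 0 && threadIdx.x == 0)
            done_flag->fetch_add(1, cuda::std::memory_order_release); // publish completion
    }
}

// Host side (placement-new into managed memory, then signal with the same atomic API):
//   SystemFlag * go_flag;
//   cudaMallocManaged(&go_flag, sizeof(SystemFlag));
//   new (go_flag) SystemFlag(0);
//   go_flag->fetch_add(1, cuda::std::memory_order_release);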