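What is a good way to zero out data allocated with cudaMalloc? Assume that calling cudaMemset or cudaMemsetAsync from the CPU causes synchronization problems with other CUDA API calls, which forces you to do something else.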
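Edit 1
In the first image below, you can see that thread 1291997440 issued a cudaMemcpyAsync and spent some time executing it. For some reason, this cudaMemcpyAsync appears to block the cudaMemsetAsync, as shown in the second image below. Note that each CPU thread queues these operations in its own stream. Someone with a strong reputation mentioned elsewhere that using a kernel instead of cudaMemsetAsync calls might clear memory faster, which is why I am pursuing this question.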
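Edit 2
At this point I have improved the code (by reducing the size of the HtoD and DtoH copies) so that this problem no longer occurs. The images above are from the night before. If the comments are 100% correct, there must have been something else in the profiling report that I failed to notice. In the new version of the code, there is no noticeable difference between using cudaMemsetAsync and calling a kernel to clear the memory.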
Answer 0 (score: 2)
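Here I show the results of zeroing about 5.4 GB of data (80 << 24 four-byte ints, exactly 5 GiB) using three different methods. The code was compiled with -O3 and run on a V100 with 16 GB of memory.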
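Method A: cudaMemset
To establish a baseline, I zero out the data from the CPU using cudaMemset. This is very fast, but if many cudaMemcpy operations are in flight, even the cudaMemsetAsync version can end up serialized at runtime.
Result: 6 ms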
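Method B: memset
Calling memset from inside a kernel probably gets you the worst of both the CPU and GPU worlds. The expectation with memset is that once it returns, the data is exactly as you specified. Of course, that does not hold in the presence of other kernels, race conditions, and so on. Still, that is my guess as to why it is so slow.
Result: 241 ms (best run, kernel<80,8>)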
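Method C: coalesced writes
Writing in a coalesced manner seems to give the best of both the CPU and GPU worlds. It is as fast as cudaMemset issued from the CPU. Race conditions are still possible, of course, but that is obvious to any programmer who writes coalesced stores.
Result: 6 ms (with 32 or more threads per block)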
Conclusion
If you cannot use cudaMemset[Async] from the CPU, use a kernel that performs coalesced writes with 32 or more threads per block.
Program output
Starting timer for calling cudaMemset from CPU
Stopping timer for calling cudaMemset from CPU took 0.006015s
Starting timer for calling kernel<80,1> that uses memset
Stopping timer for calling kernel<80,1> that uses memset took 0.393921s
Starting timer for calling kernel<80,2> that uses memset
Stopping timer for calling kernel<80,2> that uses memset took 0.300473s
Starting timer for calling kernel<80,4> that uses memset
Stopping timer for calling kernel<80,4> that uses memset took 0.269686s
Starting timer for calling kernel<80,8> that uses memset
Stopping timer for calling kernel<80,8> that uses memset took 0.241374s
Starting timer for calling kernel<80,16> that uses memset
Stopping timer for calling kernel<80,16> that uses memset took 0.645509s
Starting timer for calling kernel<80,32> that uses memset
Stopping timer for calling kernel<80,32> that uses memset took 0.611437s
Starting timer for calling kernel<80,64> that uses memset
Stopping timer for calling kernel<80,64> that uses memset took 0.611276s
Starting timer for calling kernel<80,128> that uses memset
Stopping timer for calling kernel<80,128> that uses memset took 0.459663s
Starting timer for calling kernel<80,256> that uses memset
Stopping timer for calling kernel<80,256> that uses memset took 0.308788s
Starting timer for calling kernel<80,512> that uses memset
Stopping timer for calling kernel<80,512> that uses memset took 0.595893s
Starting timer for calling kernel<80,1024> that uses memset
Stopping timer for calling kernel<80,1024> that uses memset took 2.552866s
Starting timer for calling kernel<80,1> that performs coalesced writes
Stopping timer for calling kernel<80,1> that performs coalesced writes took 0.136967s
Starting timer for calling kernel<80,2> that performs coalesced writes
Stopping timer for calling kernel<80,2> that performs coalesced writes took 0.068426s
Starting timer for calling kernel<80,4> that performs coalesced writes
Stopping timer for calling kernel<80,4> that performs coalesced writes took 0.039974s
Starting timer for calling kernel<80,8> that performs coalesced writes
Stopping timer for calling kernel<80,8> that performs coalesced writes took 0.017121s
Starting timer for calling kernel<80,16> that performs coalesced writes
Stopping timer for calling kernel<80,16> that performs coalesced writes took 0.008586s
Starting timer for calling kernel<80,32> that performs coalesced writes
Stopping timer for calling kernel<80,32> that performs coalesced writes took 0.006139s
Starting timer for calling kernel<80,64> that performs coalesced writes
Stopping timer for calling kernel<80,64> that performs coalesced writes took 0.006075s
Starting timer for calling kernel<80,128> that performs coalesced writes
Stopping timer for calling kernel<80,128> that performs coalesced writes took 0.006093s
Starting timer for calling kernel<80,256> that performs coalesced writes
Stopping timer for calling kernel<80,256> that performs coalesced writes took 0.006479s
Starting timer for calling kernel<80,512> that performs coalesced writes
Stopping timer for calling kernel<80,512> that performs coalesced writes took 0.006972s
Starting timer for calling kernel<80,1024> that performs coalesced writes
Stopping timer for calling kernel<80,1024> that performs coalesced writes took 0.007354s
Test implementation
memset_timing.cu
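#include <cstdio>    // printf, snprintf
#include <cstdlib>   // malloc, free, exit
#include <iostream>
#include "timer.h"

static void CheckCudaErrorAux (const char *, unsigned, const char *, cudaError_t);
#define CUDA_CHECK_RETURN(value) CheckCudaErrorAux(__FILE__,__LINE__, #value, value)
// Round x up to the next multiple of `multiple`. Arguments are parenthesized so that expressions expand safely.
#define round_up(x, multiple) ((((x) + (multiple) - 1) / (multiple)) * (multiple))

const long COUNT = 80 << 24;                    // number of ints: 1,342,177,280
const int GPU_CACHE_LINE_SIZE_IN_BYTES = 32;
const long SIZE_OF_DATA = sizeof(int) * COUNT;  // 5,368,709,120 bytes; assumes a 64-bit long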

// Method B: every thread clears its own contiguous chunk with device-side memset.
__global__ void clear_scratch_space_kernel(int * data, int blocks, int threads) {
    // BOZO: change the code to just error out if we're any of the border cases below
    const int idx = blockIdx.x * threads + threadIdx.x;
    long size = sizeof(int) * COUNT;
    long size_of_typical_chunk = round_up(size / (blocks * threads), GPU_CACHE_LINE_SIZE_IN_BYTES);
    // Due to truncation, the threads at the end won't have anything to do. This is a little sloppy but costs us
    // hardly anything in performance, so we do the simpler thing.
    long this_threads_offset = idx * size_of_typical_chunk;
    if (this_threads_offset >= SIZE_OF_DATA) {
        return;
    }
    long size_of_this_threads_chunk;
    if (this_threads_offset + size_of_typical_chunk >= SIZE_OF_DATA) {
        // We are the last thread, so we do a partial write
        size_of_this_threads_chunk = SIZE_OF_DATA - this_threads_offset;
    } else {
        size_of_this_threads_chunk = size_of_typical_chunk;
    }
    void * starting_address = reinterpret_cast<void *>(reinterpret_cast<char *>(data) + this_threads_offset);
    memset(starting_address, 0, size_of_this_threads_chunk);
}

// Method C: in each round, neighbouring threads of a block write neighbouring ints.
__global__ void clear_scratch_space_with_coalesced_writes_kernel(int * data, int blocks, int threads) {
    if (COUNT % (blocks * threads) != 0) {
        // Report once instead of from every thread, then bail out rather than leave a tail uncleared.
        if (blockIdx.x == 0 && threadIdx.x == 0) {
            printf("Adjust COUNT so it's divisible by the number of (blocks * threads)\n");
        }
        return;
    }
    const long count_of_ints_in_each_blocks_chunk = COUNT / blocks;
    int block = blockIdx.x;
    int thread = threadIdx.x;
    const long rounds_needed = count_of_ints_in_each_blocks_chunk / threads;
    const long this_blocks_starting_offset = block * count_of_ints_in_each_blocks_chunk;
    //printf("Clearing %li ints starting at offset %li\n", count_of_ints_in_each_blocks_chunk, this_blocks_starting_offset);
    int * this_threads_base_pointer = &data[this_blocks_starting_offset + thread];
    for (long round = 0; round < rounds_needed; ++round) {
        *this_threads_base_pointer = 0;
        this_threads_base_pointer += threads;
    }
}

void set_gpu_data_to_ones(int * data_on_gpu) {
    // cudaMemset fills bytes, so every byte becomes 0x01 (each int reads as 0x01010101). Non-zero is all we need.
    CUDA_CHECK_RETURN(cudaMemset(data_on_gpu, 1, SIZE_OF_DATA));
    CUDA_CHECK_RETURN(cudaDeviceSynchronize());
}

void check_gpu_data_is_zeroes(int * data_on_gpu, char * data_on_cpu) {
    CUDA_CHECK_RETURN(cudaMemcpy(data_on_cpu, data_on_gpu, SIZE_OF_DATA, cudaMemcpyDeviceToHost));
    for (long i = 0; i < SIZE_OF_DATA; ++i) {
        if (data_on_cpu[i] != 0) {
            // %li matches the long index (the original %i did not)
            printf("Failed to zero-out byte offset %li in the data\n", i);
        }
    }
}

int main(void)
{
    const long count = COUNT;
    int * data_on_gpu;
    char * data_on_cpu = (char *) malloc(SIZE_OF_DATA);
    if (data_on_cpu == NULL) {
        printf("Failed to allocate data on cpu\n");
        return 1;
    }
    CUDA_CHECK_RETURN(cudaMalloc(&data_on_gpu, sizeof(int) * count));

    // Method A: cudaMemset issued from the CPU
    {
        Timer memset_timer("calling cudaMemset from CPU");
        memset_timer.start();
        CUDA_CHECK_RETURN(cudaMemset(data_on_gpu, 0, SIZE_OF_DATA));
        CUDA_CHECK_RETURN(cudaDeviceSynchronize());
        memset_timer.stop_and_report();
    }

    // Method B: kernel in which each thread calls memset on its own chunk
    for (int threads = 1; threads <= 1024; threads *= 2) {
        set_gpu_data_to_ones(data_on_gpu);
        char buffer[200];
        snprintf(buffer, sizeof(buffer), "calling kernel<80,%i> that uses memset", threads);
        Timer memset_timer(buffer);
        memset_timer.start();
        clear_scratch_space_kernel<<<80, threads>>>(data_on_gpu, 80, threads);
        CUDA_CHECK_RETURN(cudaDeviceSynchronize());
        memset_timer.stop_and_report();
        check_gpu_data_is_zeroes(data_on_gpu, data_on_cpu);
    }

    // Method C: kernel that performs coalesced writes
    for (int threads = 1; threads <= 1024; threads *= 2) {
        set_gpu_data_to_ones(data_on_gpu);
        char buffer[200];
        snprintf(buffer, sizeof(buffer), "calling kernel<80,%i> that performs coalesced writes", threads);
        Timer memset_timer(buffer);
        memset_timer.start();
        clear_scratch_space_with_coalesced_writes_kernel<<<80, threads>>>(data_on_gpu, 80, threads);
        CUDA_CHECK_RETURN(cudaDeviceSynchronize());
        memset_timer.stop_and_report();
        check_gpu_data_is_zeroes(data_on_gpu, data_on_cpu);
    }

    CUDA_CHECK_RETURN(cudaFree(data_on_gpu));
    free(data_on_cpu);
    return 0;
}

/**
 * Check the return value of the CUDA runtime API call and exit
 * the application if the call has failed.
 */
static void CheckCudaErrorAux (const char *file, unsigned line, const char *statement, cudaError_t err)
{
    if (err == cudaSuccess)
        return;
    std::cerr << statement << " returned " << cudaGetErrorString(err) << "(" << err << ") at " << file << ":" << line << std::endl;
    exit (1);
}
timer.h
#pragma once
#include <string>
#include <chrono>

class Timer {
public:
    Timer(std::string name_, bool allow_output = true);
    virtual ~Timer();
    void start();
    void start_or_restart();
    void stop(bool force_no_output = false);
    void report(const int count = 0, bool preface_with_spaces = true);
    void stop_and_report(const int count = 0);
    double duration_in_seconds();
    long duration_in_microseconds();
private:
    std::string name;
    // even though we call report, we still might suppress output since the output is often a type of debugging info
    bool allow_output;
    std::chrono::time_point<std::chrono::system_clock> start_time;
    std::chrono::time_point<std::chrono::system_clock> end_time;
    bool started_before = false;
    bool currently_rolling = false; // if timer (i.e., the clock) is currently rolling
    double duration = -1.0;         // accumulated seconds; negative means "never started"
};
Timer.cpp
#include <cstdio>     // printf
#include <stdexcept>
#include "timer.h"

Timer::Timer(std::string name_, bool allow_output_) {
    name = name_;
    allow_output = allow_output_;
}

Timer::~Timer() {
}

void Timer::start() {
#ifdef DEBUG
    if(started_before) {
        printf("Attempting to start same timer twice. Exiting.\n");
        throw std::runtime_error("Attempting to start timer that was previously started");
    }
#endif
    if (allow_output) {
        printf("Starting timer for %s\n", name.c_str());
    }
    start_time = std::chrono::system_clock::now();
    currently_rolling = true;
    started_before = true;
    duration = 0.0;
}

void Timer::start_or_restart() {
    if (currently_rolling) {
        throw std::runtime_error("Can't start or restart a timer that's already rolling.");
    }
    if (!started_before && allow_output) {
        printf("Starting timer for %s\n", name.c_str());
    }
    started_before = true;
    start_time = std::chrono::system_clock::now();
    currently_rolling = true;
    if (duration < 0.0) {
        duration = 0.0;
    }
}

void Timer::stop(bool force_no_output) {
    if (!force_no_output) { // Slight style violation: I prefer nested if's over && statements with two && operators
        if (allow_output && duration <= 0.0) {
            printf("Stopping timer for %s\n", name.c_str());
        }
    }
    end_time = std::chrono::system_clock::now();
    std::chrono::duration<double> elapsed_seconds = end_time - start_time;
    currently_rolling = false;
    duration += elapsed_seconds.count();
}

void Timer::stop_and_report(const int count) {
    stop(true);
    report(count, false);
}

double Timer::duration_in_seconds() {
    return duration;
}

long Timer::duration_in_microseconds() {
    return static_cast<long>(duration * 1000000);
}

void Timer::report(const int count, bool preface_with_spaces) {
    std::string preface;
    if (preface_with_spaces) {
        preface = " ";
    } else {
        preface = "Stopping ";
    }
    if (allow_output) {
        if (!started_before) {
            printf("%stimer for %s was never started\n", preface.c_str(), name.c_str());
        } else if (count > 0) {
            // average is in milliseconds; multiplying by 1000 again reports microseconds per item
            double average = (duration / static_cast<double>(count)) * 1000.0;
            printf("%stimer for %s took %fs, %.3lfus each\n", preface.c_str(), name.c_str(), duration, average * 1000.0);
        } else {
            printf("%stimer for %s took %fs\n", preface.c_str(), name.c_str(), duration);
        }
    }
}