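I'm new to the GPU world and just installed CUDA to write some programs. I've been playing with the thrust library, but uploading data to the GPU turns out to be very slow: the host-to-device part only reaches about 35 MB/s on my reasonably capable desktop. What is going on?
Environment: Visual Studio 2012, CUDA 5.0, GTX760, Intel-i7, Windows 7 x64
GPU bandwidth test:
The transfer speed should be at least 11 GB/s from host to device, and likewise in the other direction. But it isn't!
Here is the test program: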
    #include <cstdlib>
    #include <ctime>
    #include <iostream>
    #include <thrust/copy.h>
    #include <thrust/device_vector.h>
    #include <thrust/host_vector.h>

    // Parenthesized so the shift stays one unit wherever N is expanded.
    #define N (32<<22)

    int main(void)
    {
        using namespace std;
        cout<<"GPU bandwidth test via thrust, data size: "<< (sizeof(double)*N) / 1000000000.0 <<" Gbytes"<<endl;
        cout<<"============program start=========="<<endl;

        // Note: time(0) only counts whole seconds.
        int now = time(0);
        cout<<"Initializing h_vec...";
        thrust::host_vector<double> h_vec(N,0.0);
        cout<<"time spent: "<<time(0)-now<<"secs"<<endl;

        // Host to device: constructing d_vec allocates device memory and copies h_vec.
        now = time(0);
        cout<<"Uploading data to GPU...";
        thrust::device_vector<double> d_vec = h_vec;
        cout<<"time spent: "<<time(0)-now<<"secs"<<endl;

        // Device to host.
        now = time(0);
        cout<<"Downloading data to h_vec...";
        thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
        cout<<"time spent: "<<time(0)-now<<"secs"<<endl<<endl;

        system("PAUSE");
        return 0;
    }
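Program output:
Download speed: under 1 second, which is plausible compared with the nominal 11 GB/s.
Upload speed: 1.07374 GB / 32 s is about 33.5 MB/s, which makes no sense at all.
Does anyone know the reason? Or is this just the way thrust works?
Thanks!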
Answer 0 (score: 9)
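Your comparison has several flaws, some of which are covered in the comments.
bandwidthTest uses PINNED memory allocation, which thrust does not use, so the thrust data transfer rate will be slower. This typically accounts for about a factor of 2 (i.e. pinned memory transfers are typically about 2x faster than pageable memory transfers). If you want a better comparison with bandwidthTest, run it with the --memory=pageable switch. Here is code that does proper timing: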
$ cat t213.cu
    #include <iostream>
    #include <thrust/device_vector.h>
    #include <thrust/host_vector.h>
    #include <thrust/copy.h>
    #include <thrust/fill.h>

    // Number of int elements per transfer (32 * 2^20).
    #define DSIZE ((1UL<<20)*32)

    int main(){
        // Allocate both buffers up front so allocation is not part of the timing.
        thrust::device_vector<int> d_data(DSIZE);
        thrust::host_vector<int> h_data(DSIZE);
        float et;  // elapsed time in milliseconds
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        // Warm-up transfers in both directions (not timed).
        thrust::fill(h_data.begin(), h_data.end(), 1);
        thrust::copy(h_data.begin(), h_data.end(), d_data.begin());
        std::cout<< "warm up iteration " << d_data[0] << std::endl;
        thrust::fill(d_data.begin(), d_data.end(), 2);
        thrust::copy(d_data.begin(), d_data.end(), h_data.begin());
        std::cout<< "warm up iteration " << h_data[0] << std::endl;

        // Timed host-to-device copy.
        thrust::fill(h_data.begin(), h_data.end(), 3);
        cudaEventRecord(start);
        thrust::copy(h_data.begin(), h_data.end(), d_data.begin());
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&et, start, stop);
        std::cout<<"host to device iteration " << d_data[0] << " elapsed time: " << (et/(float)1000) << std::endl;
        std::cout<<"apparent bandwidth: " << (((DSIZE*sizeof(int))/(et/(float)1000))/((float)1048576)) << " MB/s" << std::endl;

        // Timed device-to-host copy.
        thrust::fill(d_data.begin(), d_data.end(), 4);
        cudaEventRecord(start);
        thrust::copy(d_data.begin(), d_data.end(), h_data.begin());
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&et, start, stop);
        std::cout<<"device to host iteration " << h_data[0] << " elapsed time: " << (et/(float)1000) << std::endl;
        std::cout<<"apparent bandwidth: " << (((DSIZE*sizeof(int))/(et/(float)1000))/((float)1048576)) << " MB/s" << std::endl;

        std::cout << "finished" << std::endl;
        return 0;
    }
I compile it like this (I have a PCIE Gen2 system with a cc2.0 device):
$ nvcc -O3 -arch=sm_20 -o t213 t213.cu
When I run it, I get the following results:
$ ./t213
warm up iteration 1
warm up iteration 2
host to device iteration 3 elapsed time: 0.0476644
apparent bandwidth: 2685.44 MB/s
device to host iteration 4 elapsed time: 0.0500736
apparent bandwidth: 2556.24 MB/s
finished
$
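This looks correct to me, because bandwidthTest on my system reports about 6 GB/s in either direction, as I have a PCIE Gen2 system. Since thrust uses pageable rather than pinned memory, I expect about half that bandwidth, i.e. 3 GB/s, and thrust reports about 2.5 GB/s.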
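To make the arithmetic behind the "apparent bandwidth" lines explicit, here is a minimal host-only sketch of the same calculation. The helper names apparent_bandwidth_mb_per_s and test_transfer_bytes are hypothetical, introduced only for illustration, and the sketch assumes a 4-byte int and the same 1048576-byte "MB" divisor that the test program uses.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of Thrust or CUDA): bytes moved per second,
// expressed in MB/s with 1 MB = 1048576 bytes, as in the test program.
inline double apparent_bandwidth_mb_per_s(std::size_t bytes, double seconds) {
    assert(seconds > 0.0);  // a zero or negative duration has no meaningful rate
    return (static_cast<double>(bytes) / seconds) / 1048576.0;
}

// Transfer size of the test program: DSIZE = 32 * 2^20 elements of type int.
// With a 4-byte int (an assumption) this is 134217728 bytes, i.e. 128 MB.
inline std::size_t test_transfer_bytes() {
    return (static_cast<std::size_t>(1) << 20) * 32 * sizeof(int);
}
```

Feeding in the elapsed times printed above (0.0476644 s and 0.0500736 s) should give roughly the 2685 MB/s and 2556 MB/s figures that the program reports, both a little under the 3 GB/s expected for pageable transfers on this system.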
For comparison, here is bandwidthTest on my system, using pageable memory:
$ /usr/local/cuda/samples/bin/linux/release/bandwidthTest --memory=pageable
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Quadro 5000
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     2718.2

 Device to Host Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     2428.2

 Device to Device Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     99219.1
$
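A note on the timer itself: the program in the question measures each phase with time(0), which counts whole seconds, so any phase that finishes in a fraction of a second cannot be turned into a meaningful bandwidth. The program above uses cudaEvents instead. As a host-only illustration of sub-second timing, here is a minimal sketch built on std::chrono::steady_clock. The names elapsed_seconds and fake_transfer are hypothetical, the sleep is only a stand-in for a copy, and the sketch assumes the timed call blocks until the work has actually finished.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical helper (not part of Thrust or CUDA): runs `work` once and
// returns the elapsed wall-clock time in seconds as a double.
template <typename Work>
double elapsed_seconds(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();  // assumed to block until the operation has completed
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Stand-in workload for illustration only: sleeps for the given number of
// milliseconds instead of copying any data.
inline void fake_transfer(int milliseconds) {
    assert(milliseconds >= 0);
    std::this_thread::sleep_for(std::chrono::milliseconds(milliseconds));
}
```

Calling elapsed_seconds([] { fake_transfer(50); }) returns a value of about 0.05, whereas a whole-second counter would report 0 for the same interval.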