
TensorRT spends a lot of time copying data

Time: 2020-04-07 10:43:10

Tags: tensorrt

Description

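I converted a ResNet v1_50 model to TensorRT and run it at INT8 precision. With a batch size of 16, inference alone costs 10.73 ms per batch, but once copyInputToDevice() and copyOutputToHost() are added the cost rises to 14.88 ms per batch, whereas the TF-TRT model costs 13.08 ms per batch (with data transfer handled internally). I also tried copyInputToDeviceAsync() and copyOutputToHostAsync() and ran the model with context->enqueue(), but the time cost did not go down. Is there any way to reduce the data transfer time? Thanks a lot!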
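To put the numbers above side by side, here is a small arithmetic sketch. The three timings are the ones reported in the question; the buffer sizes are an assumption (FP32 host buffers, 224x224x3 input and 1000 output classes per image), since the post does not state them. The names overheadMs, effectiveGBps and printTransferEstimate are made up for illustration.

```cpp
#include <cstddef>
#include <cstdio>

// Per-batch timings reported in the question (milliseconds, batch size 16).
constexpr double kInferOnlyMs = 10.73;   // TensorRT INT8, inference only
constexpr double kWithCopiesMs = 14.88;  // plus copyInputToDevice()/copyOutputToHost()
constexpr double kTfTrtMs = 13.08;       // TF-TRT, transfers handled internally

// Time added on top of inference alone.
constexpr double overheadMs(double totalMs, double inferMs) {
    return totalMs - inferMs;
}

// Effective throughput in GB/s when `bytes` are moved in `ms` milliseconds.
constexpr double effectiveGBps(std::size_t bytes, double ms) {
    return static_cast<double>(bytes) / (ms * 1e-3) / 1e9;
}

// ASSUMPTION (not stated in the question): FP32 buffers, 224x224x3 input
// and 1000 output classes per image.
constexpr std::size_t kBatch = 16;
constexpr std::size_t kInputBytes = kBatch * 224 * 224 * 3 * sizeof(float);
constexpr std::size_t kOutputBytes = kBatch * 1000 * sizeof(float);

// Prints the overhead of the explicit copies and the throughput it implies
// if the whole overhead is attributed to the two copies.
void printTransferEstimate() {
    const double copyMs = overheadMs(kWithCopiesMs, kInferOnlyMs);
    const double tfTrtMs = overheadMs(kTfTrtMs, kInferOnlyMs);
    std::printf("explicit copies add %.2f ms per batch (%.1f%% of inference)\n",
                copyMs, 100.0 * copyMs / kInferOnlyMs);
    std::printf("TF-TRT is %.2f ms per batch above TensorRT inference alone\n", tfTrtMs);
    std::printf("assumed payload: %zu bytes, implied throughput %.2f GB/s\n",
                kInputBytes + kOutputBytes,
                effectiveGBps(kInputBytes + kOutputBytes, copyMs));
}
```

Calling printTransferEstimate() from main reports the 4.15 ms gap between the two TensorRT measurements and the 2.35 ms gap to TF-TRT. The throughput figure is only as good as the assumed buffer sizes and treats the entire gap as copy time, so it is a rough indicator rather than a measurement.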

Environment
TensorRT version: 7.0
GPU type: T4
Nvidia driver version: 410.79
CUDA version: 10.0
CUDNN version: 7.6.4
Operating system + version: CentOS 7
Python version (if applicable): 2.7
TensorFlow version (if applicable): 1.15

Code

const auto t_start = std::chrono::high_resolution_clock::now();

// Create CUDA stream for the execution of this inference
cudaStream_t stream;
CHECK(cudaStreamCreate(&stream));

// Asynchronously copy data from host input buffers to device input buffers
buffers.copyInputToDeviceAsync(stream);

// Asynchronously enqueue the inference work
if (!context->enqueue(mParams.batchSize, buffers.getDeviceBindings().data(), stream, nullptr))
{
    return false;
}

// Asynchronously copy data from device output buffers to host output buffers
buffers.copyOutputToHostAsync(stream);

// Wait for the work in the stream to complete
cudaStreamSynchronize(stream);

// Release stream
cudaStreamDestroy(stream);

const auto t_end = std::chrono::high_resolution_clock::now();

I would like to know whether the code above can reduce the data transfer time. Is this code correct?
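One thing worth noting about the measurement itself: the timed region in the snippet covers a single run and also includes cudaStreamCreate() and cudaStreamDestroy(), so the reported figure may contain setup cost as well as copy and inference time. A minimal sketch of a timing helper that runs a few untimed warm-up iterations and then averages over several timed ones is shown here. The name meanMillis and its parameters are hypothetical; in the setting of the question, `work` would presumably wrap the async copies, context->enqueue() and cudaStreamSynchronize() on a stream created once outside the helper.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Runs `work` `warmupIters` times without timing it, then returns the mean
// wall-clock time in milliseconds over `timedIters` timed runs.
// `work` must block until its own work has finished (for GPU work, that
// means it has to synchronize before returning).
double meanMillis(const std::function<void()>& work, int warmupIters, int timedIters) {
    assert(timedIters > 0);

    // Warm-up: absorbs one-time costs such as lazy initialization.
    for (int i = 0; i < warmupIters; ++i) {
        work();
    }

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < timedIters; ++i) {
        work();
    }
    const auto end = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::milli> elapsed = end - start;
    return elapsed.count() / timedIters;
}
```

Timing the copy-plus-inference path and the inference-only path with the same helper makes the two figures directly comparable. Note also that work enqueued on a single CUDA stream executes in issue order, so switching to the Async variants on one stream would not by itself be expected to overlap the copies with the inference of the same batch.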

0 answers:

No answers