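My model is about 2.4GB. At inference time I want to load the model on each GPU using multiple processes; that is, I run two processes per GPU, and each process loads its own copy of the model. After configuring each session, each session gets about 5GB of memory, but I still get "from device: CUDA_ERROR_OUT_OF_MEMORY". I would like to understand why. Any help is appreciated.
GPU info:
[search@qrwt01 /home/s/apps/qtfserverd/bin]$ nvidia-smi
Thu Sep 14 21:42:48 2017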
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:08:00.0     Off |                    0 |
| N/A   48C    P0    61W / 149W |  11366MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000:09:00.0     Off |                    0 |
| N/A   32C    P0    72W / 149W |  11359MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     33056    C   ...ome/s/apps/qtfserverd/etc/qtfserverd.conf  5823MiB |
|    0     33057    C   ...ome/s/apps/qtfserverd/etc/qtfserverd.conf  5515MiB |
|    1     33058    C   ...ome/s/apps/qtfserverd/etc/qtfserverd.conf  5823MiB |
|    1     33059    C   ...ome/s/apps/qtfserverd/etc/qtfserverd.conf  5516MiB |
+-----------------------------------------------------------------------------+
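The figures in the nvidia-smi output can be tallied directly. The sketch below is plain arithmetic on the numbers reported for GPU 0; the struct and function names are illustrative only and do not come from the original post or from any library.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Memory figures for one GPU, in MiB, as reported by nvidia-smi.
struct GpuMemoryReport {
    std::int64_t total_mib;                 // "11439MiB" in the Memory-Usage column
    std::int64_t used_mib;                  // "11366MiB" in the Memory-Usage column
    std::vector<std::int64_t> process_mib;  // one entry per row of the Processes table
};

// Memory on the device that is not yet in use.
std::int64_t free_mib(const GpuMemoryReport& r) {
    assert(r.used_mib <= r.total_mib);
    return r.total_mib - r.used_mib;
}

// Sum of the per-process usage rows.
std::int64_t claimed_by_processes_mib(const GpuMemoryReport& r) {
    return std::accumulate(r.process_mib.begin(), r.process_mib.end(),
                           std::int64_t{0});
}

// The values shown above for GPU 0 (PIDs 33056 and 33057).
GpuMemoryReport gpu0_report() {
    return {11439, 11366, {5823, 5515}};
}
```

For GPU 0 this gives 73 MiB free, with 11338 MiB attributed to the two processes. A request of 439.82M, like those in the error log, cannot fit in what is left.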
Session configuration:
void* create_session(void* graph, std::string& checkpoint_path,
                     int intra_op_threads, int inter_op_threads,
                     std::string& device_list) {
    Session* session = NULL;
    SessionOptions sess_opts;
    //int NUM_THREADS = 8;
    // Override the thread-pool sizes only when positive values are passed in.
    if (intra_op_threads > 0) {
        sess_opts.config.set_intra_op_parallelism_threads(intra_op_threads);
    }
    if (inter_op_threads > 0) {
        sess_opts.config.set_inter_op_parallelism_threads(inter_op_threads);
    }
    sess_opts.config.set_allow_soft_placement(true);
    // Restrict this process to the GPUs in device_list and limit it to
    // half of the GPU memory, growing the allocation on demand.
    sess_opts.config.mutable_gpu_options()->set_visible_device_list(device_list);
    sess_opts.config.mutable_gpu_options()->set_allocator_type("BFC");
    sess_opts.config.mutable_gpu_options()->set_per_process_gpu_memory_fraction(0.5);
    sess_opts.config.mutable_gpu_options()->set_allow_growth(true);
    Status status = NewSession(sess_opts, &session);
    if (!status.ok()) {
        fprintf(stderr, "Create Session Failed %s\n", status.ToString().c_str());
        return NULL;
    }
    // ... (the rest of the function is not shown)
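To see what a fraction of 0.5 per process implies on an 11439 MiB device, the budget can be worked out by hand. The following is a minimal sketch, assuming the fraction is applied to the device's total memory and that each process also needs some memory outside its pool (for example for the CUDA context). `fraction_per_process` and `pool_mib` are hypothetical helpers, not TensorFlow functions, and the reserve size is a placeholder rather than a measured value.

```cpp
#include <cassert>

// Pool size in MiB that a given fraction corresponds to.
// Assumption: the fraction is taken of the device's total memory.
double pool_mib(double total_mib, double fraction) {
    return total_mib * fraction;
}

// Hypothetical helper: picks a per-process fraction so that
// `num_processes` pools plus a per-process reserve fit on one GPU.
// Returns 0.0 when the reserve alone already exceeds the device.
double fraction_per_process(double total_mib, double reserve_mib_per_process,
                            int num_processes) {
    assert(num_processes > 0);
    assert(total_mib > 0.0);
    const double usable = total_mib - reserve_mib_per_process * num_processes;
    if (usable <= 0.0) {
        return 0.0;
    }
    return usable / (total_mib * num_processes);
}
```

With a fraction of 0.5, `pool_mib(11439, 0.5)` is 5719.5 MiB, so two pools add up to the whole device and nothing is left for per-process overhead. The 5823MiB shown for the first process is about 100 MiB above that pool size, which would be consistent with such overhead. With an assumed reserve of 300 MiB per process, `fraction_per_process(11439, 300, 2)` comes out a little above 0.47, and that value is what would be passed to `set_per_process_gpu_memory_fraction` instead of 0.5.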
Error message:
load graph /home/search/tensorflow/deploy_combine.model.meta to /gpu:1 success
2017-09-14 21:42:31.188212: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:09:00.0
totalMemory: 11.17GiB freeMemory: 11.05GiB
2017-09-14 21:42:31.188260: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1055] Creating TensorFlow device (/device:GPU:0) -> (device: 1, name: Tesla K80, pci bus id: 0000:09:00.0, compute capability: 3.7)
qss_switch:1,lstm_switch:1
qss_switch:1,lstm_switch:1
2017-09-14 21:42:33.826598: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 1.58G (1701773312 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2017-09-14 21:42:33.838694: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 1.43G (1531596032 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2017-09-14 21:42:33.893832: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 439.82M (461180672 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2017-09-14 21:42:33.903917: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 439.82M (461180672 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2017-09-14 21:42:33.913843: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 439.82M (461180672 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2017-09-14 21:42:33.924008: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 439.82M (461180672 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2017-09-14 21:42:33.935385: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 439.82M (461180672 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2017-09-14 21:42:33.946556: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 439.82M (461180672 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2017-09-14 21:42:33.956340: E tensorflow/stream_executor/cuda/cuda_driver.
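The first two failed sizes in the log are related: 1531596032 is 90% of 1701773312, rounded up to a multiple of 256 bytes. That would be consistent with an allocator that retries with a smaller request after a failure, although this reading is an inference from the numbers and not something the log states. The sketch below only checks the arithmetic; the function names and the back-off rule itself are assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Rounds `bytes` up to the next multiple of `alignment`.
std::uint64_t round_up(std::uint64_t bytes, std::uint64_t alignment) {
    assert(alignment > 0);
    return (bytes + alignment - 1) / alignment * alignment;
}

// Assumed back-off rule: retry with 90% of the failed request,
// rounded up to a 256-byte boundary.
std::uint64_t next_attempt(std::uint64_t failed_bytes) {
    return round_up(failed_bytes * 9 / 10, 256);
}

// Converts a byte count to MiB, the unit the log abbreviates as "M".
double to_mib(std::uint64_t bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}
```

Applying `next_attempt` to 1701773312 reproduces the second size in the log, and `to_mib(461180672)` is roughly 439.82, matching the repeated "439.82M" requests.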
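Answer 0 (score: 0)
Try reducing the size of the operation's parameters, or run the computation in batches; the error indicates that the GPU has run out of memory.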
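Running the computation in batches could look like the following. This is a minimal sketch of the splitting step only, assuming the inputs can be indexed from 0 to `total`; `batch_ranges` is a hypothetical helper, and how each range is fed to the session depends on the model.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Splits the index range [0, total) into consecutive [begin, end) ranges
// that each cover at most `batch_size` items.
std::vector<std::pair<std::size_t, std::size_t>> batch_ranges(
        std::size_t total, std::size_t batch_size) {
    assert(batch_size > 0);
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    for (std::size_t begin = 0; begin < total; begin += batch_size) {
        ranges.emplace_back(begin, std::min(begin + batch_size, total));
    }
    return ranges;
}
```

Each returned range would then be processed in its own call, so that only one batch of inputs and intermediate tensors has to fit in GPU memory at a time. A smaller `batch_size` lowers peak memory at the cost of more calls.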