I am running TensorFlow on a Tesla K80 and I am getting an OUT_OF_MEMORY error. Can anyone help me figure this out?
Description
Logs:

1.
2017-03-28 23:34:34.252047: step 830, loss = 0.36 (124.4 examples/sec; 0.402 sec/batch)
2017-03-28 23:34:38.676589: step 840, loss = 0.41 (106.2 examples/sec; 0.471 sec/batch)
2017-03-28 23:34:42.826278: step 850, loss = 0.45 (116.0 examples/sec; 0.431 sec/batch)
2017-03-28 23:34:47.274519: step 860, loss = 0.41 (113.9 examples/sec; 0.439 sec/batch)
2017-03-28 23:34:51.793204: step 870, loss = 0.41 (129.7 examples/sec; 0.385 sec/batch)
2017-03-28 23:34:55.928524: step 880, loss = 0.40 (119.1 examples/sec; 0.420 sec/batch)
2017-03-28 23:35:00.152923: step 890, loss = 0.36 (129.1 examples/sec; 0.387 sec/batch)
2017-03-28 23:35:04.347550: step 900, loss = 0.35 (113.9 examples/sec; 0.439 sec/batch)
2017-03-28 23:35:11.431462: step 910, loss = 0.38 (109.9 examples/sec; 0.455 sec/batch)
2017-03-28 23:35:15.439966: E tensorflow/stream_executor/cuda/cuda_driver.cc:924] failed to alloc 8589934592 bytes on host: CUDA_ERROR_OUT_OF_MEMORY
2017-03-28 23:35:15.440022: W ./tensorflow/core/common_runtime/gpu/pool_allocator.h:195] could not allocate pinned host memory of size: 8589934592
2.
Training from fresh start
2017-03-28 23:37:31.624590: step 0, loss = 0.73 (1.2 examples/sec; 41.462 sec/batch)
2017-03-28 23:37:40.920874: step 10, loss = 0.62 (116.7 examples/sec; 0.428 sec/batch)
2017-03-28 23:37:45.000310: step 20, loss = 0.55 (127.2 examples/sec; 0.393 sec/batch)
2017-03-28 23:37:49.345005: step 30, loss = 0.50 (109.1 examples/sec; 0.458 sec/batch)
2017-03-28 23:37:53.544604: step 40, loss = 0.46 (117.5 examples/sec; 0.426 sec/batch)
2017-03-28 23:37:57.696332: step 50, loss = 0.43 (122.8 examples/sec; 0.407 sec/batch)
2017-03-28 23:38:01.880436: step 60, loss = 0.44 (111.9 examples/sec; 0.447 sec/batch)
2017-03-28 23:38:06.169541: step 70, loss = 0.39 (97.2 examples/sec; 0.514 sec/batch)
2017-03-28 23:38:10.216505: step 80, loss = 0.41 (120.6 examples/sec; 0.414 sec/batch)
2017-03-28 23:38:14.399862: step 90, loss = 0.42 (119.3 examples/sec; 0.419 sec/batch)
2017-03-28 23:38:18.621185: step 100, loss = 0.42 (112.1 examples/sec; 0.446 sec/batch)
2017-03-28 23:38:19.485396: E tensorflow/stream_executor/cuda/cuda_driver.cc:924] failed to alloc 8589934592 bytes on host: CUDA_ERROR_OUT_OF_MEMORY
2017-03-28 23:38:19.485583: W ./tensorflow/core/common_runtime/gpu/pool_allocator.h:195] could not allocate pinned host memory of size: 8589934592
2017-03-28 23:38:19.485729: E tensorflow/stream_executor/cuda/cuda_driver.cc:924] failed to alloc 7730940928 bytes on host: CUDA_ERROR_OUT_OF_MEMORY
2017-03-28 23:38:19.485823: W ./tensorflow/core/common_runtime/gpu/pool_allocator.h:195] could not allocate pinned host memory of size: 7730940928
2017-03-28 23:38:19.485932: E tensorflow/stream_executor/cuda/cuda_driver.cc:924] failed to alloc 6957846528 bytes on host: CUDA_ERROR_OUT_OF_MEMORY
2017-03-28 23:38:19.486023: W ./tensorflow/core/common_runtime/gpu/pool_allocator.h:195] could not allocate pinned host memory of size: 6957846528
2017-03-28 23:38:19.486165: E tensorflow/stream_executor/cuda/cuda_driver.cc:924] failed to alloc 6262061568 bytes on host: CUDA_ERROR_OUT_OF_MEMORY
2017-03-28 23:38:19.486233: W ./tensorflow/core/common_runtime/gpu/pool_allocator.h:195] could not allocate pinned host memory of size: 6262061568
2017-03-28 23:38:19.486324: E tensorflow/stream_executor/cuda/cuda_driver.cc:924] failed to alloc 5635855360 bytes on host: CUDA_ERROR_OUT_OF_MEMORY
2017-03-28 23:38:19.486390: W ./tensorflow/core/common_runtime/gpu/pool_allocator.h:195] could not allocate pinned host memory of size: 5635855360
Killed
3.
Training from fresh start
2017-03-28 23:39:55.459584: step 0, loss = 0.74 (1.2 examples/sec; 40.841 sec/batch)
2017-03-28 23:40:04.598183: step 10, loss = 0.63 (118.7 examples/sec; 0.421 sec/batch)
2017-03-28 23:40:08.637012: step 20, loss = 0.56 (128.6 examples/sec; 0.389 sec/batch)
2017-03-28 23:40:12.954617: step 30, loss = 0.52 (121.4 examples/sec; 0.412 sec/batch)
2017-03-28 23:40:17.101882: step 40, loss = 0.46 (133.6 examples/sec; 0.374 sec/batch)
2017-03-28 23:40:21.288942: step 50, loss = 0.44 (119.3 examples/sec; 0.419 sec/batch)
2017-03-28 23:40:24.083043: E tensorflow/stream_executor/cuda/cuda_driver.cc:924] failed to alloc 8589934592 bytes on host: CUDA_ERROR_OUT_OF_MEMORY
2017-03-28 23:40:24.083350: W ./tensorflow/core/common_runtime/gpu/pool_allocator.h:195] could not allocate pinned host memory of size: 8589934592
2017-03-28 23:40:24.083532: E tensorflow/stream_executor/cuda/cuda_driver.cc:924] failed to alloc 7730940928 bytes on host: CUDA_ERROR_OUT_OF_MEMORY
2017-03-28 23:40:24.083630: W ./tensorflow/core/common_runtime/gpu/pool_allocator.h:195] could not allocate pinned host memory of size: 7730940928
2017-03-28 23:40:24.083769: E tensorflow/stream_executor/cuda/cuda_driver.cc:924] failed to alloc 6957846528 bytes on host: CUDA_ERROR_OUT_OF_MEMORY
2017-03-28 23:40:24.083859: W ./tensorflow/core/common_runtime/gpu/pool_allocator.h:195] could not allocate pinned host memory of size: 6957846528
2017-03-28 23:40:24.084943: E tensorflow/stream_executor/cuda/cuda_driver.cc:924] failed to alloc 6262061568 bytes on host: CUDA_ERROR_OUT_OF_MEMORY
2017-03-28 23:40:24.085062: W ./tensorflow/core/common_runtime/gpu/pool_allocator.h:195] could not allocate pinned host memory of size: 6262061568
2017-03-28 23:40:24.086309: E tensorflow/stream_executor/cuda/cuda_driver.cc:924] failed to alloc 5635855360 bytes on host: CUDA_ERROR_OUT_OF_MEMORY
2017-03-28 23:40:24.086419: W ./tensorflow/core/common_runtime/gpu/pool_allocator.h:195] could not allocate pinned host memory of size: 5635855360
2017-03-28 23:40:24.087551: E tensorflow/stream_executor/cuda/cuda_driver.cc:924] failed to alloc 5072269824 bytes on host: CUDA_ERROR_OUT_OF_MEMORY
2017-03-28 23:40:24.087658: W ./tensorflow/core/common_runtime/gpu/pool_allocator.h:195] could not allocate pinned host memory of size: 5072269824
Killed
I understand that the GPU does not free its memory, which may be why less and less memory is available over time. But how do I solve this? My program is training a deep learning model, and I don't think I can free memory in the middle of training.
PS: I am not sure whether I should configure the program to use at most a fraction of the GPU memory, or set allow_growth == True instead.
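
For reference, here is a minimal sketch of the two options mentioned in the PS, using the TensorFlow 1.x Session API (the 0.4 fraction is just a placeholder value, not something taken from the original post; per_process_gpu_memory_fraction is the standard ConfigProto option for capping memory use):

import tensorflow as tf

# Option A: cap the fraction of each GPU's memory this process may allocate.
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.4  # placeholder fraction

# Option B (alternative): let the allocator grow GPU memory on demand instead
# of reserving it all up front.
# config.gpu_options.allow_growth = True

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    # ... run the training loop here ...

Either option is passed to the Session via its config argument; they only change how the TensorFlow allocator reserves GPU memory, not what the model computes.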