当我使用TensorFlow训练VGG16 NN和GPU时,它总是显示CUDA_ERROR_OUT_OF_MEMORY
并始终以tensorflow.python.framework.errors_impl.InternalError: Dst tensor is not initialized.
我用这些消息搜索了互联网并得到了一些提示:
config.gpu_options.allow_growth
设为True
。config.gpu_options.per_process_gpu_memory_fraction
设置为较小的分数,例如0.6
。batch size
。但是这些提示不起作用,这个过程就像没有改变一样。
这是我的硬件:
我使用nvidia-smi
监控GPU的使用情况,以下是详细信息。
跑步前:
Thu Apr 19 14:21:59 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 388.31 Driver Version: 388.31 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1060 WDDM | 00000000:01:00.0 On | N/A |
| N/A 50C P8 7W / N/A | 587MiB / 3072MiB | 2% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 7300 C+G ...osoft Office\root\Office16\POWERPNT.EXE N/A |
| 0 8244 C+G ...6)\Youdao\YoudaoNote\YNoteCefRender.exe N/A |
| 0 9988 C+G C:\Windows\explorer.exe N/A |
| 0 10696 C+G ...t_cw5n1h2txyewy\ShellExperienceHost.exe N/A |
| 0 10808 C+G ...dows.Cortana_cw5n1h2txyewy\SearchUI.exe N/A |
| 0 11024 C+G Insufficient Permissions N/A |
| 0 11092 C+G C:\Windows\System32\mstsc.exe N/A |
| 0 13076 C+G ...ogram Files (x86)\Skype\Phone\Skype.exe N/A |
| 0 14664 C+G ...osoft Office\root\Office16\POWERPNT.EXE N/A |
+-----------------------------------------------------------------------------+
流程开始:
Thu Apr 19 14:24:23 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 388.31 Driver Version: 388.31 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1060 WDDM | 00000000:01:00.0 On | N/A |
| N/A 48C P2 28W / N/A | 1133MiB / 3072MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 7300 C+G ...osoft Office\root\Office16\POWERPNT.EXE N/A |
| 0 9988 C+G C:\Windows\explorer.exe N/A |
| 0 10696 C+G ...t_cw5n1h2txyewy\ShellExperienceHost.exe N/A |
| 0 10808 C+G ...dows.Cortana_cw5n1h2txyewy\SearchUI.exe N/A |
| 0 11024 C+G Insufficient Permissions N/A |
| 0 11092 C+G C:\Windows\System32\mstsc.exe N/A |
| 0 13076 C+G ...ogram Files (x86)\Skype\Phone\Skype.exe N/A |
| 0 14404 C ...ools\Anaconda3\envs\py36_tfg\python.exe N/A |
| 0 14664 C+G ...osoft Office\root\Office16\POWERPNT.EXE N/A |
+-----------------------------------------------------------------------------+
经过10个步骤:
Thu Apr 19 14:30:40 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 388.31 Driver Version: 388.31 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1060 WDDM | 00000000:01:00.0 On | N/A |
| N/A 64C P2 31W / N/A | 2595MiB / 3072MiB | 1% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 7300 C+G ...osoft Office\root\Office16\POWERPNT.EXE N/A |
| 0 9988 C+G C:\Windows\explorer.exe N/A |
| 0 10696 C+G ...t_cw5n1h2txyewy\ShellExperienceHost.exe N/A |
| 0 10808 C+G ...dows.Cortana_cw5n1h2txyewy\SearchUI.exe N/A |
| 0 11024 C+G Insufficient Permissions N/A |
| 0 11092 C+G C:\Windows\System32\mstsc.exe N/A |
| 0 13076 C+G ...ogram Files (x86)\Skype\Phone\Skype.exe N/A |
| 0 14404 C ...ools\Anaconda3\envs\py36_tfg\python.exe N/A |
| 0 14664 C+G ...osoft Office\root\Office16\POWERPNT.EXE N/A |
+-----------------------------------------------------------------------------+
经过60个步骤: 一些消息显示,但仍然可以运行
2018-04-19 14:33:56.384528: E c:\l\work\tensorflow-1.1.0\tensorflow\stream_executor\cuda\cuda_driver.cc:924] failed to alloc 2147483648 bytes on host: CUDA_ERROR_OUT_OF_MEMORY
2018-04-19 14:33:56.423080: E c:\l\work\tensorflow-1.1.0\tensorflow\stream_executor\cuda\cuda_driver.cc:924] failed to alloc 1932735232 bytes on host: CUDA_ERROR_OUT_OF_MEMORY
2018-04-19 14:33:56.474281: E c:\l\work\tensorflow-1.1.0\tensorflow\stream_executor\cuda\cuda_driver.cc:924] failed to alloc 1739461632 bytes on host: CUDA_ERROR_OUT_OF_MEMORY
Thu Apr 19 14:36:13 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 388.31 Driver Version: 388.31 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1060 WDDM | 00000000:01:00.0 On | N/A |
| N/A 63C P2 33W / N/A | 2602MiB / 3072MiB | 43% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 7300 C+G ...osoft Office\root\Office16\POWERPNT.EXE N/A |
| 0 9988 C+G C:\Windows\explorer.exe N/A |
| 0 10696 C+G ...t_cw5n1h2txyewy\ShellExperienceHost.exe N/A |
| 0 10808 C+G ...dows.Cortana_cw5n1h2txyewy\SearchUI.exe N/A |
| 0 11024 C+G Insufficient Permissions N/A |
| 0 11092 C+G C:\Windows\System32\mstsc.exe N/A |
| 0 13076 C+G ...ogram Files (x86)\Skype\Phone\Skype.exe N/A |
| 0 14404 C ...ools\Anaconda3\envs\py36_tfg\python.exe N/A |
| 0 14664 C+G ...osoft Office\root\Office16\POWERPNT.EXE N/A |
+-----------------------------------------------------------------------------+
经过170步:
显示大约八百行消息,然后该过程停止并显示错误
大约八百行:
2018-04-19 14:49:35.688274: E c:\l\work\tensorflow-1.1.0\tensorflow\stream_executor\cuda\cuda_driver.cc:924] failed to alloc 4294967296 bytes on host: CUDA_ERROR_OUT_OF_MEMORY
停止了一些错误:
Traceback (most recent call last):
File "C:\DevTools\Anaconda3\envs\py36_tfg\lib\site-packages\tensorflow\python\client\session.py", line 1039, in _do_call
return fn(*args)
File "C:\DevTools\Anaconda3\envs\py36_tfg\lib\site-packages\tensorflow\python\client\session.py", line 1021, in _run_fn
status, run_metadata)
File "C:\DevTools\Anaconda3\envs\py36_tfg\lib\contextlib.py", line 88, in __exit__
next(self.gen)
File "C:\DevTools\Anaconda3\envs\py36_tfg\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: Dst tensor is not initialized.
[[Node: input/input/div/_79 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_111_input/input/div", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "vgg16_train_and_test.py", line 212, in <module>
train()
File "vgg16_train_and_test.py", line 124, in train
coord.join(threads)
File "C:\DevTools\Anaconda3\envs\py36_tfg\lib\site-packages\tensorflow\python\training\coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "C:\DevTools\Anaconda3\envs\py36_tfg\lib\site-packages\six.py", line 693, in reraise
raise value
File "C:\DevTools\Anaconda3\envs\py36_tfg\lib\site-packages\tensorflow\python\training\queue_runner_impl.py", line 234, in _run
sess.run(enqueue_op)
File "C:\DevTools\Anaconda3\envs\py36_tfg\lib\site-packages\tensorflow\python\client\session.py", line 778, in run
run_metadata_ptr)
File "C:\DevTools\Anaconda3\envs\py36_tfg\lib\site-packages\tensorflow\python\client\session.py", line 982, in _run
feed_dict_string, options, run_metadata)
File "C:\DevTools\Anaconda3\envs\py36_tfg\lib\site-packages\tensorflow\python\client\session.py", line 1032, in _do_run
target_list, options, run_metadata)
File "C:\DevTools\Anaconda3\envs\py36_tfg\lib\site-packages\tensorflow\python\client\session.py", line 1052, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Dst tensor is not initialized.
[[Node: input/input/div/_79 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_111_input/input/div", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]