Question

My GPU information is as follows.

+-----------------------------------------------------------------------------+                                      
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |                                      
|-------------------------------+----------------------+----------------------+                                       
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |                                        
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |                                         
|===============================+======================+======================|                                         
|   0  GeForce GTX 750 Ti  Off  | 00000000:01:00.0  On |                  N/A |                                          
| 34%   51C    P0     2W /  38W |   1909MiB /  1993MiB |      0%      Default |                                           
+-------------------------------+----------------------+----------------------+                                           

+-----------------------------------------------------------------------------+                                             
| Processes:                                                       GPU Memory |                                              
|  GPU       PID   Type   Process name                             Usage      |                                                
|=============================================================================|                                                
|    0      3492      C   python                                      1467MiB |                                                
|    0      7875      G   ...yCharm-C/ch-0/193.5233.109/jbr/bin/java     2MiB |                                                 
|    0     30812      G   /usr/lib/xorg/Xorg                           163MiB |                                                  
|    0     31133      G   kwin_x11                                      25MiB |                                                  
|    0     31137      G   /usr/bin/krunner                               1MiB |
|    0     31139      G   /usr/bin/plasmashell                          55MiB |
|    0     31536      G   ...uest-channel-token=13296030830960435903   176MiB |
+-----------------------------------------------------------------------------+

When I run the MNIST tutorial from here: https://www.tensorflow.org/tutorials/quickstart/beginner

I get this error:

2019-12-10 00:27:06.891510: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 115 MB memory) -> physical GPU (device: 0, name: GeForce GTX 750 Ti, pci bus id: 0000:01:00.0, compute capability: 5.0)
2019-12-10 00:27:06.894510: I tensorflow/stream_executor/cuda/cuda_driver.cc:830] failed to allocate 115.56M (121176064 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-12-10 00:27:22.271281: F ./tensorflow/core/kernels/random_op_gpu.h:227] Non-OK-status: GpuLaunchKernel(FillPhiloxRandomKernelLaunch<Distribution>, num_blocks, block_size, 0, d.stream(), gen, data, size, dist) status: Internal: out of memory

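I am using TF 2 on Ubuntu. I have two questions: 1) My Ubuntu machine has 64 GB of RAM and my GPU has 2 GB of memory. When it reports the "out of memory" error, is that because training only uses the GPU's memory and not the 64 GB?

2) How do I fix the out-of-memory error?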

Answer 1

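Yes, training uses GPU memory, because the data is fed to the GPU during training.

The problem is that your graphics card has very little video memory. 2 GB of VRAM is not enough for deep learning.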
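To make the numbers concrete, here is a rough calculation using only figures copied from the nvidia-smi output and the TensorFlow log in the question. It assumes the nvidia-smi snapshot is representative of the GPU state when the allocation failed, which the question does not confirm.

```python
# Figures copied from the question's nvidia-smi output and TensorFlow log.
MIB = 1024 * 1024

total_mib = 1993               # "1909MiB / 1993MiB" in nvidia-smi
used_mib = 1909
requested_bytes = 121176064    # "failed to allocate 115.56M (121176064 bytes)"

# GPU memory per process (PID -> MiB), from the "Processes" table.
per_process_mib = {
    3492: 1467,   # python
    7875: 2,
    30812: 163,   # Xorg
    31133: 25,    # kwin_x11
    31137: 1,     # krunner
    31139: 55,    # plasmashell
    31536: 176,
}

free_mib = total_mib - used_mib
requested_mib = requested_bytes / MIB
largest_pid = max(per_process_mib, key=per_process_mib.get)

print(f"free on GPU:       {free_mib} MiB")
print(f"requested by TF:   {requested_mib:.2f} MiB")
print(f"sum of processes:  {sum(per_process_mib.values())} MiB")
print(f"largest consumer:  PID {largest_pid} ({per_process_mib[largest_pid]} MiB)")
print("allocation fits:", requested_mib <= free_mib)
```

Under that assumption only 84 MiB of the card is free, while TensorFlow asks for about 115.56 MiB, so the request cannot be satisfied no matter how much system RAM the machine has.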

I suggest using a graphics card with at least 6 GB of VRAM.

If you cannot switch to better hardware, you can use a GPU through AWS (Amazon Web Services) or Google Colab instead.

Answer 2

One way to work around this is to not use the GPU at all. Training will be slow, but it will at least work.
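A minimal sketch of how to run on the CPU only, assuming TensorFlow 2.1 or later (where `tf.config.list_physical_devices` is available). It hides every CUDA device from the process through the `CUDA_VISIBLE_DEVICES` environment variable, which has to be set before TensorFlow is imported. The `try`/`except` around the import is only there so the snippet also runs on a machine without TensorFlow installed.

```python
import os

# Hide all CUDA devices from this process. This must happen before
# TensorFlow is imported, otherwise the GPU may already be initialized.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

try:
    import tensorflow as tf
except ImportError:
    tf = None  # TensorFlow is not installed in this environment

if tf is not None:
    # With no visible GPU, TensorFlow places ops on the CPU and
    # allocates tensors in system RAM instead of VRAM.
    gpus = tf.config.list_physical_devices("GPU")
    print("GPUs visible to TensorFlow:", gpus)
else:
    gpus = []
    print("TensorFlow not available; only the environment variable was set.")
```

Put these lines at the very top of the training script, before any other TensorFlow import, and then run the tutorial code unchanged.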

CUDA_ERROR_OUT_OF_MEMORY: GPU out of memory
