Question

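I am using a g3.xLarge instance for some image processing work. It ran without problems until last night, when I suddenly started getting the following error on one of the GPUs: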

failed to allocate 155.56M (163119104 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY

I tried clearing the caches with:

sync; echo 1 >/proc/sys/vm/drop_caches
swapoff -a && swapon -a

I tried deleting the ~/.nv cache folder from the machine. I also tried increasing the volume size from 100G to 300G.

None of this solved the problem.

nvidia-smi gives the following output:

Wed May 15 10:03:20 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.130                Driver Version: 384.130                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M60           Off  | 00000000:00:1B.0 Off |                    0 |
| N/A   60C    P0   122W / 150W |   7382MiB /  7613MiB |     85%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M60           Off  | 00000000:00:1C.0 Off |                    0 |
| N/A   61C    P0    40W / 150W |   7238MiB /  7613MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M60           Off  | 00000000:00:1D.0 Off |                    0 |
| N/A   37C    P0    38W / 150W |   7238MiB /  7613MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M60           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   44C    P0    38W / 150W |   7238MiB /  7613MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     17799      C   python3                                     7369MiB |
|    1     17799      C   python3                                     7225MiB |
|    2     17799      C   python3                                     7225MiB |
|    3     17799      C   python3                                     7225MiB |
+-----------------------------------------------------------------------------+
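
To read the process table above programmatically, here is a minimal sketch that groups the reported GPU memory by PID. The `SAMPLE` rows are copied from the output above. The regular expression and the helper name `memory_by_pid` are illustrative assumptions that match the row layout printed by this driver version (384.130) and may need adjusting for other versions.

```python
import re
from collections import defaultdict

# Process rows copied verbatim from the nvidia-smi output above.
SAMPLE = """\
|    0     17799      C   python3                                     7369MiB |
|    1     17799      C   python3                                     7225MiB |
|    2     17799      C   python3                                     7225MiB |
|    3     17799      C   python3                                     7225MiB |
"""

# Assumed row layout: | GPU  PID  Type  Process name  <N>MiB |
ROW = re.compile(r"^\|\s+(\d+)\s+(\d+)\s+(\S+)\s+(.+?)\s+(\d+)MiB\s+\|$")


def memory_by_pid(table_text):
    """Return {pid: {gpu_index: used_mib}} for every process row found."""
    usage = defaultdict(dict)
    for line in table_text.splitlines():
        match = ROW.match(line.strip())
        if match:
            gpu, pid, _type, _name, mib = match.groups()
            usage[int(pid)][int(gpu)] = int(mib)
    return dict(usage)


if __name__ == "__main__":
    for pid, per_gpu in memory_by_pid(SAMPLE).items():
        print(pid, per_gpu, "total MiB:", sum(per_gpu.values()))
```

On the rows above, this shows a single process (PID 17799) holding memory on all four GPUs.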



WARNING:tensorflow:From /home/ubuntu/models/research/object_detection/trainer.py:210: create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
    Instructions for updating:
    Please switch to tf.train.create_global_step
    INFO:tensorflow:Scale of 0 disables regularizer.
    INFO:tensorflow:Scale of 0 disables regularizer.
    INFO:tensorflow:Scale of 0 disables regularizer.
    INFO:tensorflow:depth of additional conv before box predictor: 0
    INFO:tensorflow:Scale of 0 disables regularizer.
    WARNING:tensorflow:From /home/ubuntu/models/research/object_detection/meta_architectures/faster_rcnn_meta_arch.py:1670: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
    Instructions for updating:
    Please switch to tf.train.get_or_create_global_step
    INFO:tensorflow:Summary name /clone_loss is illegal; using clone_loss instead.
    2019-05-15 09:41:04.173814: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
    2019-05-15 09:41:04.344127: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    2019-05-15 09:41:04.344601: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties: 
    name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
    pciBusID: 0000:00:1b.0
    totalMemory: 7.43GiB freeMemory: 155.56MiB
    2019-05-15 09:41:04.485790: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    2019-05-15 09:41:04.486161: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 1 with properties: 
    name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
    pciBusID: 0000:00:1c.0
    totalMemory: 7.43GiB freeMemory: 299.69MiB
    2019-05-15 09:41:04.608502: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    2019-05-15 09:41:04.608899: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 2 with properties: 
    name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
    pciBusID: 0000:00:1d.0
    totalMemory: 7.43GiB freeMemory: 299.69MiB
    2019-05-15 09:41:04.764843: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    2019-05-15 09:41:04.765258: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 3 with properties: 
    name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
    pciBusID: 0000:00:1e.0
    totalMemory: 7.43GiB freeMemory: 299.69MiB
    2019-05-15 09:41:04.765454: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Device peer to peer matrix
    2019-05-15 09:41:04.765611: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1051] DMA: 0 1 2 3 
    2019-05-15 09:41:04.765628: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 0:   Y N N N 
    2019-05-15 09:41:04.765637: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 1:   N Y N N 
    2019-05-15 09:41:04.765645: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 2:   N N Y N 
    2019-05-15 09:41:04.765653: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 3:   N N N Y 
    2019-05-15 09:41:04.765677: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla M60, pci bus id: 0000:00:1b.0, compute capability: 5.2)
    2019-05-15 09:41:04.765688: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: Tesla M60, pci bus id: 0000:00:1c.0, compute capability: 5.2)
    2019-05-15 09:41:04.765698: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:2) -> (device: 2, name: Tesla M60, pci bus id: 0000:00:1d.0, compute capability: 5.2)
    2019-05-15 09:41:04.765707: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:3) -> (device: 3, name: Tesla M60, pci bus id: 0000:00:1e.0, compute capability: 5.2)
    2019-05-15 09:41:04.786085: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 155.56M (163119104 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
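
As a quick check on the numbers in the log above, the following sketch converts the failed request from bytes to MiB so it can be compared with the `freeMemory` value reported for device 0. The helper name `bytes_to_mib` is hypothetical, and the sketch assumes the log's "M" means MiB (1024 * 1024 bytes).

```python
MIB = 1024 * 1024  # bytes per MiB


def bytes_to_mib(n_bytes):
    """Convert a byte count to MiB."""
    return n_bytes / MIB


# Size of the allocation that failed, taken from the last log line.
failed_request_bytes = 163119104

print(bytes_to_mib(failed_request_bytes))  # 155.5625
```

The result rounds to 155.56 MiB, the same figure the log lists as `freeMemory` for device 0 (pciBusID 0000:00:1b.0), so the failed request appears to correspond to all of the memory that was still free on that GPU at startup.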

CUDA_ERROR_OUT_OF_MEMORY on 1 of the 4 GPUs on a g3.xLarge instance

0 answers: