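I am using a g3.xLarge instance for some image processing work. It ran without problems until last night, when I suddenly started getting the following error on one of the GPUs: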
failed to allocate 155.56M (163119104 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
I tried clearing the cache with:
sync; echo 1 >/proc/sys/vm/drop_caches
swapoff -a && swapon -a
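I also tried deleting the ~/.nv cache folder from the machine, and I tried increasing the volume size from 100G to 300G.
Neither resolved the problem.
nvidia-smi gives the following output: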
Wed May 15 10:03:20 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.130 Driver Version: 384.130 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla M60 Off | 00000000:00:1B.0 Off | 0 |
| N/A 60C P0 122W / 150W | 7382MiB / 7613MiB | 85% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla M60 Off | 00000000:00:1C.0 Off | 0 |
| N/A 61C P0 40W / 150W | 7238MiB / 7613MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla M60 Off | 00000000:00:1D.0 Off | 0 |
| N/A 37C P0 38W / 150W | 7238MiB / 7613MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla M60 Off | 00000000:00:1E.0 Off | 0 |
| N/A 44C P0 38W / 150W | 7238MiB / 7613MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 17799 C python3 7369MiB |
| 1 17799 C python3 7225MiB |
| 2 17799 C python3 7225MiB |
| 3 17799 C python3 7225MiB |
+-----------------------------------------------------------------------------+
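The process table above can be summarized per PID to see who is holding GPU memory. The following is a minimal sketch, assuming the row layout shown in this output; the helper name `gpu_memory_by_pid` and the regular expression are illustrative, and the code only parses pasted text rather than calling nvidia-smi itself.

```python
import re
from collections import defaultdict

# Matches process rows such as "| 0 17799 C python3 7369MiB |".
# Assumption: columns are GPU index, PID, type, process name, usage in MiB.
ROW = re.compile(r"^\|\s+(\d+)\s+(\d+)\s+(\S+)\s+(.+?)\s+(\d+)MiB\s+\|$")


def gpu_memory_by_pid(smi_text):
    """Return {pid: {gpu_index: used_mib}} parsed from nvidia-smi process rows."""
    usage = defaultdict(dict)
    for line in smi_text.splitlines():
        match = ROW.match(line.strip())
        if match:
            gpu, pid, _ptype, _name, mib = match.groups()
            usage[int(pid)][int(gpu)] = int(mib)
    return {pid: dict(gpus) for pid, gpus in usage.items()}


# Rows copied from the process table in the output above.
sample = """\
| 0 17799 C python3 7369MiB |
| 1 17799 C python3 7225MiB |
| 2 17799 C python3 7225MiB |
| 3 17799 C python3 7225MiB |
"""

for pid, gpus in gpu_memory_by_pid(sample).items():
    print(pid, gpus, "total MiB:", sum(gpus.values()))
```

Applied to the table above, this shows a single python3 process (PID 17799) holding memory on all four GPUs.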
WARNING:tensorflow:From /home/ubuntu/models/research/object_detection/trainer.py:210: create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.create_global_step
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:Scale of 0 disables regularizer.
WARNING:tensorflow:From /home/ubuntu/models/research/object_detection/meta_architectures/faster_rcnn_meta_arch.py:1670: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
INFO:tensorflow:Summary name /clone_loss is illegal; using clone_loss instead.
2019-05-15 09:41:04.173814: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-05-15 09:41:04.344127: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-05-15 09:41:04.344601: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:00:1b.0
totalMemory: 7.43GiB freeMemory: 155.56MiB
2019-05-15 09:41:04.485790: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-05-15 09:41:04.486161: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 1 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:00:1c.0
totalMemory: 7.43GiB freeMemory: 299.69MiB
2019-05-15 09:41:04.608502: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-05-15 09:41:04.608899: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 2 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:00:1d.0
totalMemory: 7.43GiB freeMemory: 299.69MiB
2019-05-15 09:41:04.764843: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-05-15 09:41:04.765258: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 3 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:00:1e.0
totalMemory: 7.43GiB freeMemory: 299.69MiB
2019-05-15 09:41:04.765454: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Device peer to peer matrix
2019-05-15 09:41:04.765611: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1051] DMA: 0 1 2 3
2019-05-15 09:41:04.765628: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 0: Y N N N
2019-05-15 09:41:04.765637: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 1: N Y N N
2019-05-15 09:41:04.765645: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 2: N N Y N
2019-05-15 09:41:04.765653: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 3: N N N Y
2019-05-15 09:41:04.765677: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla M60, pci bus id: 0000:00:1b.0, compute capability: 5.2)
2019-05-15 09:41:04.765688: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: Tesla M60, pci bus id: 0000:00:1c.0, compute capability: 5.2)
2019-05-15 09:41:04.765698: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:2) -> (device: 2, name: Tesla M60, pci bus id: 0000:00:1d.0, compute capability: 5.2)
2019-05-15 09:41:04.765707: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:3) -> (device: 3, name: Tesla M60, pci bus id: 0000:00:1e.0, compute capability: 5.2)
2019-05-15 09:41:04.786085: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 155.56M (163119104 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
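For reference, the byte count in the error line can be converted to MiB and compared with the freeMemory value that the log reports for device 0. A minimal sketch of the arithmetic; the helper name `bytes_to_mib` is illustrative.

```python
def bytes_to_mib(n_bytes):
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return n_bytes / (1024 * 1024)


# Byte count taken from the CUDA_ERROR_OUT_OF_MEMORY line above.
requested_bytes = 163119104

print(f"{bytes_to_mib(requested_bytes):.2f} MiB")
```

The result is the same figure as the "freeMemory: 155.56MiB" reported for device 0 earlier in the log.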