Almost no free memory on a 1080 Ti when running tensorflow-gpu

Time: 2018-10-01 02:08:00

Tags: tensorflow cuda

I am testing a recently purchased ASUS ROG STRIX 1080 Ti (11 GB) card with the simple test Python program (matmul.py) from https://learningtensorflow.com/lesson10/. The virtual environment (venv) is set up as follows: Ubuntu 16.04, tensorflow-gpu==1.5.0, Python 3.6.6, CUDA 9.0, cuDNN 7.2.1.

  

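A CUDA_ERROR_OUT_OF_MEMORY error occurred.

The strangest part is this: totalMemory: 10.91GiB freeMemory: 61.44MiB

I am not sure whether this is caused by the environment setup or by the 1080 Ti itself. I would appreciate any suggestions.

Terminal output: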

(venv) xx@xxxxxx:~/xx$ python matmul.py gpu 1500
2018-10-01 09:05:12.459203: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-10-01 09:05:12.514203: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:895] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-10-01 09:05:12.514445: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.607
pciBusID: 0000:01:00.0
totalMemory: 10.91GiB freeMemory: 61.44MiB
2018-10-01 09:05:12.514471: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-10-01 09:05:12.651207: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 11.44M (11993088 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
......

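3 answers:

Answer 0 (score: 1)

A Python process can get stuck on the GPU and keep holding its memory. Always check the running processes with nvidia-smi and kill them manually if necessary.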
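As a minimal sketch of that check, the script below asks nvidia-smi for the compute processes currently holding GPU memory. It assumes nvidia-smi is on the PATH and uses its `--query-compute-apps` CSV output. The helper names `parse_compute_apps` and `list_gpu_processes` are made up for this illustration.

```python
import subprocess

# nvidia-smi query for processes that currently hold GPU memory.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Parse nvidia-smi CSV rows into (pid, name, used_mib) tuples.

    used_mib is None when the driver does not report a number.
    """
    rows = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) < 3:
            continue  # skip rows that do not have all three fields
        pid, used = parts[0], parts[-1]
        name = ",".join(parts[1:-1])
        rows.append((int(pid), name, int(used) if used.isdigit() else None))
    return rows


def list_gpu_processes():
    """Return the processes using the GPU, or [] if nvidia-smi cannot be run."""
    try:
        out = subprocess.check_output(QUERY, universal_newlines=True)
    except (OSError, subprocess.CalledProcessError):
        return []
    return parse_compute_apps(out)


if __name__ == "__main__":
    for pid, name, used_mib in list_gpu_processes():
        print(pid, name, used_mib, "MiB")
```

If a leftover Python process shows up in the list, it can be stopped with `kill <pid>` in the shell, or with `os.kill(pid, signal.SIGTERM)` from Python.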

Answer 1 (score: 1)

I solved this problem by limiting the memory usage:

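import tensorflow as tf  # TensorFlow 1.x API


def gpu_config(upper=0.8):
    # Build a session config that caps this process's share of GPU memory.
    config = tf.ConfigProto(
        allow_soft_placement=True, log_device_placement=False)
    config.gpu_options.allow_growth = True  # allocate memory on demand
    config.gpu_options.allocator_type = 'BFC'

    config.gpu_options.per_process_gpu_memory_fraction = upper
    print("GPU memory upper bound:", upper)
    return config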

Then you can do:

config = gpu_config()
with tf.Session(config=config) as sess:
    ....
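To put the 0.8 fraction in perspective, here is a rough arithmetic sketch using the totalMemory and freeMemory figures from the log in the question. It assumes the fraction is applied to the card's total memory, and `memory_cap_bytes` is a hypothetical helper written for this illustration, not part of TensorFlow.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def memory_cap_bytes(total_bytes, fraction):
    """Upper bound implied by a memory fraction (assumed to scale total memory)."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return int(total_bytes * fraction)


total = 10.91 * GIB  # totalMemory reported in the question's log
free = 61.44 * MIB   # freeMemory reported in the same log
cap = memory_cap_bytes(total, 0.8)

print("cap: %.2f GiB" % (cap / GIB))
print("free: %.2f GiB" % (free / GIB))
print("cap fits in free memory:", cap <= free)
```

With only about 61 MiB reported free, a cap of this size cannot be met until whatever is holding the rest of the memory has been released, so the fraction setting probably works best together with the nvidia-smi check from the first answer.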

Answer 2 (score: 1)

After rebooting, I was able to run the sample code from tensorflow.org (https://www.tensorflow.org/guide/using_gpu) without any memory problems.

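Before running the TensorFlow sample code to check the 1080 Ti, I had run into difficulties with a Mask-RCNN model, as posted in "Mask RCNN Resource exhausted (OOM) on my own dataset". After replacing cuDNN 7.2.1 with 7.0.5, the resource exhausted (OOM) problem no longer occurred.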