仅通过声明TF Keras指标即可发现GPU内存不足错误

时间:2020-04-20 11:28:17

标签: docker tensorflow memory keras

我最近将我的代码从本地服务器转移到支持GPU的服务器上,并且遇到了一个奇怪的OOM错误。通过消除,问题似乎是TF Keras指标。我的代码现在已减少为

import tensorflow as tf

METRICS = [
      tf.keras.metrics.Precision(name='precision')
]

...但是我仍然遇到OOM错误。没有其他进程正在运行。我正在docker容器(tensorflow / tensorflow:latest-gpu-py3)btw中执行此操作,这可能是问题所在,但我找不到合适的参数来更改。

非常感谢您的帮助!

版本:Docker 17.12.1-ce,TF 2.1.0,Keras 2.3.1

Docker命令:

docker run --runtime=nvidia -it --rm -v tensorflow/tensorflow:latest-gpu-py3 bash

nvidia-smi输出:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 435.21       Driver Version: 435.21       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 00000000:17:00.0 Off |                  N/A |
| 54%   68C    P2   137W / 200W |   7931MiB /  8119MiB |     81%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 00000000:65:00.0 Off |                  N/A |
| 33%   34C    P8    10W / 200W |    115MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

整个错误输出如下:

2020-04-20 11:05:08.088874: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.6
2020-04-20 11:05:08.090195: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.6
2020-04-20 11:05:08.747745: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-04-20 11:05:08.751503: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:17:00.0 name: GeForce GTX 1080 computeCapability: 6.1
coreClock: 1.7715GHz coreCount: 20 deviceMemorySize: 7.93GiB deviceMemoryBandwidth: 298.32GiB/s
2020-04-20 11:05:08.751905: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 1 with properties:
pciBusID: 0000:65:00.0 name: GeForce GTX 1080 computeCapability: 6.1
coreClock: 1.7715GHz coreCount: 20 deviceMemorySize: 7.93GiB deviceMemoryBandwidth: 298.32GiB/s
2020-04-20 11:05:08.751936: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-20 11:05:08.751963: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-04-20 11:05:08.753290: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-04-20 11:05:08.753551: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-04-20 11:05:08.754983: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-04-20 11:05:08.755747: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-04-20 11:05:08.755786: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-20 11:05:08.757022: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0, 1
2020-04-20 11:05:08.757267: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-04-20 11:05:08.782042: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3600000000 Hz
2020-04-20 11:05:08.783237: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5dd24e0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-04-20 11:05:08.783274: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-04-20 11:05:08.996600: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5e37ca0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-04-20 11:05:08.996653: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce GTX 1080, Compute Capability 6.1
2020-04-20 11:05:08.996670: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (1): GeForce GTX 1080, Compute Capability 6.1
2020-04-20 11:05:08.998089: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:17:00.0 name: GeForce GTX 1080 computeCapability: 6.1
coreClock: 1.7715GHz coreCount: 20 deviceMemorySize: 7.93GiB deviceMemoryBandwidth: 298.32GiB/s
2020-04-20 11:05:08.999119: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 1 with properties:
pciBusID: 0000:65:00.0 name: GeForce GTX 1080 computeCapability: 6.1
coreClock: 1.7715GHz coreCount: 20 deviceMemorySize: 7.93GiB deviceMemoryBandwidth: 298.32GiB/s
2020-04-20 11:05:08.999175: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-20 11:05:08.999200: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-04-20 11:05:08.999241: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-04-20 11:05:08.999270: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-04-20 11:05:08.999298: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-04-20 11:05:08.999327: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-04-20 11:05:08.999359: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-20 11:05:09.004066: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0, 1
2020-04-20 11:05:09.004172: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-20 11:05:09.561399: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-20 11:05:09.561437: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102]      0 1
2020-04-20 11:05:09.561442: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0:   N Y
2020-04-20 11:05:09.561446: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 1:   Y N
2020-04-20 11:05:09.562474: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 37 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:17:00.0, compute capability: 6.1)
2020-04-20 11:05:09.563399: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 7460 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080, pci bus id: 0000:65:00.0, compute capability: 6.1)
2020-04-20 11:05:09.570968: I tensorflow/stream_executor/cuda/cuda_driver.cc:801] failed to allocate 37.56M (39387136 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-04-20 11:05:09.572125: I tensorflow/stream_executor/cuda/cuda_driver.cc:801] failed to allocate 33.81M (35448576 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
Traceback (most recent call last):
  File "sample.py", line 4, in <module>
    tf.keras.metrics.Precision(name='precision')
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/metrics.py", line 1186, in __init__
    initializer=init_ops.zeros_initializer)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/metrics.py", line 276, in add_weight
    aggregation=aggregation)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer.py", line 446, in add_weight
    caching_device=caching_device)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/tracking/base.py", line 744, in _add_variable_with_custom_getter
    **kwargs_for_getter)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer_utils.py", line 142, in make_variable
    shape=variable_shape if variable_shape else None)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variables.py", line 258, in __call__
    return cls._variable_v1_call(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variables.py", line 219, in _variable_v1_call
    shape=shape)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variables.py", line 197, in <lambda>
    previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variable_scope.py", line 2596, in default_variable_creator
    shape=shape)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variables.py", line 262, in __call__
    return super(VariableMetaclass, cls).__call__(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py", line 1411, in __init__
    distribute_strategy=distribute_strategy)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py", line 1557, in _init_from_args
    graph_mode=self._in_graph_mode)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py", line 232, in eager_safe_variable_handle
    shape, dtype, shared_name, name, graph_mode, initial_value)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py", line 164, in _variable_handle_from_shape_and_dtype
    math_ops.logical_not(exists), [exists], name="EagerVariableNameReuse")
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_logging_ops.py", line 55, in _assert
    _ops.raise_from_not_ok_status(e, name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 6606, in raise_from_not_ok_status
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: assertion failed: [0] [Op:Assert] name: EagerVariableNameReuse

0 个答案:

没有答案