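I recently moved my code from a local server to a GPU-enabled server and ran into a strange OOM error. By process of elimination, the problem seems to be the TF Keras metrics. My code is now reduced to: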
import tensorflow as tf
METRICS = [
    tf.keras.metrics.Precision(name='precision')
]
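...but I still get the OOM error. No other processes are running. By the way, I'm running this inside a Docker container (tensorflow/tensorflow:latest-gpu-py3), which may be the problem, but I can't find the right parameter to change.
Any help is greatly appreciated!
Versions: Docker 17.12.1-ce, TF 2.1.0, Keras 2.3.1
Docker command: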
docker run --runtime=nvidia -it --rm -v tensorflow/tensorflow:latest-gpu-py3 bash
nvidia-smi output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 435.21 Driver Version: 435.21 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1080 Off | 00000000:17:00.0 Off | N/A |
| 54% 68C P2 137W / 200W | 7931MiB / 8119MiB | 81% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 1080 Off | 00000000:65:00.0 Off | N/A |
| 33% 34C P8 10W / 200W | 115MiB / 8119MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
The full error output is as follows:
2020-04-20 11:05:08.088874: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.6
2020-04-20 11:05:08.090195: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.6
2020-04-20 11:05:08.747745: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-04-20 11:05:08.751503: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:17:00.0 name: GeForce GTX 1080 computeCapability: 6.1
coreClock: 1.7715GHz coreCount: 20 deviceMemorySize: 7.93GiB deviceMemoryBandwidth: 298.32GiB/s
2020-04-20 11:05:08.751905: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 1 with properties:
pciBusID: 0000:65:00.0 name: GeForce GTX 1080 computeCapability: 6.1
coreClock: 1.7715GHz coreCount: 20 deviceMemorySize: 7.93GiB deviceMemoryBandwidth: 298.32GiB/s
2020-04-20 11:05:08.751936: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-20 11:05:08.751963: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-04-20 11:05:08.753290: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-04-20 11:05:08.753551: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-04-20 11:05:08.754983: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-04-20 11:05:08.755747: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-04-20 11:05:08.755786: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-20 11:05:08.757022: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0, 1
2020-04-20 11:05:08.757267: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-04-20 11:05:08.782042: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3600000000 Hz
2020-04-20 11:05:08.783237: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5dd24e0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-04-20 11:05:08.783274: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-04-20 11:05:08.996600: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5e37ca0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-04-20 11:05:08.996653: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): GeForce GTX 1080, Compute Capability 6.1
2020-04-20 11:05:08.996670: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (1): GeForce GTX 1080, Compute Capability 6.1
2020-04-20 11:05:08.998089: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:17:00.0 name: GeForce GTX 1080 computeCapability: 6.1
coreClock: 1.7715GHz coreCount: 20 deviceMemorySize: 7.93GiB deviceMemoryBandwidth: 298.32GiB/s
2020-04-20 11:05:08.999119: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 1 with properties:
pciBusID: 0000:65:00.0 name: GeForce GTX 1080 computeCapability: 6.1
coreClock: 1.7715GHz coreCount: 20 deviceMemorySize: 7.93GiB deviceMemoryBandwidth: 298.32GiB/s
2020-04-20 11:05:08.999175: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-20 11:05:08.999200: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-04-20 11:05:08.999241: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-04-20 11:05:08.999270: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-04-20 11:05:08.999298: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-04-20 11:05:08.999327: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-04-20 11:05:08.999359: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-20 11:05:09.004066: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0, 1
2020-04-20 11:05:09.004172: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-20 11:05:09.561399: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-20 11:05:09.561437: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0 1
2020-04-20 11:05:09.561442: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N Y
2020-04-20 11:05:09.561446: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 1: Y N
2020-04-20 11:05:09.562474: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 37 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:17:00.0, compute capability: 6.1)
2020-04-20 11:05:09.563399: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 7460 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080, pci bus id: 0000:65:00.0, compute capability: 6.1)
2020-04-20 11:05:09.570968: I tensorflow/stream_executor/cuda/cuda_driver.cc:801] failed to allocate 37.56M (39387136 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-04-20 11:05:09.572125: I tensorflow/stream_executor/cuda/cuda_driver.cc:801] failed to allocate 33.81M (35448576 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
Traceback (most recent call last):
File "sample.py", line 4, in <module>
tf.keras.metrics.Precision(name='precision')
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/metrics.py", line 1186, in __init__
initializer=init_ops.zeros_initializer)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/metrics.py", line 276, in add_weight
aggregation=aggregation)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer.py", line 446, in add_weight
caching_device=caching_device)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/tracking/base.py", line 744, in _add_variable_with_custom_getter
**kwargs_for_getter)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer_utils.py", line 142, in make_variable
shape=variable_shape if variable_shape else None)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variables.py", line 258, in __call__
return cls._variable_v1_call(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variables.py", line 219, in _variable_v1_call
shape=shape)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variables.py", line 197, in <lambda>
previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variable_scope.py", line 2596, in default_variable_creator
shape=shape)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variables.py", line 262, in __call__
return super(VariableMetaclass, cls).__call__(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py", line 1411, in __init__
distribute_strategy=distribute_strategy)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py", line 1557, in _init_from_args
graph_mode=self._in_graph_mode)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py", line 232, in eager_safe_variable_handle
shape, dtype, shared_name, name, graph_mode, initial_value)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py", line 164, in _variable_handle_from_shape_and_dtype
math_ops.logical_not(exists), [exists], name="EagerVariableNameReuse")
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_logging_ops.py", line 55, in _assert
_ops.raise_from_not_ok_status(e, name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 6606, in raise_from_not_ok_status
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: assertion failed: [0] [Op:Assert] name: EagerVariableNameReuse
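One thing that stands out in the output above: nvidia-smi reports GPU 0 at 7931MiB / 8119MiB, and the log shows TensorFlow creating GPU:0 with only 37 MB, so the metric may simply be the first thing that tries to allocate on an almost full device. Below is a minimal sketch, assuming that is the cause, of picking the GPU with the most free memory and hiding the others before TensorFlow is imported. The helper names (`free_memory_per_gpu`, `pick_least_used_gpu`) and the `SAMPLE` string are hypothetical and only for illustration; `CUDA_VISIBLE_DEVICES` and `CUDA_DEVICE_ORDER` are standard CUDA environment variables.

```python
import os
import re

# Matches the "used / total" memory column of a plain nvidia-smi table row,
# e.g. "7931MiB / 8119MiB". Power readings such as "137W / 200W" do not match.
_MEM = re.compile(r"(\d+)MiB\s*/\s*(\d+)MiB")


def free_memory_per_gpu(smi_text):
    """Return free MiB per GPU, in the order the rows appear (hypothetical helper)."""
    return [int(total) - int(used) for used, total in _MEM.findall(smi_text)]


def pick_least_used_gpu(smi_text):
    """Return the index of the GPU with the most free memory, or None if none found."""
    free = free_memory_per_gpu(smi_text)
    if not free:
        return None
    return max(range(len(free)), key=free.__getitem__)


# The two memory rows copied from the nvidia-smi output in this post.
SAMPLE = """
| 54% 68C P2 137W / 200W | 7931MiB / 8119MiB | 81% Default |
| 33% 34C P8 10W / 200W | 115MiB / 8119MiB | 0% Default |
"""

gpu = pick_least_used_gpu(SAMPLE)
if gpu is not None:
    # Make CUDA number devices the same way nvidia-smi does (by PCI bus ID).
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    # Both variables must be set before TensorFlow is imported to take effect.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu)

# import tensorflow as tf  # TensorFlow would now see only the selected GPU
```

In practice the text would come from running `nvidia-smi` on the host rather than from a pasted string. If the assumption holds, setting `CUDA_VISIBLE_DEVICES=1` in the shell before running sample.py should have the same effect.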