I am trying to train a model with the TensorFlow 2 Object Detection API, but when I start the training process I get the following error message:
2021-08-01 08:38:32.187042: W tensorflow/core/common_runtime/bfc_allocator.cc:467] __________________________________________________________________________*****x__****x**_**********
2021-08-01 08:38:32.187117: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at gather_op.cc:158 : Resource exhausted: OOM when allocating tensor with shape[3763200,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
File "Tensorflow/models/research/object_detection/model_main_tf2.py", line 115, in <module>
tf.compat.v1.app.run()
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 303, in run
_run_main(main, args)
File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "Tensorflow/models/research/object_detection/model_main_tf2.py", line 112, in main
record_summaries=FLAGS.record_summaries)
File "/usr/local/lib/python3.7/dist-packages/object_detection/model_lib_v2.py", line 603, in train_loop
train_input, unpad_groundtruth_tensors)
File "/usr/local/lib/python3.7/dist-packages/object_detection/model_lib_v2.py", line 394, in load_fine_tune_checkpoint
_ensure_model_is_built(model, input_dataset, unpad_groundtruth_tensors)
File "/usr/local/lib/python3.7/dist-packages/object_detection/model_lib_v2.py", line 176, in _ensure_model_is_built
labels,
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 1285, in run
return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 2833, in call_for_each_replica
return self._call_for_each_replica(fn, args, kwargs)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/mirrored_strategy.py", line 679, in _call_for_each_replica
self._container_strategy(), fn, args, kwargs)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/mirrored_run.py", line 86, in call_for_each_replica
return wrapped(args, kwargs)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/def_function.py", line 889, in __call__
result = self._call(*args, **kwds)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/def_function.py", line 950, in _call
return self._stateless_fn(*args, **kwds)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py", line 3024, in __call__
filtered_flat_args, captured_inputs=graph_function.captured_inputs) # pylint: disable=protected-access
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py", line 1961, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py", line 596, in call
ctx=ctx)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[3763200,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node MultiLevelMatMulCropAndResize/MultiLevelRoIAlign/GatherV2_1 (defined at /local/lib/python3.7/dist-packages/object_detection/utils/spatial_transform_ops.py:275) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[Op:__inference__dummy_computation_fn_67462]
Errors may have originated from an input operation.
Input Source operations connected to node MultiLevelMatMulCropAndResize/MultiLevelRoIAlign/GatherV2_1:
MultiLevelMatMulCropAndResize/MultiLevelRoIAlign/mul_14 (defined at /local/lib/python3.7/dist-packages/object_detection/utils/spatial_transform_ops.py:274)
Function call stack:
_dummy_computation_fn
I have tried many models from the API, but I cannot get past this error. I am using Colab as my working environment, and I also tried to start training on a local machine with an NVIDIA GTX 1660 Ti and 6 GB of VRAM, but the same error persists. Incidentally, when I change the model's batch size (specifically, when I reduce it to a lower value such as 6 or 7), the error message keeps changing (and since it is far too long, over 5,000 lines, I cannot share it). Can anyone help me solve this problem?
Answer 0 (score: 0)
Here is the most important part of the error message:
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[3763200,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
You ran out of memory while trying to allocate a tensor of shape 3,763,200 by 1,024. Let's do the math quickly: a matrix with 3,763,200 rows and 1,024 columns has 3,853,516,800 (almost 3.9 billion!) entries. Assuming you are using float32, each entry is 32 bits, or 4 bytes, which comes to about 15.4 billion bytes, so the tensor you are trying to store is roughly 15.4 GB and will not fit on a 6 GB GPU (and I don't believe Colab offers a 16 GB GPU either).
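To double-check that arithmetic, here is a tiny sanity-check script that reproduces the calculation from the shape reported in the OOM message:

```python
# Size of the tensor reported in the OOM message: shape [3763200, 1024], dtype float32.
rows, cols = 3_763_200, 1_024
entries = rows * cols                 # total number of float entries
bytes_needed = entries * 4            # float32 = 4 bytes per entry
gigabytes = bytes_needed / 1e9        # decimal gigabytes

print(f"{entries:,} entries -> {bytes_needed:,} bytes -> {gigabytes:.1f} GB")
# 3,853,516,800 entries -> 15,414,067,200 bytes -> 15.4 GB
```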
You are right to consider changing the batch size: smaller batches should fit more easily on whatever hardware you are using. Try a batch size of 1 (that will definitely fit!) and increase from there until you see the error again.
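In the TF2 Object Detection API, the batch size lives in the `train_config` section of your model's `pipeline.config` file. A minimal sketch of the relevant fragment (field names are from the API's protobuf config; the other values shown are placeholders you would keep from your own config):

```
train_config {
  batch_size: 1          # start at 1, then increase until you hit OOM again
  # ... keep the rest of your existing train_config unchanged ...
}
```

After editing the file, rerun `model_main_tf2.py` with the same `--pipeline_config_path` as before.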