I want to train my model with the multi_gpu_model() function, but it doesn't work. I get the following error:
ValueError: To call `multi_gpu_model` with `gpus=2`, we expect the following devices to be available: ['/cpu:0', '/gpu:0', '/gpu:1']. However this machine only has: ['/cpu:0', '/xla_gpu:0', '/xla_gpu:1', '/xla_gpu:2', '/xla_gpu:3', '/xla_cpu:0']. Try reducing `gpus`.
Is there a way to fix this?
Machine specs: I'm running on the AWS EC2 Deep Learning AMI (Ubuntu) Version 24.2 (ami-0ba6d589ad99d7604).
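For reference, this is roughly how I call it (a minimal sketch; build_model() is just a stand-in for my real model):

import tensorflow as tf
from tensorflow import keras

def build_model():
    # Stand-in for my actual model definition
    model = keras.Sequential([
        keras.layers.Dense(64, activation='relu', input_shape=(32,)),
        keras.layers.Dense(1),
    ])
    return model

model = build_model()
# This is the line that raises the ValueError above: the GPUs are exposed
# as /xla_gpu:N instead of /gpu:N, so multi_gpu_model does not find them.
parallel_model = tf.keras.utils.multi_gpu_model(model, gpus=2)
parallel_model.compile(optimizer='adam', loss='mse')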
Switching to tf.distribute.MirroredStrategy() produces a different error instead: something about not being able to unpack sample_weights.
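This is roughly what that attempt looks like (again only a sketch; the random numpy arrays stand in for my real training and validation data):

import numpy as np
import tensorflow as tf
from tensorflow import keras

# Dummy data standing in for my real datasets
x_train = np.random.rand(320, 32).astype('float32')
y_train = np.random.rand(320, 1).astype('float32')
x_val = np.random.rand(320, 32).astype('float32')
y_val = np.random.rand(320, 1).astype('float32')

train_ds = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(32)
val_ds = tf.data.Dataset.from_tensor_slices((x_val, y_val)).batch(32)

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = keras.Sequential([
        keras.layers.Dense(64, activation='relu', input_shape=(32,)),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer='adam', loss='mse')

# The datasets yield (x, y) pairs only; I never pass sample_weight anywhere,
# yet the unpacking error below expects a third element.
model.fit(train_ds, epochs=10, steps_per_epoch=10,
          validation_data=val_ds, validation_steps=10, verbose=1)

The full traceback: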
Train on 10 steps, validate on 10 steps
Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/train/code/sc-ai-model-training/src/main/scripts/run_training.py", line 94, in <module>
    main()
  File "/home/train/code/sc-ai-model-training/src/main/scripts/run_training.py", line 79, in main
    verbose=1)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/engine/training.py", line 643, in fit
    use_multiprocessing=use_multiprocessing)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/engine/training_distributed.py", line 681, in fit
    steps_name='steps_per_epoch')
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/engine/training_arrays.py", line 294, in model_iteration
    batch_outs = f(actual_inputs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/distribute/distributed_training_utils.py", line 814, in execution_function
    return [out.numpy() for out in distributed_function(input_fn)]
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/eager/def_function.py", line 416, in __call__
    self._initialize(args, kwds, add_initializers_to=initializer_map)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/eager/def_function.py", line 359, in _initialize
    *args, **kwds))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/eager/function.py", line 1360, in _get_concrete_function_internal_garbage_collected
    graph_function, _, _ = self._maybe_define_function(args, kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/eager/function.py", line 1648, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/eager/function.py", line 1541, in _create_graph_function
    capture_by_value=self._capture_by_value),
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/func_graph.py", line 716, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/eager/def_function.py", line 309, in wrapped_fn
    return weak_wrapped_fn().__wrapped__(*args, **kwds)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/func_graph.py", line 706, in wrapper
    raise e.ag_error_metadata.to_exception(type(e))
ValueError: in converted code:
    relative to /usr/local/lib/python3.5/dist-packages/tensorflow/python/keras:
    distribute/distributed_training_utils.py:799 distributed_function  *
        x, y, sample_weights = input_fn()
    engine/training_arrays.py:506 get_distributed_inputs
        model, inputs, targets, sample_weights, mode)
    distribute/distributed_training_utils.py:580 _prepare_feed_values
        inputs, targets, sample_weights = _get_input_from_iterator(inputs, model)
    distribute/distributed_training_utils.py:558 _get_input_from_iterator
        x, y, sample_weights = next_element
    ValueError: not enough values to unpack (expected 3, got 2)