Training with multi_gpu_model does not work in TF 2.0.0b1

Time: 2019-09-28 18:08:44

Tags: tensorflow keras tf.keras

I want to train my model with the multi_gpu_model() function. However, it does not work, and I get the following error:

ValueError: To call `multi_gpu_model` with `gpus=2`, we expect the following devices to be available: ['/cpu:0', '/gpu:0', '/gpu:1']. However this machine only has: ['/cpu:0', '/xla_gpu:0', '/xla_gpu:1', '/xla_gpu:2', '/xla_gpu:3', '/xla_cpu:0']. Try reducing `gpus`.

Is there a way to fix this?

Machine specs: I am using AWS EC2 with the Deep Learning AMI (Ubuntu) Version 24.2 (ami-0ba6d589ad99d7604).
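
To make the first error easier to read, here is a rough sketch of the kind of device-name check the message describes. It is an approximation written for illustration, not the Keras source, and the helper name `missing_devices` is hypothetical. The device list is copied from the error message above.

```python
def missing_devices(available, gpus):
    """Return the devices a `gpus`-way setup expects but cannot find.

    Hypothetical helper: it only mirrors the wording of the error message,
    which expects '/cpu:0' plus '/gpu:0' .. '/gpu:<gpus - 1>'.
    """
    required = ['/cpu:0'] + ['/gpu:%d' % i for i in range(gpus)]
    return [name for name in required if name not in available]


# Device names reported by the machine, taken from the error message.
available = ['/cpu:0', '/xla_gpu:0', '/xla_gpu:1', '/xla_gpu:2',
             '/xla_gpu:3', '/xla_cpu:0']

print(missing_devices(available, gpus=2))  # ['/gpu:0', '/gpu:1']
```

So the four `/xla_gpu:N` entries do not count as `/gpu:N` for this check. One possible reason, which I have not confirmed on this AMI, is that the regular GPU devices failed to register (for example because of a CUDA library mismatch) and only the XLA devices are visible.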

Using MirroredStrategy() produces a different error: something fails to unpack sample_weights.

Train on 10 steps, validate on 10 steps
Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/train/code/sc-ai-model-training/src/main/scripts/run_training.py", line 94, in <module>
    main()
  File "/home/train/code/sc-ai-model-training/src/main/scripts/run_training.py", line 79, in main
    verbose=1)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/engine/training.py", line 643, in fit
    use_multiprocessing=use_multiprocessing)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/engine/training_distributed.py", line 681, in fit
    steps_name='steps_per_epoch')
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/engine/training_arrays.py", line 294, in model_iteration
    batch_outs = f(actual_inputs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/distribute/distributed_training_utils.py", line 814, in execution_function
    return [out.numpy() for out in distributed_function(input_fn)]
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/eager/def_function.py", line 416, in __call__
    self._initialize(args, kwds, add_initializers_to=initializer_map)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/eager/def_function.py", line 359, in _initialize
    *args, **kwds))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/eager/function.py", line 1360, in _get_concrete_function_internal_garbage_collected
    graph_function, _, _ = self._maybe_define_function(args, kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/eager/function.py", line 1648, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/eager/function.py", line 1541, in _create_graph_function
    capture_by_value=self._capture_by_value),
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/func_graph.py", line 716, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/eager/def_function.py", line 309, in wrapped_fn
    return weak_wrapped_fn().__wrapped__(*args, **kwds)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/func_graph.py", line 706, in wrapper
    raise e.ag_error_metadata.to_exception(type(e))
ValueError: in converted code:
    relative to /usr/local/lib/python3.5/dist-packages/tensorflow/python/keras:

    distribute/distributed_training_utils.py:799 distributed_function  *
        x, y, sample_weights = input_fn()
    engine/training_arrays.py:506 get_distributed_inputs
        model, inputs, targets, sample_weights, mode)
    distribute/distributed_training_utils.py:580 _prepare_feed_values
        inputs, targets, sample_weights = _get_input_from_iterator(inputs, model)
    distribute/distributed_training_utils.py:558 _get_input_from_iterator
        x, y, sample_weights = next_element

    ValueError: not enough values to unpack (expected 3, got 2)
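
The last frame unpacks each element from the input iterator into three names, so an element of the form (x, y) would raise exactly this message. The plain-Python sketch below reproduces only that unpacking problem; it does not use TensorFlow. The helper `as_xyw` is hypothetical, and whether supplying a third element (sample weights) from the input pipeline avoids the error in 2.0.0b1 is an untested assumption.

```python
def as_xyw(element):
    """Pad an (x, y) element to (x, y, None) so three-way unpacking works.

    Hypothetical helper for illustration only; it is not a TensorFlow API.
    """
    if len(element) == 2:
        x, y = element
        return x, y, None
    x, y, sample_weights = element
    return x, y, sample_weights


batch = ([1.0, 2.0], [0, 1])  # an (inputs, targets) pair with no weights

try:
    x, y, sample_weights = batch  # same pattern as the failing line
except ValueError as err:
    message = str(err)
    print(message)  # not enough values to unpack (expected 3, got 2)

x, y, sample_weights = as_xyw(batch)  # succeeds, sample_weights is None
```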

0 answers:

No answers yet