TensorFlow 2.1 with TPUEstimator: RuntimeError: All tensors outfed from TPU should preserve batch size dimension, but got scalar Tensor

Time: 2020-06-08 16:03:47

Tags: python-3.x tensorflow2.0 tensorflow-estimator tpu

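I just converted an existing project that uses the TPUEstimator API from TF 1.14 to TF 2.1. After the conversion, local testing (i.e. use_tpu = False) runs successfully. However, running on a Google Cloud TPU (i.e. use_tpu = True) fails with an error.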

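Note: this is in the context of the AdaNet AutoML framework (v0.8.0), although I suspect it may be a general TPUEstimator-related error, since the errors appear to originate in the tpu_estimator.py and error_handling.py scripts, as the traceback below shows: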

  File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3032, in train
    rendezvous.record_error('training_loop', sys.exc_info())
  File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 81, in record_error
    if value and value.op and value.op.type == _CHECK_NUMERIC_OP_NAME:
  AttributeError: 'RuntimeError' object has no attribute 'op'

  During handling of the above exception, another exception occurred:  

  File "workspace/trainer/train.py", line 331, in <module>
    main(args=parsed_args)
  File "workspace/trainer/train.py", line 177, in main
    run_config=run_config)
  File "workspace/trainer/train.py", line 68, in run_experiment
    estimator.train(input_fn=train_input_fn, max_steps=total_train_steps)
  File "/usr/local/lib/python3.6/site-packages/adanet/core/estimator.py", line 853, in train
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3035, in train
    rendezvous.raise_errors()
  File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 143, in raise_errors
    six.reraise(typ, value, traceback)
  File "/usr/local/lib/python3.6/site-packages/six.py", line 703, in reraise
    raise value
  File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3030, in train
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 374, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1164, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1194, in _train_model_default
    features, labels, ModeKeys.TRAIN, self.config)
  File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2857, in _call_model_fn
    config)
  File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1152, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3186, in _model_fn
    host_ops = host_call.create_tpu_hostcall()
  File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2226, in create_tpu_hostcall
    'dimension, but got scalar {}'.format(dequeue_ops[i][0]))
RuntimeError: All tensors outfed from TPU should preserve batch size dimension, but got scalar Tensor("OutfeedDequeueTuple:1", shape=(), dtype=int64, device=/job:tpu_worker/task:0/device:CPU:0)'

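The previous version of the project, built on TF 1.14, ran without problems both locally and on TPU with TPUEstimator. Is there something obvious that might not carry over to TF 2.1 when using the TPUEstimator API?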

1 answer:

Answer 0 (score: 0)

Have you applied the following:

dataset = ...
# tf.contrib was removed in TF 2.x; batch(..., drop_remainder=True) replaces batch_and_drop_remainder
dataset = dataset.batch(batch_size, drop_remainder=True)

This may drop the last few samples from the file so that every batch has a static shape of batch_size, which is required for training on TPUs.
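
To illustrate the static-shape point, here is a minimal sketch, assuming TensorFlow 2.x with eager execution and a toy range dataset. The helper name make_batched_dataset and the sizes 10 and 4 are made up for illustration and do not come from the question. It compares the batch dimension that tf.data reports with and without drop_remainder.

```python
import tensorflow as tf


def make_batched_dataset(num_examples, batch_size, drop_remainder):
    """Hypothetical helper: batches a toy dataset of integers 0..num_examples-1."""
    dataset = tf.data.Dataset.range(num_examples)
    return dataset.batch(batch_size, drop_remainder=drop_remainder)


# 10 examples with batch size 4: the last batch would hold only 2 elements.
static_ds = make_batched_dataset(10, 4, drop_remainder=True)
dynamic_ds = make_batched_dataset(10, 4, drop_remainder=False)

# With drop_remainder=True the batch dimension is known statically;
# without it, the dimension is unknown (None) because the last batch may be short.
static_batch_dim = static_ds.element_spec.shape[0]
dynamic_batch_dim = dynamic_ds.element_spec.shape[0]

# The price of the static shape: the partial final batch is discarded.
num_static_batches = sum(1 for _ in static_ds)
num_dynamic_batches = sum(1 for _ in dynamic_ds)

print("static batch dim:", static_batch_dim, "batches:", num_static_batches)
print("dynamic batch dim:", dynamic_batch_dim, "batches:", num_dynamic_batches)
```

Note that the error in the question mentions a scalar tensor being outfed from the TPU, so a static batch size in the input pipeline may not be the whole story. Treat this as one thing to check rather than a confirmed fix.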
