UnimplementedError when training on a TPU in Colab

Asked: 2021-07-13 02:32:13

Tags: tensorflow deep-learning google-colaboratory

When trying to train my model on a TPU in Colab with

model.fit(train_dataset,
          steps_per_epoch = len(df_train) // config.BATCH_SIZE,
          validation_data = valid_dataset,
          epochs = config.EPOCHS)
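
For context, the strategy object used in the model definition further below comes from the standard Colab TPU boilerplate. A minimal sketch of that setup (assuming TF 2.4+, where tf.distribute.TPUStrategy is non-experimental):

import tensorflow as tf

# Connect to the Colab-provided TPU; tpu='' resolves the default runtime TPU.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)
print('Number of replicas:', strategy.num_replicas_in_sync)  # 8 on a v2-8/v3-8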

running the fit call produces this error, with the full traceback:

UnimplementedError                        Traceback (most recent call last)
<ipython-input-37-92afbe2b5ae5> in <module>()
      2           steps_per_epoch = len(df_train) // config.BATCH_SIZE,
      3           validation_data = valid_dataset,
----> 4           epochs = config.EPOCHS)

13 frames
/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
   1186               logs = tmp_logs  # No error, now safe to assign to logs.
   1187               end_step = step + data_handler.step_increment
-> 1188               callbacks.on_train_batch_end(end_step, logs)
   1189               if self.stop_training:
   1190                 break

/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/callbacks.py in on_train_batch_end(self, batch, logs)
    455     """
    456     if self._should_call_train_batch_hooks:
--> 457       self._call_batch_hook(ModeKeys.TRAIN, 'end', batch, logs=logs)
    458 
    459   def on_test_batch_begin(self, batch, logs=None):

/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/callbacks.py in _call_batch_hook(self, mode, hook, batch, logs)
    315       self._call_batch_begin_hook(mode, batch, logs)
    316     elif hook == 'end':
--> 317       self._call_batch_end_hook(mode, batch, logs)
    318     else:
    319       raise ValueError('Unrecognized hook: {}'.format(hook))

/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/callbacks.py in _call_batch_end_hook(self, mode, batch, logs)
    335       self._batch_times.append(batch_time)
    336 
--> 337     self._call_batch_hook_helper(hook_name, batch, logs)
    338 
    339     if len(self._batch_times) >= self._num_batches_for_timing_check:

/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/callbacks.py in _call_batch_hook_helper(self, hook_name, batch, logs)
    373     for callback in self.callbacks:
    374       hook = getattr(callback, hook_name)
--> 375       hook(batch, logs)
    376 
    377     if self._check_timing:

/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/callbacks.py in on_train_batch_end(self, batch, logs)
   1027 
   1028   def on_train_batch_end(self, batch, logs=None):
-> 1029     self._batch_update_progbar(batch, logs)
   1030 
   1031   def on_test_batch_end(self, batch, logs=None):

/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/callbacks.py in _batch_update_progbar(self, batch, logs)
   1099     if self.verbose == 1:
   1100       # Only block async when verbose = 1.
-> 1101       logs = tf_utils.sync_to_numpy_or_python_type(logs)
   1102       self.progbar.update(self.seen, list(logs.items()), finalize=False)
   1103 

/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/utils/tf_utils.py in sync_to_numpy_or_python_type(tensors)
    517     return t  # Don't turn ragged or sparse tensors to NumPy.
    518 
--> 519   return nest.map_structure(_to_single_numpy_or_python_type, tensors)
    520 
    521 

/usr/local/lib/python3.7/dist-packages/tensorflow/python/util/nest.py in map_structure(func, *structure, **kwargs)
    865 
    866   return pack_sequence_as(
--> 867       structure[0], [func(*x) for x in entries],
    868       expand_composites=expand_composites)
    869 

/usr/local/lib/python3.7/dist-packages/tensorflow/python/util/nest.py in <listcomp>(.0)
    865 
    866   return pack_sequence_as(
--> 867       structure[0], [func(*x) for x in entries],
    868       expand_composites=expand_composites)
    869 

/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/utils/tf_utils.py in _to_single_numpy_or_python_type(t)
    513   def _to_single_numpy_or_python_type(t):
    514     if isinstance(t, ops.Tensor):
--> 515       x = t.numpy()
    516       return x.item() if np.ndim(x) == 0 else x
    517     return t  # Don't turn ragged or sparse tensors to NumPy.

/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py in numpy(self)
   1092     """
   1093     # TODO(slebedev): Consider avoiding a copy for non-CPU or remote tensors.
-> 1094     maybe_arr = self._numpy()  # pylint: disable=protected-access
   1095     return maybe_arr.copy() if isinstance(maybe_arr, np.ndarray) else maybe_arr
   1096 

/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py in _numpy(self)
   1060       return self._numpy_internal()
   1061     except core._NotOkStatusException as e:  # pylint: disable=protected-access
-> 1062       six.raise_from(core._status_to_exception(e.code, e.message), None)  # pylint: disable=protected-access
   1063 
   1064   @property

/usr/local/lib/python3.7/dist-packages/six.py in raise_from(value, from_value)

UnimplementedError: 9 root error(s) found.
  (0) Unimplemented: {{function_node __inference_train_function_88574}} Asked to propagate a dynamic dimension from hlo convolution.24975@{}@2 to hlo %all-reduce.24980 = f32[3,3,<=3,32]{3,2,1,0} all-reduce(f32[3,3,<=3,32]{3,2,1,0} %convolution.24975), replica_groups={{0,1,2,3,4,5,6,7}}, to_apply=%sum.24976, metadata={op_type="CrossReplicaSum" op_name="while/body/_1/while/Adam/CrossReplicaSum"}, which is not implemented.
     [[{{node TPUReplicate/_compile/_18168620323984915962/_4}}]]
     [[while/body/_1/while/strided_slice_1/_253]]
  (1) Unimplemented: {{function_node __inference_train_function_88574}} Asked to propagate a dynamic dimension from hlo convolution.24975@{}@2 to hlo %all-reduce.24980 = f32[3,3,<=3,32]{3,2,1,0} all-reduce(f32[3,3,<=3,32]{3,2,1,0} %convolution.24975), replica_groups={{0,1,2,3,4,5,6,7}}, to_apply=%sum.24976, metadata={op_type="CrossReplicaSum" op_name="while/body/_1/while/Adam/CrossReplicaSum"}, which is not implemented.
     [[{{node TPUReplicate/_compile/_18168620323984915962/_4}}]]
     [[TPUReplicate/_compile/_18168620323984915962/_4/_243]]
  (2) Unimplemented: {{function_node __inference_train_function_88574}} Asked to propagate a dynamic dimension from hlo convolution.24975@{}@2 to hlo %all-reduce.24980 = f32[3,3,<=3,32]{3,2,1,0} all-reduce(f32[3,3,<=3,32]{3,2,1,0} %convolution.24975), replica_groups={{0,1,2,3,4,5,6,7}}, to_apply=%sum.24976, metadata={op_type="CrossReplicaSum" op_name="while/body/_1/while/Adam/CrossReplicaSum"}, which is not implemented.[truncated]

Things I have checked:

  • My data lives in a GCS bucket and can be read back with the dataset object I created (a sketch of the pipeline follows the model definition below).
  • My model definition:
import tensorflow as tf
import efficientnet.tfkeras as efn  # assuming qubvel's efficientnet package, the usual source of efn

with strategy.scope():
  base_model = efn.EfficientNetB0(include_top=False)
  model = tf.keras.Sequential([
      tf.keras.layers.Input(shape=(config.IMG_SIZE, config.IMG_SIZE, 3)),
      base_model,
      tf.keras.layers.GlobalAveragePooling2D(),
      tf.keras.layers.Dense(5, activation='softmax')
  ])

  model.compile(optimizer = tf.keras.optimizers.Adam(learning_rate=config.LR),
                loss = tf.keras.losses.SparseCategoricalCrossentropy(),
                metrics = [tf.keras.metrics.SparseCategoricalAccuracy()],
                steps_per_execution = 32)
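
For reference, the input pipeline is roughly the following (a minimal sketch: the bucket path, the TFRecord feature names, and parse_example are placeholders, not my exact code). Note the drop_remainder=True, since as far as I understand XLA on TPU wants a static batch dimension:

import tensorflow as tf

AUTO = tf.data.AUTOTUNE  # tf.data.experimental.AUTOTUNE on older TF versions

def parse_example(serialized):
  # Placeholder parser: decodes one image/label pair from a TFRecord.
  features = {
      'image': tf.io.FixedLenFeature([], tf.string),
      'label': tf.io.FixedLenFeature([], tf.int64),
  }
  example = tf.io.parse_single_example(serialized, features)
  image = tf.image.decode_jpeg(example['image'], channels=3)  # channels=3 keeps the channel dim static
  image = tf.image.resize(image, [config.IMG_SIZE, config.IMG_SIZE])
  return image, example['label']

filenames = tf.io.gfile.glob('gs://my-bucket/train/*.tfrec')  # hypothetical path
train_dataset = (tf.data.TFRecordDataset(filenames, num_parallel_reads=AUTO)
                 .map(parse_example, num_parallel_calls=AUTO)
                 .batch(config.BATCH_SIZE, drop_remainder=True)
                 .prefetch(AUTO))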

Any idea why this happens? The error says it was asked to propagate a dynamic dimension, but I don't see why that should be the case, given that the same model trains fine in a GPU setup (with the data present in the current session).
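
One check that might help narrow this down: inspecting what shapes the dataset actually yields, since any None in the element spec is a dynamic dimension that the XLA compiler on TPU may refuse to handle. A quick probe (train_dataset being the dataset passed to fit):

# Any None below is a dynamic dimension; on TPU those usually come from
# batching without drop_remainder=True or from a decode step that leaves
# a dimension (e.g. channels) unspecified.
print(train_dataset.element_spec)

Interestingly, the dynamic dimension in the message sits on the input-channel axis of a 3x3 convolution kernel (f32[3,3,<=3,32]), not on the batch axis, so perhaps it is the channel dimension that ends up unknown to the compiler.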

0 Answers:

No answers yet.