Question

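I am trying to run an LRCN Keras model on a TPU using TensorFlow 2.0. The model and generator work on CPU/GPU, but I have included them for reference. I have also initialized the TPU, it is visible, and everything looks fine except when running .fit():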

def frame_generator(self, batch_size, train_test, data_type):
    """Return a generator that we can use to train on. There are
    a couple different things we can return:
    data_type: 'features', 'images'
    """
    # Get the right dataset for the generator.
    train, test = self.split_train_test()
    data = train if train_test == 'train' else test

    #print("Creating %s generator with %d samples." % (train_test, len(data)))

    while 1:
        X, y = [], []

        # Generate batch_size samples.
        for _ in range(batch_size):
            if random.random() < .5:
                # real
                while True:
                    # Get a random sample.
                    sample = random.choice(data)

                    # Get the sequence from disk.
                    (_x,_y) = self.get_extracted_sequence(data_type, sample)

                    if _y==[0,1]:
                        break
            else:
                 # fake
                while True:
                    # Get a random sample.
                    sample = random.choice(data)

                    # Get the sequence from disk.
                    (_x,_y) = self.get_extracted_sequence(data_type, sample)

                    if _y==[1,0]:
                        break

            if _x is None:
                raise ValueError("Can't find sequence. Did you generate them?", sample)

            X.append(_x)
            y.append(_y)

        #yield [np.array(X), np.array(y)], np.array(y)
        yield np.array(X), np.array(y)

train_generator = data.frame_generator(batch_size, 'train', 'images')
val_generator = data.frame_generator(batch_size, 'test', 'images')

optimizer = Adam(lr=1e-5)

with tpu_strategy.scope():
  model = lrcn()
  model.add(tf.keras.layers.Dense(2, activation='softmax'))

  model.compile(loss='binary_crossentropy',
      optimizer=optimizer,
      metrics=['accuracy', tf.compat.v1.losses.log_loss])
  model.summary() 

train_data = tf.data.Dataset.from_generator(lambda:next(train_generator),
                                        (tf.float32, tf.int64),
                                        ([4, 32,299,299,3], [4,2])     
                                      )

val_data = tf.data.Dataset.from_generator(lambda:next(val_generator),
                                        (tf.float32, tf.int64),
                                      ([4, 32,299,299,3], [4,2]) 
                                      )


model.fit(x=train_data, steps_per_epoch=train_steps, validation_steps=test_steps,
      validation_data=val_data,
        epochs=30,
        callbacks=callbacks,
        verbose=1)

On model.fit, I get:

Train for 6421.0 steps, validate for 1605.0 steps

Epoch 1/30

UnavailableError Traceback (most recent call last)
in ()
15 epochs=30,
16 callbacks=callbacks,
---> 17 verbose=1)

11 frames
/usr/local/lib/python3.6/dist-packages/six.py in raise_from(value, from_value)

UnavailableError: channel is in state TRANSIENT_FAILURE Additional GRPC error information: {"created":"@1584561754.347859160","description":"channel is in state TRANSIENT_FAILURE","file":"external/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":2294,"grpc_status":14} [Op:__inference_distributed_function_24182 channel is in state TRANSIENT_FAILURE","file":"external/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":2294,"grpc_status":14} [Op:__inference_distributed_function_10577]

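Any ideas how to fix this? It looks like the problem is on Google's network side.

Update:

Part of the solution is that you should not install TensorFlow 2.1 with pip in the Colab notebook. Instead, run the following in its own cell before "import tensorflow":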

%tensorflow_version 2.x

This changes the TPU version from 1.15 to >= 2.1.

Now when I run the notebook I get more details:

Train for 6902.0 steps, validate for 1725.0 steps
Epoch 1/30

1/6902 [..............................] - ETA: 20:04:55

NotFoundError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_v2.py in on_epoch(self, epoch, mode)
766 try:
--> 767 yield epoch_logs
768 finally:

18 frames
NotFoundError: {{function_node __inference_distributed_function_20824}} No registered 'PyFunc' OpKernel for 'CPU' devices compatible with node {{node PyFunc}} . Registered:

 [[PyFunc]]
 [[MultiDeviceIteratorGetNextFromShard]]
 [[RemoteCall]]
 [[IteratorGetNextAsOptional]]

During handling of the above exception, another exception occurred:

KeyError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/callbacks.py in _get_file_path(self, epoch, logs)
1053
1054 ) or multi_worker_util.should_save_checkpoint():
-> 1055 return self.filepath.format(epoch=epoch + 1, **logs)
1056 else:
1057 # If this is multi-worker training, and this worker should not

KeyError: 'val_accuracy'
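The second exception is probably a side effect of the first rather than a separate problem. The traceback shows the checkpoint callback calling self.filepath.format(epoch=epoch + 1, **logs); if the filename template names val_accuracy and the epoch aborts before any validation metrics are logged, that key is missing. A minimal pure-Python sketch of that failure, assuming a hypothetical template (the actual callbacks are not shown in the question):

```python
# Hypothetical checkpoint filename template; the real one is not shown in the question.
filepath = "checkpoint.{epoch:03d}-{val_accuracy:.3f}.hdf5"


def get_file_path(filepath, epoch, logs):
    """Mirror the formatting call visible in the traceback."""
    return filepath.format(epoch=epoch + 1, **logs)


# After a completed epoch with validation, the key exists and formatting works.
complete_logs = {"accuracy": 0.91, "val_accuracy": 0.88}
ok_path = get_file_path(filepath, 0, complete_logs)

# If the epoch aborts before validation, the logs lack 'val_accuracy'.
try:
    get_file_path(filepath, 0, {})
    missing_key = None
except KeyError as err:
    missing_key = err.args[0]

print(ok_path)
print(missing_key)
```

If this reading is right, fixing the PyFunc error should make the KeyError disappear as well.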

Answer 1

TL;DR

You need to install a more up-to-date version, which executes the Python function before sending it to the TPU. Load the new version via:

import requests
import os
url = 'http://' + os.environ['COLAB_TPU_ADDR'].split(':')[0] + ':8475/requestversion/2.2.0-dev20200311'
resp = requests.post(url)
print(resp)
%pip install tf-nightly==2.2.0-dev20200311

From https://github.com/tensorflow/tensorflow/issues/34346
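For readers unsure what the request in that snippet targets, here is a small sketch of the URL construction on its own. The port 8475 and the requestversion path come from the snippet; the function name and the example address are hypothetical, since Colab sets the real COLAB_TPU_ADDR value at runtime.

```python
def version_request_url(tpu_addr, version):
    """Build the URL that the snippet posts to (helper name is hypothetical)."""
    host = tpu_addr.split(":")[0]  # keep only the host part of 'host:port'
    return "http://" + host + ":8475/requestversion/" + version


# Hypothetical address for illustration only; Colab provides the real value.
example_addr = "10.0.0.2:8470"
url = version_request_url(example_addr, "2.2.0-dev20200311")
print(url)
```

Note that the snippet uses the same version string in the request and in the pip install line, so the TPU side and the notebook side end up on the same build.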

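When you use Dataset.from_generator (or pass a generator to Keras, which calls it under the hood), the Dataset embeds the generator in a PyFunc op within its graph, and every time that op is invoked, it calls next on the generator and gets the resulting bytes. (Basically treating Python as a black box.)

This works fine when everything runs on the same machine, but the problem is that TPUs work by having a separate machine that controls the TPU (imaginatively called the TPU host controller. ^^), and you run things on the TPU by sending it a TensorFlow graph to execute. So the graph containing that PyFunc gets sent to the TPU, and the TPU can't execute it because there is no Python on the TPU host. (And even if there were, it wouldn't be the same interpreter with the same state as your local machine.) So it fails by telling you it can't execute the PyFunc op, but unfortunately not in a very clear way.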
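As a loose analogy only (this is not how TensorFlow ships graphs), plain Python shows the same limitation: a running generator's state lives in the local interpreter and cannot be serialized and handed to another process. The generator below is a hypothetical stand-in for frame_generator.

```python
import pickle


def frame_batches():
    """Hypothetical stand-in for a Python generator that yields training batches."""
    step = 0
    while True:
        yield step
        step += 1


gen = frame_batches()
first = next(gen)  # the generator now holds live state in this interpreter

# Try to serialize the generator, as one would have to in order to move it elsewhere.
try:
    pickle.dumps(gen)
    picklable = True
except TypeError:
    picklable = False

print(first, picklable)
```

The remote machine would need both a Python interpreter and this exact in-memory state to continue the iteration, which is the situation the PyFunc error is reporting.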
