我正在尝试使用Google Colab中的TPU创建和训练我的CNN模型。我打算将其用于对猫和猫进行分类。该模型可以在GPU / CPU运行时上运行,但是在TPU运行时上运行时遇到了麻烦。这是创建模型的代码。
我使用flow_from_directory()函数输入我的数据集,这是它的代码
train_datagen = ImageDataGenerator(rescale=1./255)
train_generator = train_datagen.flow_from_directory(
MAIN_DIR,
target_size = (128,128),
batch_size = 50,
class_mode = 'binary'
)
def create_model():
model=Sequential()
model.add(Conv2D(32,(3,3),activation='relu',input_shape=(128,128,3)))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.25))
model.add(Conv2D(64,(3,3),activation='relu'))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.25))
model.add(Conv2D(128,(3,3),activation='relu'))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(512,activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.5))
model.add(Dense(2,activation='softmax'))
return model
这是用于在Google Colab上初始化TPU的代码
tf.keras.backend.clear_session()
resolver = tf.distribute.cluster_resolver.TPUClusterResolver('grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.config.experimental_connect_to_cluster(resolver)
# This is the TPU initialization code that has to be at the beginning.
tf.tpu.experimental.initialize_tpu_system(resolver)
print("All devices: ", tf.config.list_logical_devices('TPU'))
strategy = tf.distribute.experimental.TPUStrategy(resolver)
with strategy.scope():
model = create_model()
model.compile(
optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3, ),
loss='sparse_categorical_crossentropy',
metrics=['sparse_categorical_accuracy'])
model.fit(
train_generator,
epochs = 5,
)
但是当我运行这段代码时,我被这个错误所吸引:
UnavailableError Traceback (most recent call last)
<ipython-input-15-1970b3405ba3> in <module>()
20 model.fit(
21 train_generator,
---> 22 epochs = 5,
23
24 )
14 frames
/usr/local/lib/python3.6/dist-packages/six.py in raise_from(value, from_value)
UnavailableError: 5 root error(s) found.
(0) Unavailable: {{function_node __inference_train_function_42823}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1598016644.748265484","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3948,"referenced_errors":[{"created":"@1598016644.748262999","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":394,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[cond_11/switch_pred/_107/_78]]
(1) Unavailable: {{function_node __inference_train_function_42823}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1598016644.748265484","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3948,"referenced_errors":[{"created":"@1598016644.748262999","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":394,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[cond_12/switch_pred/_118/_82]]
(2) Unavailable: {{function_node __inference_train_function_42823}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1598016644.748265484","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3948,"referenced_errors":[{"created":"@1598016644.748262999","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":394,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[TPUReplicate/_compile/_7955920754087029306/_4/_266]]
(3) Unavailable: {{function_node __inference_train_function_42823}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1598016644.748265484","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3948,"referenced_errors":[{"created":"@1598016644.748262999","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":394,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[Shape_7/_104]]
(4) Unavailable: {{functi ... [truncated]
我真的不知道该如何解决。我也不知道这些错误是什么意思。
答案 0 :(得分:1)
您遇到了 TPU 的一个已知问题 - 它们不支持 PyFunction。此处的详细信息:#38762、#34346、#39099:
<块引用>抱歉这个问题。 Dataset.from_generator 预计不适用于 TPU,因为它在下面使用 py_function,这与 Cloud TPU 2VM 设置不兼容。如果您想从大型数据集中读取数据,不妨尝试将其具体化到磁盘上并改用 TFRecordDataest。
由于 ImageDataGenerator
也在幕后使用 PyFunction,因此它与 TPU 不兼容。相反,您必须使用 tf.data API 来加载图像。 This tutorial 说明了如何操作。