Distributed TensorFlow in Kubeflow: NotFoundError

Asked: 2019-05-27 08:49:33

Tags: tensorflow kubeflow

I followed the tutorial to set up Kubeflow on GCP.

In the last step, after deploying the code, I started training on CPU with:

kustomize build . | kubectl apply -f -

The distributed TensorFlow job then fails with this error:

  

tensorflow.python.framework.errors_impl.NotFoundError: /tmp/tmprIn1Il/model.ckpt-1_temp_a890dac1971040119aba4921dd5f631a; No such file or directory
  [[Node: save/SaveV2 = SaveV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64], _device="/job:ps/replica:0/task:0/device:CPU:0"](save/ShardedFilename, save/SaveV2/tensor_names, save/SaveV2/shape_and_slices, conv_layer1/conv2d/bias, conv_layer1/conv2d/kernel, conv_layer2/conv2d/bias, conv_layer2/conv2d/kernel, dense/bias, dense/kernel, dense_1/bias, dense_1/kernel, global_step)]]

I found a similar bug report, but I don't know how to fix the problem.

1 Answer:

Answer 0: (score: 0)

From the bug report:

  

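You can work around this problem by using a shared file system (e.g. HDFS, GCS, or an NFS mounted at the same mount point) on the workers and the parameter servers.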

Simply putting the data on GCS made it work.

model.py

import tensorflow_datasets as tfds
import tensorflow as tf

# tfds works in both Eager and Graph modes
tf.enable_eager_execution()

# See available datasets
print(tfds.list_builders())

ds_train, ds_test = tfds.load(name="mnist", split=["train", "test"], data_dir="gs://kubeflow-tf-bucket", batch_size=-1)
ds_train = tfds.as_numpy(ds_train)
ds_test = tfds.as_numpy(ds_test)

(x_train, y_train) = ds_train['image'], ds_train['label']
(x_test, y_test) = ds_test['image'], ds_test['label']
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
  tf.keras.layers.Dense(512, activation=tf.nn.relu),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)
print(model.evaluate(x_test, y_test))
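
The error in the question points at a checkpoint under /tmp, which is local to each pod, so the quoted workaround presumably applies to the checkpoint directory as well as to the dataset. The following is a minimal sketch, not part of the original answer, of a guard that rejects a local checkpoint path before training starts. The environment variable name MODEL_DIR, the helper names, and the set of accepted schemes are all assumptions made for illustration.

```python
import os
from urllib.parse import urlparse

# Schemes treated as shared storage. This set is an assumption for
# illustration; an NFS mount looks like a plain local path and would
# need a separate allow-list of mount points.
SHARED_SCHEMES = {"gs", "hdfs"}


def is_shared_path(path):
    """Return True if the path uses a scheme listed in SHARED_SCHEMES."""
    return urlparse(path).scheme in SHARED_SCHEMES


def resolve_model_dir(env=None):
    """Read the checkpoint directory from MODEL_DIR (hypothetical name)
    and raise ValueError if it is missing or not on shared storage."""
    env = os.environ if env is None else env
    model_dir = env.get("MODEL_DIR", "")
    if not is_shared_path(model_dir):
        raise ValueError(
            "MODEL_DIR must be on shared storage (gs:// or hdfs://), "
            "got %r" % model_dir
        )
    return model_dir


# Possible use with an Estimator-based training script (not executed here):
#   config = tf.estimator.RunConfig(model_dir=resolve_model_dir())
```

With a check like this, a job configured with a path such as /tmp/tmprIn1Il would fail at startup with a clear message instead of failing later when the checkpoint is saved.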