Question

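I am trying to train a model built from chained Inception-like blocks. The model runs fine on my CPU (although it is a bit slow). When I try to train the same model on ml-engine, the job keeps running and no errors appear in the logs, but training makes no progress.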

My model

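import tensorflow as tf  # TF 1.x API (tf.layers, tf.estimator)

def model_fn(features, labels, mode):
  # Reshape the flat input into NHWC: 199x199 single-channel images.
  x = tf.reshape(features, [-1, 199, 199, 1])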

  # mini inception block num 1
  conv = tf.layers.conv2d(x, filters=16, kernel_size=1, padding='same', activation=tf.nn.relu)
  mp = tf.layers.max_pooling2d(x, pool_size=3, strides=1, padding='same')
  mp = tf.layers.conv2d(mp, filters=16, kernel_size=1, padding='same', activation=tf.nn.relu)
  concat = tf.concat([conv, mp], axis=3)

  # mini inception block num 2
  conv = tf.layers.conv2d(concat, filters=32, kernel_size=1, padding='same', activation=tf.nn.relu)
  mp = tf.layers.max_pooling2d(concat, pool_size=3, strides=1, padding='same')
  mp = tf.layers.conv2d(mp, filters=32, kernel_size=1, padding='same', activation=tf.nn.relu)
  concat = tf.concat([conv, mp], axis=3)

  # ...

My trainer task

estimator = tf.estimator.Estimator(model_dir=model_dir, model_fn=model_fn)
train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn)
eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn)

tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

My ml-engine config

trainingInput:
 scaleTier: CUSTOM
 masterType: standard_p100
 workerType: complex_model_m_gpu
 parameterServerType: large_model
 workerCount: 1
 parameterServerCount: 1

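(I have tried workerCount up to 4 and parameterServerCount up to 2.)

Here is the bottom of the log for one particular job:

Question

Why does the training get stuck?

As a side note, some log statements emitted by TensorFlow seem to be mislabeled as errors when they should be info. Example: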

ERROR: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job master -> {0 -> localhost:2222}
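
To illustrate the mislabeling, here is a minimal sketch that reads the severity letter at the start of a TensorFlow log payload. It assumes the usual convention where the leading letter is I, W, E, or F (info, warning, error, fatal). The function name `tf_severity` and the regular expression are hypothetical, written only to match the sample line above.

```python
import re

# Severity letters assumed from TensorFlow's C++ logging prefix.
SEVERITY = {"I": "INFO", "W": "WARNING", "E": "ERROR", "F": "FATAL"}

# Assumed layout: "<letter> <source file>:<line>] <message>".
PAYLOAD_RE = re.compile(r"^([IWEF]) (\S+):(\d+)\] (.*)$")


def tf_severity(payload):
    """Return the severity name encoded in the payload, or None if it does not match."""
    match = PAYLOAD_RE.match(payload.strip())
    return SEVERITY[match.group(1)] if match else None


sample = ("I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] "
          "Initialize GrpcChannelCache for job master -> {0 -> localhost:2222}")
print(tf_severity(sample))
```

For the sample line the payload itself carries the letter I, even though the surrounding log entry is labeled as an error.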

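Edit

After some testing and experimentation, it seems that the ml-engine job does not get stuck if I apply a simple tf.reshape() to the tf.concat() output of block 1 before using it as the input of block 2. The actual shape stays the same, but the job proceeds.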
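
A minimal sketch of why the reshape should be a no-op as far as shapes and values go. It uses NumPy as a stand-in for the TensorFlow ops so that it runs without a TF 1.x installation; the batch size of 2, the helper name `concat_then_reshape`, and the exact target shape passed to the reshape are assumptions, since the post does not show the reshape call itself.

```python
import numpy as np


def concat_then_reshape(conv, mp):
    """Concatenate two branch outputs on the channel axis, then reshape to the same shape."""
    concat = np.concatenate([conv, mp], axis=3)
    # Assumed TensorFlow equivalent for block 1:
    #   concat = tf.reshape(concat, [-1, 199, 199, 32])
    # The -1 lets the batch dimension be inferred.
    _, height, width, channels = concat.shape
    reshaped = np.reshape(concat, (-1, height, width, channels))
    return concat, reshaped


# Hypothetical batch of 2; 199x199 inputs and 16 filters per branch, as in block 1.
conv = np.zeros((2, 199, 199, 16), dtype=np.float32)
mp = np.ones((2, 199, 199, 16), dtype=np.float32)
concat, reshaped = concat_then_reshape(conv, mp)
print(concat.shape, reshaped.shape)
```

Both arrays have the same shape and contents, which is consistent with the observation that the reshape changes nothing about the data. Why it unblocks the ml-engine job is not explained by this sketch.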

Training of Inception-like blocks in Google ML Engine becomes unresponsive
