I am trying to train a model built from chained blocks similar to Inception blocks. The model runs fine on my CPU (although it is a bit slow). When I try to train the same model on ml-engine, the job keeps running and no errors show up in the logs, but training makes no progress.
import tensorflow as tf

def model_fn(features, labels, mode):
    x = tf.reshape(features, [-1, 199, 199, 1])
    # mini inception block num 1: a 1x1 conv branch and a max-pool branch in parallel
    conv = tf.layers.conv2d(x, filters=16, kernel_size=1, padding='same', activation=tf.nn.relu)
    mp = tf.layers.max_pooling2d(x, pool_size=3, strides=1, padding='same')
    mp = tf.layers.conv2d(mp, filters=16, kernel_size=1, padding='same', activation=tf.nn.relu)
    concat = tf.concat([conv, mp], axis=3)
    # mini inception block num 2: same structure, fed by the previous concat
    conv = tf.layers.conv2d(concat, filters=32, kernel_size=1, padding='same', activation=tf.nn.relu)
    mp = tf.layers.max_pooling2d(concat, pool_size=3, strides=1, padding='same')
    mp = tf.layers.conv2d(mp, filters=32, kernel_size=1, padding='same', activation=tf.nn.relu)
    concat = tf.concat([conv, mp], axis=3)
    # ...
estimator = tf.estimator.Estimator(model_dir=model_dir, model_fn=model_fn)
train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn)
eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn)
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
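The train_input_fn and eval_input_fn are not shown above; here is a minimal sketch of what they could look like with an in-memory tf.data pipeline (the make_input_fn helper, data source, shapes, and batch size are illustrative assumptions, not taken from the original job):

import tensorflow as tf

# Hypothetical input pipeline builder; images are expected as [N, 199, 199]
# float arrays so that model_fn's reshape to [-1, 199, 199, 1] applies.
def make_input_fn(images, labels, batch_size=32, shuffle=True):
    def input_fn():
        ds = tf.data.Dataset.from_tensor_slices((images, labels))
        if shuffle:
            ds = ds.shuffle(buffer_size=1024)
        ds = ds.repeat().batch(batch_size)
        return ds.make_one_shot_iterator().get_next()
    return input_fn

# train_input_fn = make_input_fn(train_images, train_labels)
# eval_input_fn = make_input_fn(eval_images, eval_labels, shuffle=False)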
trainingInput:
  scaleTier: CUSTOM
  masterType: standard_p100
  workerType: complex_model_m_gpu
  parameterServerType: large_model
  workerCount: 1
  parameterServerCount: 1
(I have tried workerCount as high as 4 and parameterServerCount as high as 2.)
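With scaleTier: CUSTOM and at least one worker and parameter server, ml-engine runs train_and_evaluate in distributed mode: each replica learns its role from the TF_CONFIG environment variable, which tf.estimator.RunConfig parses automatically. A minimal sketch of inspecting it (the printed values are illustrative, not taken from the actual job):

import json
import os

# ml-engine sets TF_CONFIG on every replica; tf.estimator parses it to decide
# whether this process is the master, a worker, or a parameter server.
tf_config = json.loads(os.environ.get('TF_CONFIG', '{}'))
print(tf_config.get('cluster'))  # e.g. {'master': [...], 'worker': [...], 'ps': [...]}
print(tf_config.get('task'))     # e.g. {'type': 'worker', 'index': 0}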
Here is the bottom of the logs for that particular job:

Why does the training get stuck?

As a side note, some of the log statements Tensorflow emits appear to be mislabeled as errors when they should be info. Example:
ERROR: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job master -> {0 -> localhost:2222}
After some testing and experimentation, the ml-engine job does not seem to get stuck if I apply a simple tf.reshape() to the tf.concat() output of block 1 before using it as the input to block 2. The actual shape stays the same, but the job makes progress.
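For reference, a minimal sketch of that workaround inside model_fn; the explicit shape is inferred from block 1 (two 16-filter branches concatenated on the channel axis, with 'same' padding and stride 1 preserving the 199x199 spatial size):

    # Shape-preserving reshape between block 1 and block 2.
    # 32 channels = 16 (conv branch) + 16 (max-pool branch); the spatial dims
    # are unchanged because every op in block 1 uses padding='same', stride 1.
    concat = tf.reshape(concat, [-1, 199, 199, 32])
    # Block 2 then consumes `concat` exactly as before.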