I'm trying to train a simple estimator on Google Machine Learning Engine across two or more machines, and I can't figure out how to instruct device placement.
import tensorflow as tf

def model_fn(features, labels, mode, params):
    x = tf.reshape(features, [-1, 199, 199, 3])
    with tf.device('/cpu:0'):
        conv1 = tf.layers.conv2d(x, filters=16, kernel_size=3,
                                 activation=tf.nn.relu)
        pool1 = tf.layers.max_pooling2d(conv1, pool_size=[2, 2], strides=2)
    with tf.device('/cpu:0'):
        conv2 = tf.layers.conv2d(pool1, filters=32, kernel_size=3,
                                 activation=tf.nn.relu)
        pool2 = tf.layers.max_pooling2d(conv2, pool_size=[2, 2], strides=2)
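For context, this is the placement scheme I expected the Estimator framework to apply on its own, instead of my hand-written '/cpu:0' pins: a minimal sketch of tf.train.replica_device_setter, which assigns variables to the parameter server and other ops to the worker. The host addresses here are hypothetical; on ML Engine the cluster spec comes from the TF_CONFIG environment variable. (Written against the tf.compat.v1 graph-mode API so it also runs under TF 2.x.)

```python
import tensorflow as tf

tf1 = tf.compat.v1  # graph-mode API, available in TF 1.13+ and TF 2.x

# Hypothetical cluster for illustration; on ML Engine this comes
# from the TF_CONFIG environment variable, not hard-coded addresses.
cluster = tf1.train.ClusterSpec({
    'ps': ['ps0:2222'],
    'worker': ['worker0:2222'],
})

with tf1.Graph().as_default():
    # replica_device_setter returns a device function: variables land on
    # the parameter-server job, every other op on the worker job.
    with tf1.device(tf1.train.replica_device_setter(cluster=cluster)):
        weights = tf1.get_variable('weights', shape=[3, 16])  # -> /job:ps
        logits = tf1.matmul(tf1.zeros([1, 3]), weights)       # -> /job:worker

print(weights.device)
print(logits.device)
```

Devices are assigned at graph-construction time, so this runs without any servers actually listening on those ports.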
The training input configuration:

trainingInput:
scaleTier: CUSTOM
masterType: standard_gpu
workerType: standard_gpu
parameterServerType: large_model
workerCount: 1
parameterServerCount: 1
The job is created, but soon fails with the following in the error logs (and nothing is saved to model_dir):
The error message is identical for each device (worker-replica-0, master-replica-0, and ps-replica-0):
# ps and worker: -------------v
message: " grpc epoll fd: 3"
# master ---------------------v
message: " grpc epoll fd: 4"
I have removed the with tf.device statements and still see the same error. I have also tried deploying with different scale tiers, but it fails on any machine type (CPU or GPU).
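To rule out the cloud environment, I considered reproducing the cluster wiring locally. This is a minimal sketch of building the TF_CONFIG environment variable by hand, with hypothetical localhost ports; on ML Engine the service populates this variable for each replica, so setting it manually is only for local experiments:

```python
import json
import os

# Hypothetical local cluster mirroring the CUSTOM scale tier above:
# one master, one worker, one parameter server. The ports are arbitrary.
tf_config = {
    'cluster': {
        'master': ['localhost:2222'],
        'worker': ['localhost:2223'],
        'ps': ['localhost:2224'],
    },
    # Which role THIS process plays; each local process gets its own task.
    'task': {'type': 'master', 'index': 0},
}
os.environ['TF_CONFIG'] = json.dumps(tf_config)
```

A tf.estimator.Estimator started in a process with this variable set should pick up the cluster spec and task automatically, the same way it does on ML Engine.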