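I am running distributed training with tf.estimator.train_and_evaluate(...), with the first worker acting as chief and the second worker as the evaluator. The cluster, shown below, has 8 workers and 2 ps.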
{
"cluster": {
"ps": ["100.77.4.147:61415", "100.77.14.144:52383"],
"chief": ["100.77.14.144:49606"],
"worker": ["100.110.22.203:28312", "100.77.4.147:32299", "100.77.4.147:4950", "100.110.22.203:22196", "100.110.22.203:39327", "100.77.14.144:32888", "100.77.4.147:26919"]
},
"task": {
"index": 0,
"type": "evaluator"
}
}
The other workers are assigned task indices from 0 up to the last one.
However, errors occur at runtime:
// the chief node reports the following errors
CreateSession failed because worker /job:worker/replica:0/task:1 returned error: Unavailable: OS Error
CreateSession failed because worker /job:worker/replica:0/task:2 returned error: Unavailable: OS Error
CreateSession failed because worker /job:worker/replica:0/task:3 returned error: Unavailable: OS Error
I then checked the other workers and found the following errors:
CreateSession still waiting for response from worker: /job:worker/replica:0/task:5
CreateSession still waiting for response from worker: /job:worker/replica:0/task:0
CreateSession still waiting for response from worker: /job:worker/replica:0/task:1
...
Did I set up the cluster_spec incorrectly? Thanks.
Answer 0 (score: 0)
Update:
It finally works. The evaluator should not be included in the worker list. Just FYI.
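To illustrate the fix, here is a minimal sketch of how the TF_CONFIG value for each process might be generated so that the evaluator's address never ends up in the cluster spec. The helper name build_tf_config, the evaluator_address parameter, and the host0..host5 addresses are all hypothetical, not part of TensorFlow or of the original setup. It only uses the standard json module.

```python
import json


def build_tf_config(cluster, task_type, task_index, evaluator_address=None):
    """Return a TF_CONFIG JSON string for one process (hypothetical helper).

    The evaluator is not part of the training cluster, so it must not
    appear as a job in `cluster`, and its address must not be listed
    under any job such as "worker".
    """
    if "evaluator" in cluster:
        raise ValueError("the cluster spec must not contain an 'evaluator' job")
    if evaluator_address is not None:
        for job, addresses in cluster.items():
            if evaluator_address in addresses:
                raise ValueError(
                    "evaluator address %s is listed under job '%s'"
                    % (evaluator_address, job)
                )
    return json.dumps(
        {"cluster": cluster, "task": {"type": task_type, "index": task_index}}
    )


# Placeholder addresses, assumed for illustration only.
cluster = {
    "chief": ["host0:2222"],
    "ps": ["host1:2222", "host2:2222"],
    "worker": ["host3:2222", "host4:2222"],
}

# The evaluator gets the same cluster spec, but its own address
# (host5:2222 here) appears nowhere in it.
evaluator_config = build_tf_config(
    cluster, "evaluator", 0, evaluator_address="host5:2222"
)
worker_config = build_tf_config(cluster, "worker", 1)
```

Each process would then export its own string as the TF_CONFIG environment variable before the estimator is constructed. The check raises an error if the evaluator's address is accidentally listed under a cluster job, which is the mistake described above.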