Using TPU on GKE: Error recorded from infeed: Socket closed

Asked: 2019-05-16 18:26:45

Tags: tensorflow google-kubernetes-engine tpu

My TPUEstimator-based training jobs that use TPUs on GKE sometimes fail with:

Error recorded from infeed: Socket closed
An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. This error may also occur due to a gRPC failure caused by high memory or network bandwidth usage in the parameter servers. If this error occurs repeatedly, try increasing the number of parameter servers assigned to the job. Error: Socket closed

I have two questions about this:

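  1. What is going on here? I checked the Pod's memory usage and saw no spike. The TPU assigned to the Pod is still there.
  2. The job does not always surface the error to the Pod. It keeps showing as Running unless someone manually checks the status and restarts it. Is there any way to make it always restart automatically?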
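
For the second question, one direction I can think of is to wrap the training command so that the container itself exits non-zero as soon as the error line shows up in the logs, and let Kubernetes restart it (via a Job, or a Pod with `restartPolicy: OnFailure`). This is only a minimal sketch and not a confirmed fix: the function names `contains_infeed_error` and `run_and_watch` are hypothetical, and it assumes the error text quoted above is actually written to the process's stdout or stderr.

```python
import subprocess
import sys

# Marker taken from the error message quoted above.
ERROR_MARKER = "Error recorded from infeed"


def contains_infeed_error(line):
    """Return True if a log line contains the infeed error marker."""
    return ERROR_MARKER in line


def run_and_watch(cmd):
    """Run cmd, echo its output, and return an exit code.

    Returns 1 (after killing the child) as soon as the marker is seen,
    otherwise returns the child's own exit code.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # TensorFlow logs usually go to stderr
        text=True,
    )
    for line in proc.stdout:
        sys.stdout.write(line)  # keep the logs visible in `kubectl logs`
        if contains_infeed_error(line):
            proc.kill()
            proc.wait()
            return 1
    return proc.wait()


# Intended use as the container entrypoint (hypothetical):
#   sys.exit(run_and_watch(["python", "train.py"]))
```

With this, a hung job would turn into a failed container that Kubernetes can restart, provided training resumes from the latest checkpoint in `model_dir`. It does not explain why the socket is closed in the first place.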

0 answers:

No answers yet