TensorFlow: UnavailableError: OS Error when increasing the number of workers (distributed GPU mode)

Date: 2018-02-07 14:00:42

Tags: tensorflow

I am running TensorFlowOnSpark with 1 namenode and 4 datanodes. Each datanode has 4 TITAN Xp cards (so 16 GPUs in total).

The 'Hello, TensorFlow!' sanity check runs as follows on each datanode:

Python 3.5.2 (default, Feb  7 2018, 11:42:44) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> hello = tf.constant('Hello, TensorFlow!')
>>> sess = tf.Session()
2018-02-07 20:54:37.894085: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1206] Found device 0 with properties: 
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:03:00.0
totalMemory: 11.91GiB freeMemory: 11.71GiB
2018-02-07 20:54:38.239744: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1206] Found device 1 with properties: 
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:04:00.0
totalMemory: 11.91GiB freeMemory: 11.74GiB
2018-02-07 20:54:38.587283: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1206] Found device 2 with properties: 
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:83:00.0
totalMemory: 11.91GiB freeMemory: 11.74GiB
2018-02-07 20:54:38.922914: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1206] Found device 3 with properties: 
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:84:00.0
totalMemory: 11.91GiB freeMemory: 11.74GiB
2018-02-07 20:54:38.927574: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1221] Device peer to peer matrix
2018-02-07 20:54:38.927719: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1227] DMA: 0 1 2 3 
2018-02-07 20:54:38.927739: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1237] 0:   Y Y N N 
2018-02-07 20:54:38.927750: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1237] 1:   Y Y N N 
2018-02-07 20:54:38.927760: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1237] 2:   N N Y Y 
2018-02-07 20:54:38.927774: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1237] 3:   N N Y Y 
2018-02-07 20:54:38.927791: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1300] Adding visible gpu device 0
2018-02-07 20:54:38.927804: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1300] Adding visible gpu device 1
2018-02-07 20:54:38.927816: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1300] Adding visible gpu device 2
2018-02-07 20:54:38.927827: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1300] Adding visible gpu device 3
2018-02-07 20:54:40.194789: I tensorflow/core/common_runtime/gpu/gpu_device.cc:987] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11341 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:03:00.0, compute capability: 6.1)
2018-02-07 20:54:40.376314: I tensorflow/core/common_runtime/gpu/gpu_device.cc:987] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 11374 MB memory) -> physical GPU (device: 1, name: TITAN Xp, pci bus id: 0000:04:00.0, compute capability: 6.1)
2018-02-07 20:54:40.556361: I tensorflow/core/common_runtime/gpu/gpu_device.cc:987] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 11374 MB memory) -> physical GPU (device: 2, name: TITAN Xp, pci bus id: 0000:83:00.0, compute capability: 6.1)
2018-02-07 20:54:40.740179: I tensorflow/core/common_runtime/gpu/gpu_device.cc:987] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 11374 MB memory) -> physical GPU (device: 3, name: TITAN Xp, pci bus id: 0000:84:00.0, compute capability: 6.1)
>>> print(sess.run(hello))
b'Hello, TensorFlow!'
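
For reference, the same per-node check can be done without running any op, via the device_lib utility from TF 1.x (a minimal sketch on my side, assuming TensorFlow 1.x as in the log above); it should show the CPU plus GPU:0 through GPU:3 on each datanode:

# List the devices visible to this Python process (TensorFlow 1.x).
# On each datanode this should report one CPU device plus four GPUs.
from tensorflow.python.client import device_lib

local_devices = device_lib.list_local_devices()
gpu_names = [d.name for d in local_devices if d.device_type == 'GPU']
print('Visible GPUs:', gpu_names)  # expected: ['/device:GPU:0', ..., '/device:GPU:3']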

With this hardware setup, the MNIST example runs with 8 executors (so two executors land on each datanode). Some workers may hit the "OS Error" below, but after the automatic retries they enter the training state normally and the model is written out as expected in the end.

18/02/07 21:05:36 ERROR Executor: Exception in task 13.0 in stage 0.0 (TID 13)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/python3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1350, in _do_call
    return fn(*args)
  File "/usr/local/python3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1320, in _run_fn
    self._extend_graph()
  File "/usr/local/python3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1381, in _extend_graph
    self._session, graph_def.SerializeToString(), status)
  File "/usr/local/python3/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnavailableError: OS Error

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/disk0/yarn/usercache/zhanbohan/appcache/application_1518005864023_0002/container_1518005864023_0002_01_000016/pyspark.zip/pyspark/worker.py", line 177, in main
    process()
  File "/home/disk0/yarn/usercache/zhanbohan/appcache/application_1518005864023_0002/container_1518005864023_0002_01_000016/pyspark.zip/pyspark/worker.py", line 172, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/home/disk0/yarn/usercache/zhanbohan/appcache/application_1518005864023_0002/container_1518005864023_0002_01_000001/pyspark.zip/pyspark/rdd.py", line 2423, in pipeline_func
  File "/home/disk0/yarn/usercache/zhanbohan/appcache/application_1518005864023_0002/container_1518005864023_0002_01_000001/pyspark.zip/pyspark/rdd.py", line 2423, in pipeline_func
  File "/home/disk0/yarn/usercache/zhanbohan/appcache/application_1518005864023_0002/container_1518005864023_0002_01_000001/pyspark.zip/pyspark/rdd.py", line 2423, in pipeline_func
  File "/home/disk0/yarn/usercache/zhanbohan/appcache/application_1518005864023_0002/container_1518005864023_0002_01_000001/pyspark.zip/pyspark/rdd.py", line 2423, in pipeline_func
  File "/home/disk0/yarn/usercache/zhanbohan/appcache/application_1518005864023_0002/container_1518005864023_0002_01_000001/pyspark.zip/pyspark/rdd.py", line 346, in func
  File "/home/disk0/yarn/usercache/zhanbohan/appcache/application_1518005864023_0002/container_1518005864023_0002_01_000001/pyspark.zip/pyspark/rdd.py", line 794, in func

However, when the number of executors is raised to 16, the error above shows up much more often, no worker ever gets into training, and the job finally fails and exits. Since we have 16 GPUs, I expected the cluster to place each executor on its own GPU. In a single-node TensorFlow test, all GPUs are accessible. (According to the peer-to-peer matrix, DMA only works within the 0-1 and 2-3 pairs on the PCI board, but in my opinion that should not affect communication between executors.)
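
What I expect is that each executor effectively pins itself to a single GPU before creating its session, roughly like the sketch below (a hypothetical helper and worker signature, not the actual TensorFlowOnSpark API; the per-worker index and the modulo mapping are assumptions):

import os
import tensorflow as tf

def pin_single_gpu(executor_index, gpus_per_node=4):
    # Hypothetical helper: map this executor onto one of the node's GPUs via a
    # simple modulo (assumes worker indices land contiguously per node).
    # Must run before the first tf.Session() so TensorFlow never initializes
    # the other devices.
    gpu_id = executor_index % gpus_per_node
    os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)
    return gpu_id

def map_fun(args, ctx):             # TFoS-style worker entry point (sketch only)
    pin_single_gpu(ctx.worker_num)  # assumes ctx exposes a per-worker index
    sess = tf.Session()             # this process now sees exactly one GPU
    # ... rest of the mnist worker ...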

Another trial: after I had started the job with 8 executors and every GPU had started training, I started a second job with another 8 executors. The second job still fails, in the same way as the 16-executor case.
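
In case it is relevant: as far as I understand, a default tf.Session() in TF 1.x reserves memory on every GPU it can see, so the second job's workers would be competing with the first job's for the same devices. The standard knobs for that look roughly like this (a sketch of the stock TF 1.x options, not something the attached run script sets):

import tensorflow as tf

# Stock TF 1.x GPU options (sketch): grow memory on demand and/or expose only
# a subset of the node's GPUs to this session, instead of the default
# grab-everything behaviour.
gpu_options = tf.GPUOptions(
    allow_growth=True,        # allocate GPU memory as needed rather than all at once
    visible_device_list='0',  # optionally restrict this session to GPU 0
)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))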

The log of the 16-executor run is attached together with the run script. Please check it and help me figure out what is going on. Thanks.

log.16.txt run_mnist.sh.txt

0 Answers:

No answers yet.