tensorflow.python.framework.errors.InvalidArgumentError: WhereOp: Race condition between counting the number of true elements and writing them

Time: 2016-08-24 18:26:24

Tags: python tensorflow gpu distributed

I am running distributed TensorFlow (link) Deep MNIST with asynchronous data parallelism on GPUs across different machines. I use a batch_size of 1000 images per epoch, and I am running 1000 epochs. My architecture has one ps on machine 1, the chief worker (task_index=0) on machine 2, and worker task_index=1 on machine 1.
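
For reference, the cluster described above corresponds roughly to the following setup (a minimal sketch assuming the standard tf.train.ClusterSpec/tf.train.Server pattern; the host:port pairs are the ones that appear in the log below):

import tensorflow as tf

# Hosts taken from the GrpcChannelCache lines in the log below.
cluster = tf.train.ClusterSpec({
    "ps":     ["172.25.1.127:2223"],                    # machine 1
    "worker": ["172.25.1.108:2222", "localhost:2222"],  # chief (task 0) on machine 2, task 1 on machine 1
})

# Each process starts one in-process server for its own job/task,
# e.g. the worker with task_index=1 on machine 1:
server = tf.train.Server(cluster, job_name="worker", task_index=1)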

I really don't understand why this error sometimes appears when I run worker task_index=1 on machine 1. If I run it again a few times, the error does not persist.

I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so.7.5 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so.7.5 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so.7.5 locally
cannot import name hashtable
cannot import name hashtable
cannot import name hashtable
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:118] Found device 0 with properties: 
name: GeForce GTX 750 Ti
major: 5 minor: 0 memoryClockRate (GHz) 1.124
pciBusID 0000:02:00.0
Total memory: 2.00GiB
Free memory: 125.53MiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:138] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:148] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:868] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 750 Ti, pci bus id: 0000:02:00.0)
E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to allocate 125.53M (131629056 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:206] Initialize HostPortsGrpcChannelCache for job ps -> {172.25.1.127:2223}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:206] Initialize HostPortsGrpcChannelCache for job worker -> {172.25.1.108:2222, localhost:2222}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:203] Started server with target: grpc://localhost:2222
Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
Traceback (most recent call last):
  File "dist_trainer_deepMnist_v2.py", line 205, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
    sys.exit(main(sys.argv))
  File "dist_trainer_deepMnist_v2.py", line 157, in main
    with sv.managed_session(server.target) as sess:
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 969, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 797, in stop
    stop_grace_period_secs=self._stop_grace_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 386, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 958, in managed_session
    start_standard_services=start_standard_services)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 722, in prepare_or_wait_for_session
    max_wait_secs=max_wait_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 351, in wait_for_session
    is_ready, not_ready_msg = self._model_ready(sess)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 437, in _model_ready
    return self._ready(self._ready_op, sess, "Model not ready")
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 406, in _ready
    ready_value = sess.run(op)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 710, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 908, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 958, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 978, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.InvalidArgumentError: WhereOp: Race condition between counting the number of true elements and writing them.  When counting, saw 2402 elements; but when writing their indices, saw 19 elements.
     [[Node: report_uninitialized_variables/Where = Where[_device="/job:ps/replica:0/task:0/cpu:0"](report_uninitialized_variables/Reshape_1)]]
Caused by op u'report_uninitialized_variables/Where', defined at:
  File "dist_trainer_deepMnist_v2.py", line 205, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
    sys.exit(main(sys.argv))
  File "dist_trainer_deepMnist_v2.py", line 153, in main
    save_model_secs=600)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 310, in __init__
    ready_op=ready_op, ready_for_local_init_op=ready_for_local_init_op)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 399, in _init_ready_op
    ready_op = variables.report_uninitialized_variables()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 1031, in report_uninitialized_variables
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/array_ops.py", line 888, in boolean_mask
    return _apply_mask_1d(tensor, mask)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/array_ops.py", line 863, in _apply_mask_1d
    indices = squeeze(where(mask), squeeze_dims=[1])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 2663, in where
    result = _op_def_lib.apply_op("Where", input=input, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 703, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2333, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1252, in __init__
    self._traceback = _extract_stack()

I have tried to figure out what is causing this error, but I cannot find the bug. One clue is this part of the error message:

tensorflow.python.framework.errors.InvalidArgumentError: WhereOp: Race condition between counting the number of true elements and writing them.  When counting, saw 2402 elements; but when writing their indices, saw 19 elements.
 [[Node: report_uninitialized_variables/Where = Where[_device="/job:ps/replica:0/task:0/cpu:0"](report_uninitialized_variables/Reshape_1)]]
Caused by op u'report_uninitialized_variables/Where',
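
The ready_op comes from the tf.train.Supervisor created around lines 153-157 of dist_trainer_deepMnist_v2.py. Roughly (a sketch reconstructed from the traceback; only save_model_secs=600 and managed_session(server.target) are visible there, everything else is an assumption):

import tensorflow as tf

task_index = 1  # this run: worker task_index=1 on machine 1

sv = tf.train.Supervisor(is_chief=(task_index == 0),
                         logdir="/tmp/mnist_train_logs",  # hypothetical path
                         save_model_secs=600)             # visible in the traceback

# server: the tf.train.Server created as in the sketch above
with sv.managed_session(server.target) as sess:
    # The Supervisor's ready_op (report_uninitialized_variables -> Where,
    # placed on the ps task) is evaluated while waiting for the model to be
    # ready; that sess.run(op) call is where the InvalidArgumentError is raised.
    pass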

Another clue is

E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to allocate 125.53M (131629056 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY

which appears when I run the ps on the same machine that is also using the GPU. It would be great if someone could explain why this happens.
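
If the shared GPU is indeed the problem, one possible mitigation would be to keep the ps process off the GPU, or let the worker allocate GPU memory incrementally (a sketch only, not what my script currently does; the tf.ConfigProto options are standard, but the choice of which process gets which config is an assumption):

import tensorflow as tf

cluster = tf.train.ClusterSpec({"ps": ["172.25.1.127:2223"],
                                "worker": ["172.25.1.108:2222", "localhost:2222"]})

# In the ps process on machine 1: start the server without claiming the GPU at all.
ps_config = tf.ConfigProto(device_count={"GPU": 0})
ps_server = tf.train.Server(cluster, job_name="ps", task_index=0, config=ps_config)

# In the worker process on the same machine: grow the GPU allocation as needed
# instead of grabbing all free memory up front.
worker_config = tf.ConfigProto(gpu_options=tf.GPUOptions(allow_growth=True))
worker_server = tf.train.Server(cluster, job_name="worker", task_index=1,
                                config=worker_config)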

0 Answers:

No answers yet