Enabling XLA JIT with multi-GPU from tf.slim

Time: 2017-07-31 09:54:57

Tags: tensorflow jit multi-gpu xla

I enabled XLA in tf.slim with multi-GPU (2 Titan Xp) as shown below (by editing train_image_classifier.py):

    jit_config = tf.ConfigProto()
    jit_level = tf.OptimizerOptions.ON_1
    jit_config.graph_options.optimizer_options.global_jit_level = jit_level

    ###########################
    # Kicks off the training. #
    ###########################

    slim.learning.train(
        ...  # (same as the original arguments)
        sync_optimizer=optimizer if FLAGS.sync_replicas else None,
        session_config=jit_config)
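For reference, the session config above can be assembled as a self-contained sketch (TF 1.x API). The `allow_growth` line is my addition, not part of the original post; it is a common mitigation for BFC-allocator OOM warnings like the ones shown further down:

```python
import tensorflow as tf

# Session config that turns on global XLA JIT compilation (TF 1.x API).
jit_config = tf.ConfigProto()
jit_config.graph_options.optimizer_options.global_jit_level = (
    tf.OptimizerOptions.ON_1)

# Assumption (not in the original post): let the BFC allocator grow GPU
# memory on demand instead of reserving it all up front, which can reduce
# "ran out of memory" warnings when running with num_clones=2.
jit_config.gpu_options.allow_growth = True

# The config is then passed to tf.slim training, e.g.:
# slim.learning.train(..., session_config=jit_config)
```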

Then I ran it with the following command:

>> python train_image_classifier.py \
--train_dir=/tmp/imagenet_train --dataset_name=imagenet \
--dataset_split_name=train --dataset_dir=$DATA_DIR \
--model_name=inception_v3 --max_number_of_steps=5000 \
--batch_size=64 --num_clones=2

However, I got these error messages:

2017-07-31 18:34:05.231408: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_1_bfc) ran out of memory trying to allocate 146.34MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2017-07-31 18:34:05.246182: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_1_bfc) ran out of memory trying to allocate 1.48GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
......
2017-07-31 18:34:06.311713: E tensorflow/stream_executor/cuda/cuda_driver.cc:1068] failed to synchronize the stop event: CUDA_ERROR_ILLEGAL_ADDRESS
2017-07-31 18:34:06.311761: E tensorflow/stream_executor/cuda/cuda_timer.cc:54] Internal: error destroying CUDA event in context 0xd25d400: CUDA_ERROR_ILLEGAL_ADDRESS
2017-07-31 18:34:06.311790: E tensorflow/stream_executor/cuda/cuda_timer.cc:59] Internal: error destroying CUDA event in context 0xd25d400: CUDA_ERROR_ILLEGAL_ADDRESS
2017-07-31 18:34:06.311855: E tensorflow/stream_executor/cuda/cuda_timer.cc:54] Invalid argument: input event cannot be null
2017-07-31 18:34:06.311869: E tensorflow/stream_executor/cuda/cuda_timer.cc:59] Invalid argument: input event cannot be null
2017-07-31 18:34:06.311942: E tensorflow/stream_executor/cuda/cuda_timer.cc:54] Invalid argument: input event cannot be null
2017-07-31 18:34:06.311959: E tensorflow/stream_executor/cuda/cuda_timer.cc:59] Invalid argument: input event cannot be null
2017-07-31 18:34:06.311973: E tensorflow/compiler/xla/service/gpu/convolution_thunk.cc:325] No convolution algorithm works with profiling. Fall back to the default algorithm.
2017-07-31 18:34:06.311983: E tensorflow/compiler/xla/service/gpu/convolution_thunk.cc:334] No convolution algorithm without scratch works with profiling. Fall back to the default algorithm.
2017-07-31 18:34:06.312037: F tensorflow/stream_executor/cuda/cuda_dnn.cc:2877] failed to enqueue convolution on stream: CUDNN_STATUS_EXECUTION_FAILED
Aborted (core dumped)

Previously, the XLA JIT ran fine with the flags (--batch_size=32 --num_clones=1). I suspect there is a bug in XLA's buffer allocation. Can anyone tell me what I am doing wrong?

0 Answers:

There are no answers yet.