TensorFlow CNN training stops for no apparent reason

Time: 2019-03-04 17:17:14

Tags: python tensorflow neural-network conv-neural-network

I have this Python code for training a CNN with TensorFlow. It worked fine until I decided to use MirroredStrategy to improve performance. That required a few changes:

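  1. I had to stop using tf.estimator.inputs.numpy_input_fn(), because MirroredStrategy requires the input function to return a Dataset.
  2. I basically copied train_input_fn from the Datasets for Estimators guide to use as the input function.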

Now it appears to start normally, but it doesn't train anything: after a few seconds it stops running, without any error.

More details:

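  1. I monitor CPU and GPU usage during training. When I run the code, the GPU frequency goes up and stays up until the program stops running.
  2. I have tried changing the batch size, but nothing changed.
  3. My GPU is a GTX 1070 with up-to-date drivers, and my environment uses tensorflow-gpu version 1.12.0.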

Kernel output:

In [1]: runfile('D:/MS/CNN.py', wdir='D:/MS')
INFO:tensorflow:Initializing RunConfig with distribution strategies.
INFO:tensorflow:Not using Distribute Coordinator.
INFO:tensorflow:Using config: {'_model_dir': 'models3', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': <tensorflow.contrib.distribute.python.mirrored_strategy.MirroredStrategy object at 0x0000026729AC1320>, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x0000026729AC14A8>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_distribute_coordinator_mode': None}
INFO:tensorflow:Device is available but not used by distribute strategy: /device:CPU:0
INFO:tensorflow:Configured nccl all-reduce.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:batch_all_reduce invoked for batches size = 16 with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10

In [1]:
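
The prompt coming back as `In [1]:` suggests the IPython kernel may have restarted, in which case any message from a native crash would be lost. One way to check, sketched below, is to run the script in a separate Python process and inspect its exit code and stderr. The helper name `run_and_report` is hypothetical; the script path is the one shown in the output above.

```python
import subprocess
import sys


def run_and_report(script_args, timeout=None):
    """Run a Python script in a child process.

    Returns (exit_code, stdout, stderr). The -X faulthandler option makes
    the interpreter dump a Python traceback on fatal signals such as a
    segmentation fault.
    """
    cmd = [sys.executable, "-X", "faulthandler"] + list(script_args)
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        timeout=timeout,
    )
    return result.returncode, result.stdout, result.stderr


if __name__ == "__main__":
    # Path taken from the kernel output above; adjust as needed.
    code, out, err = run_and_report(["D:/MS/CNN.py"])
    print("exit code:", code)
    print(err[-2000:])  # tail of stderr, where a crash trace would appear
```

A non-zero exit code with a traceback or a native error message in stderr would point to a crash inside the process rather than a normal end of training.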

Training code:

import tensorflow as tf  # tensorflow-gpu 1.12.0


def train_input_fn(features, labels, batch_size):
    # Convert the inputs to a Dataset.
    dataset = tf.data.Dataset.from_tensor_slices((features, labels))
    # Shuffle, repeat, and batch the examples.
    dataset = dataset.shuffle(1000).repeat(count=100).batch(batch_size)
    # Return the dataset.
    return dataset


# cria_rede (the model_fn), x_data and y_data are defined elsewhere;
# x_data and y_data are two-dimensional numpy.ndarrays.
gpu_distribution = tf.contrib.distribute.MirroredStrategy()
run_config = tf.estimator.RunConfig(train_distribute=gpu_distribution)
classifier = tf.estimator.Estimator(model_fn=cria_rede, config=run_config, model_dir='models3')
classifier.train(input_fn=lambda: train_input_fn(x_data, y_data, 2), steps=3000)
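
One thing worth ruling out in the input pipeline: with `repeat(count=100)` the dataset is finite, so training ends early if it runs out of batches before `steps` is reached. The sketch below computes how many batches the pipeline above can yield for a single replica. `num_examples` is a hypothetical value, since the post does not give the size of `x_data`. This alone would probably not explain an exit after a few seconds with no log message, so treat it as a sanity check only.

```python
import math


def batches_available(num_examples, repeat_count, batch_size):
    """Number of batches produced by .repeat(repeat_count).batch(batch_size).

    repeat() concatenates the epochs before batching, so batches can span
    epoch boundaries and the final partial batch is still emitted
    (drop_remainder defaults to False).
    """
    return math.ceil(num_examples * repeat_count / batch_size)


# Hypothetical dataset size; replace with len(x_data).
num_examples = 50
available = batches_available(num_examples, repeat_count=100, batch_size=2)
print(available, "batches available vs. 3000 requested steps")
```

Calling `repeat()` with no count makes the dataset repeat indefinitely, which leaves `steps` as the only stopping condition.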

0 answers