TensorFlow CNN training stops for no apparent reason

Date: 2019-03-04 17:17:14

Tags: python tensorflow neural-network conv-neural-network

I have this Python code that trains a CNN with TensorFlow. It worked fine until I decided to use MirroredStrategy to improve performance. That required a few changes:

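I had to stop using tf.estimator.inputs.numpy_input_fn(), because MirroredStrategy requires the input function to return a Dataset.
I basically copied train_input_fn from the "Datasets for Estimators" guide and use it as my input function.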

Now it appears to start normally, but it does not train anything: after a few seconds it stops running, without any error.

More details:

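I monitor CPU and GPU usage during training; when I run the code, the GPU clock frequency rises and stays up until the program stops.
I have tried changing the batch size, but it made no difference.
My GPU is a GTX 1070 with up-to-date drivers, and my environment uses tensorflow-gpu version 1.12.0.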

Kernel output:

In [1]: runfile('D:/MS/CNN.py', wdir='D:/MS')
INFO:tensorflow:Initializing RunConfig with distribution strategies.
INFO:tensorflow:Not using Distribute Coordinator.
INFO:tensorflow:Using config: {'_model_dir': 'models3', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': <tensorflow.contrib.distribute.python.mirrored_strategy.MirroredStrategy object at 0x0000026729AC1320>, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x0000026729AC14A8>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_distribute_coordinator_mode': None}
INFO:tensorflow:Device is available but not used by distribute strategy: /device:CPU:0
INFO:tensorflow:Configured nccl all-reduce.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:batch_all_reduce invoked for batches size = 16 with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10

In [1]:
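The output above ends without a Python traceback, and the prompt comes back as In [1]: rather than In [2]:, which may mean the kernel process itself died and was restarted. As a diagnostic sketch only (not a fix), the standard-library faulthandler module can write the Python stack to a file if native code crashes. The helper name run_with_crash_log and the file name crash.log are hypothetical, chosen for illustration:

```python
import faulthandler
import traceback


def run_with_crash_log(train_fn, crash_log_path="crash.log"):
    """Run train_fn and record failures in crash_log_path.

    Returns True if train_fn finished, False if it raised a Python exception.
    """
    with open(crash_log_path, "w") as crash_log:
        # On a fatal signal (for example a segfault in native code),
        # faulthandler dumps the Python stack of every thread to this file.
        faulthandler.enable(file=crash_log, all_threads=True)
        try:
            train_fn()
            return True
        except Exception:
            # Ordinary Python exceptions are written to the same file.
            traceback.print_exc(file=crash_log)
            return False
        finally:
            # Stop writing to the file before it is closed.
            faulthandler.disable()
```

With the training code from the question, the call would look like run_with_crash_log(lambda: classifier.train(input_fn=..., steps=3000)). If the process dies, crash.log should then show which Python call was active at that moment.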

Training code:

import tensorflow as tf

def train_input_fn(features, labels, batch_size):
    # Convert the inputs to a Dataset.
    dataset = tf.data.Dataset.from_tensor_slices((features, labels))
    # Shuffle, repeat, and batch the examples.
    dataset = dataset.shuffle(1000).repeat(count=100).batch(batch_size)
    # Return the dataset.
    return dataset

# cria_rede is the model_fn; its definition is not shown here.
gpu_distribution = tf.contrib.distribute.MirroredStrategy()
run_config = tf.estimator.RunConfig(train_distribute=gpu_distribution)
classifier = tf.estimator.Estimator(model_fn=cria_rede, config=run_config, model_dir='models3')
classifier.train(input_fn=lambda: train_input_fn(x_data, y_data, 2), steps=3000)
# x_data and y_data are two-dimensional numpy.ndarrays
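One thing that can be ruled out without TensorFlow is whether the input pipeline runs out of data before steps=3000 is reached, since Estimator.train also stops when the dataset is exhausted. The sketch below only counts the batches that repeat(count=100).batch(batch_size) yields, assuming each training step consumes one batch. The dataset size of 50 is a made-up value, because the question does not give len(x_data). This alone would not explain a log that ends before any training step is reported.

```python
def batches_available(num_examples, repeat_count, batch_size):
    """Batches yielded by dataset.repeat(repeat_count).batch(batch_size).

    shuffle() does not change the number of elements, and because repeat()
    runs before batch(), only the final batch can be short.
    """
    total = num_examples * repeat_count
    return -(-total // batch_size)  # ceiling division


def min_examples_for_steps(steps, repeat_count, batch_size):
    """Smallest dataset size that yields at least `steps` batches."""
    return (steps - 1) * batch_size // repeat_count + 1


# Hypothetical dataset size; the question does not state len(x_data).
num_examples = 50
print(batches_available(num_examples, repeat_count=100, batch_size=2))  # 2500
print(min_examples_for_steps(steps=3000, repeat_count=100, batch_size=2))  # 60
```

With these settings, any dataset of at least 60 examples supplies enough batches for 3000 steps, so a larger x_data would make early exhaustion an unlikely cause.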

0 answers:

No answers.