Question

I created my own data for training the CIFAR10 network by following the instructions in this post: How to create dataset similar to cifar-10. My data is stored in a file named bag1-data.bin.

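I edited all the source code so that the network is trained on my data. The dataset is not very large (1149 images), and the network now only has to predict two classes. When I try to run the CIFAR10 training on this data, I get the following error: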

tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0 1 
2017-05-25 04:27:17.614312: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0:   Y Y 
2017-05-25 04:27:17.614346: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 1:   Y Y 
2017-05-25 04:27:17.614386: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 980, pci bus id: 0000:03:00.0)
2017-05-25 04:27:17.614425: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GeForce GTX 980, pci bus id: 0000:04:00.0)
Traceback (most recent call last):
  File "cifar10_multi_gpu_train.py", line 272, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "cifar10_multi_gpu_train.py", line 268, in main
    train()
  File "cifar10_multi_gpu_train.py", line 240, in train
    assert not np.isnan(loss_value), 'Model diverged with loss = NaN'
AssertionError: Model diverged with loss = NaN

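I read that this can happen because of exploding gradients, but I have lowered the learning rate as far as I could, and even with INITIAL_LEARNING_RATE = 0.0000000001 the error still appears.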

I can train the CIFAR network on this same data using MatConvNet. Even after copying the network parameters from that library into TensorFlow, the problem persists.

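What am I doing wrong? Is the problem related to how I generated my data? Is there any parameter tuning that could help me get the training to work?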
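Since the NaN does not depend on the learning rate, one thing that could be checked is the data file itself. The following is only a minimal sketch, assuming the file follows the CIFAR-10 binary layout (1 label byte followed by 3072 pixel bytes for a 32x32x3 image) and that the two classes are encoded as 0 and 1. The function name `check_cifar_records` and the constants are hypothetical and not part of the tutorial code.

```python
import numpy as np

# Assumed record layout (CIFAR-10 binary version): 1 label byte + 32*32*3 pixel bytes.
LABEL_BYTES = 1
IMAGE_BYTES = 32 * 32 * 3
RECORD_BYTES = LABEL_BYTES + IMAGE_BYTES


def check_cifar_records(raw, num_classes=2):
    """Validate a CIFAR-10-style byte buffer and return (labels, images).

    Hypothetical helper: raises ValueError if the buffer is not a whole
    number of records or if any label is outside [0, num_classes).
    """
    data = np.frombuffer(raw, dtype=np.uint8)
    if data.size == 0 or data.size % RECORD_BYTES != 0:
        raise ValueError(
            "buffer of %d bytes is not a multiple of %d" % (data.size, RECORD_BYTES)
        )

    records = data.reshape(-1, RECORD_BYTES)
    labels = records[:, 0]   # first byte of every record
    images = records[:, 1:]  # remaining 3072 bytes of every record

    bad = np.unique(labels[labels >= num_classes])
    if bad.size:
        raise ValueError("labels outside [0, %d): %s" % (num_classes, bad.tolist()))
    return labels, images
```

The function would be called on the bytes read from bag1-data.bin (opened in binary mode) with `num_classes=2`. If it raises, the file and the modified network disagree about record size or label values. Labels outside the valid range are worth ruling out because TensorFlow's sparse softmax cross-entropy op is documented to return NaN loss rows for them on GPU, which no learning rate can fix.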

CIFAR10 TensorFlow training on my own data: AssertionError: Model diverged with loss = NaN
