Training with multiple GPUs and ModelCheckpoint raises an exception

Asked: 2017-11-08 12:14:34

Tags: tensorflow deep-learning keras gpu

I am training a 1D CNN on two GPUs (2x K80) with Keras, using TensorFlow as the backend.

The problem I'm having

My guess is that the weights are being saved from one GPU while the other GPU is still training (or something along those lines). So I believe what I need is a way to pause the fitting process once an epoch finishes, save the weights, and then move on to the next epoch.
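For context, a workaround that is often suggested for this class of error is to checkpoint the original single-GPU "template" model rather than the multi-GPU wrapper, so the saved HDF5 file keeps the layer names the template expects on reload. The sketch below only illustrates that callback pattern: Keras is deliberately not imported, and `DummyModel` and `BaseModelCheckpoint` are hypothetical stand-ins for a Keras model and a `keras.callbacks.Callback` subclass.

```python
class DummyModel:
    """Stand-in for a Keras model: just holds weights in memory."""
    def __init__(self, weights):
        self.weights = weights
        self.saved = None  # records the last checkpoint for inspection

    def save_weights(self, path):
        # A real Keras model would write an HDF5 file here.
        self.saved = (path, list(self.weights))


class BaseModelCheckpoint:
    """Checkpointer that always saves the template (base) model.

    In real code this would subclass keras.callbacks.Callback and be
    handed the model you built *before* wrapping it for multi-GPU
    training, not the parallel wrapper that fit() runs on.
    """
    def __init__(self, base_model, filepath, monitor="val_loss"):
        self.base_model = base_model
        self.filepath = filepath
        self.monitor = monitor
        self.best = float("inf")

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        current = logs.get(self.monitor)
        if current is not None and current < self.best:
            self.best = current
            # Save the template model's weights, not the wrapper's.
            self.base_model.save_weights(self.filepath)


template = DummyModel(weights=[1.0, 2.0])
ckpt = BaseModelCheckpoint(template, "best_weights.h5")
ckpt.on_epoch_end(0, {"val_loss": 0.5})   # improves -> saved
ckpt.on_epoch_end(1, {"val_loss": 0.9})   # worse -> skipped
print(ckpt.best)          # 0.5
print(template.saved[0])  # best_weights.h5
```

The key point is only which model object the callback holds a reference to; the epoch-end hook itself already runs between epochs, exactly the "save after the epoch completes" behaviour described above.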

The exception I'm getting

File "/root/miniconda3/lib/python3.5/site-packages/keras/engine/topology.py", line 2622, in load_weights
    load_weights_from_hdf5_group(f, self.layers)
  File "/root/miniconda3/lib/python3.5/site-packages/keras/engine/topology.py", line 3103, in load_weights_from_hdf5_group
    layer_names = [n.decode('utf8') for n in f.attrs['layer_names']]
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/root/miniconda3/lib/python3.5/site-packages/h5py/_hl/attrs.py", line 60, in __getitem__
    attr = h5a.open(self._id, self._e(name))
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5a.pyx", line 77, in h5py.h5a.open
KeyError: "Can't open attribute (can't locate attribute: 'layer_names')"

My question is: how can I train a model on multiple GPUs while also using ModelCheckpoint to save the weights of the best epoch?

0 Answers:

There are no answers yet.