Question

在使用Keras和Jupyter Notebook时，一旦我开始训练模型，偶尔会出现一个错误（有关完整的错误日志，请参见下文）。尽管subscribing to this github认为这与版本冲突有关，但在我的情况下似乎并不适用。就我而言，我的版本似乎可以正常运行，因为大多数时候我都可以运行训练程序，但是一旦出现此错误，我需要关闭所有正在运行的python进程并重新启动Anaconda，以便继续进行而不会出现错误。

由于每次发生此错误都要重新启动Anaconda都是非常不方便的，所以我想知道是否存在除版本冲突以外的其他原因导致此错误的修复或建议？

这是我得到的全部错误：

---------------------------------------------------------------------------
UnknownError                              Traceback (most recent call last)
<ipython-input-23-5d485feb54c5> in <module>
      1 K.clear_session()
      2 model_all = define_model(train_data)
----> 3 model_all = train_bild(train_generator_all,validation_generator_all, model_all)
      4 model_all.save(subdir+cat+"/"+cat+"_model_all_inception.h5")
      5 

<ipython-input-17-afb528e9309d> in train_bild(train_generator, validation_generator, model)
     25         epochs=num_epochs,
     26         validation_data=validation_generator,
---> 27         validation_steps=VALID_STEPS, workers=16,callbacks=[checker,early, reduce_lr],class_weight=class_weights)#,class_weight=class_weights)
     28 
     29     model = load_model(filepath)

~\AppData\Local\Continuum\anaconda3\envs\tensorflow-gpu\lib\site-packages\keras\legacy\interfaces.py in wrapper(*args, **kwargs)
     89                 warnings.warn('Update your `' + object_name + '` call to the ' +
     90                               'Keras 2 API: ' + signature, stacklevel=2)
---> 91             return func(*args, **kwargs)
     92         wrapper._original_function = func
     93         return wrapper

~\AppData\Local\Continuum\anaconda3\envs\tensorflow-gpu\lib\site-packages\keras\engine\training.py in fit_generator(self, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
   1416             use_multiprocessing=use_multiprocessing,
   1417             shuffle=shuffle,
-> 1418             initial_epoch=initial_epoch)
   1419 
   1420     @interfaces.legacy_generator_methods_support

~\AppData\Local\Continuum\anaconda3\envs\tensorflow-gpu\lib\site-packages\keras\engine\training_generator.py in fit_generator(model, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
    215                 outs = model.train_on_batch(x, y,
    216                                             sample_weight=sample_weight,
--> 217                                             class_weight=class_weight)
    218 
    219                 outs = to_list(outs)

~\AppData\Local\Continuum\anaconda3\envs\tensorflow-gpu\lib\site-packages\keras\engine\training.py in train_on_batch(self, x, y, sample_weight, class_weight)
   1215             ins = x + y + sample_weights
   1216         self._make_train_function()
-> 1217         outputs = self.train_function(ins)
   1218         return unpack_singleton(outputs)
   1219 

~\AppData\Local\Continuum\anaconda3\envs\tensorflow-gpu\lib\site-packages\keras\backend\tensorflow_backend.py in __call__(self, inputs)
   2713                 return self._legacy_call(inputs)
   2714 
-> 2715             return self._call(inputs)
   2716         else:
   2717             if py_any(is_tensor(x) for x in inputs):

~\AppData\Local\Continuum\anaconda3\envs\tensorflow-gpu\lib\site-packages\keras\backend\tensorflow_backend.py in _call(self, inputs)
   2673             fetched = self._callable_fn(*array_vals, run_metadata=self.run_metadata)
   2674         else:
-> 2675             fetched = self._callable_fn(*array_vals)
   2676         return fetched[:len(self.outputs)]
   2677 

~\AppData\Local\Continuum\anaconda3\envs\tensorflow-gpu\lib\site-packages\tensorflow\python\client\session.py in __call__(self, *args, **kwargs)
   1437           ret = tf_session.TF_SessionRunCallable(
   1438               self._session._session, self._handle, args, status,
-> 1439               run_metadata_ptr)
   1440         if run_metadata:
   1441           proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

~\AppData\Local\Continuum\anaconda3\envs\tensorflow-gpu\lib\site-packages\tensorflow\python\framework\errors_impl.py in __exit__(self, type_arg, value_arg, traceback_arg)
    526             None, None,
    527             compat.as_text(c_api.TF_Message(self.status.status)),
--> 528             c_api.TF_GetCode(self.status.status))
    529     # Delete the underlying status object from memory otherwise it stays alive
    530     # as there is a reference to status from this from the traceback due to

UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[{{node conv2d_1/convolution}} = Conv2D[T=DT_FLOAT, _class=["loc:@batch_normalization_1/cond_1/FusedBatchNorm/Switch"], data_format="NCHW", dilations=[1, 1, 1, 1], padding="VALID", strides=[1, 1, 2, 2], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](conv2d_1/convolution-0-TransposeNHWCToNCHW-LayoutOptimizer, conv2d_1/kernel/read)]]
     [[{{node loss/mul/_4005}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_4855_loss/mul", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Answer 1

我几次遇到这个问题，所有的原因都是由于Saver试图还原一个肮脏的日志文件-唯一的解决方案是删除最后一个模型检查点文件并从上一个模型检查点文件重新启动（还删除了这行指向checkpoint.txt文件中的最后一个）。

在模型保存期间发生某种情况时可能会发生这种情况（由保护程序处理的模子-某些内容在仍处于写入状态时更改了文件，...）

Answer 2

https://github.com/tensorflow/tensorflow/issues/24828中的错误代码（在注释中由马斯特夫（Mastiff）链接）是这样的：

# python 3.6 and tensorflow (both 1.x and 2.0)
def allow_gpu_memory_growth(log_device_placement=True):
    """
    Allow dynamic memory growth (by default, tensorflow allocates all gpu memory).
    This sometimes fixes the 
    <<Error : Failed to get convolution algorithm. 
    This is probably because cuDNN failed to initialize, 
    so try looking to see if a warning log message was printed above>>. 
    May hurt performance slightly (see https://www.tensorflow.org/guide/gpu).

    Usage: Run before any other code.

    :param log_device_placement: set True to log device placement (on which device the operation ran)
    :return:None
    """
    from tensorflow.compat.v1.keras.backend import set_session
    config = tf.compat.v1.ConfigProto()
    config.gpu_options.allow_growth = True  # dynamically grow the memory used on the GPU
    config.log_device_placement = log_device_placement
    sess = tf.compat.v1.Session(config=config)
    set_session(sess)

Keras-UnknownError：无法获得卷积算法

2 个答案: