Why does h5py.h5f.create fail with OSError: unable to open file: name = 'name'

Time: 2018-03-12 16:21:42

Tags: python tensorflow keras h5py

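When running a Keras (with TensorFlow) model in a loop while changing the model parameters and input data, I see random failures creating the model-weights file for the checkpoint callback. I thought it might be related to the length of my directory names, but if so it looks like a bug, because it only happens sometimes: before the error below appears, it writes several files into directories with names of the same length. I use long directory names to make runs easier to tell apart in TensorBoard.

I show the basic code setup as pseudocode, followed by the random error I get. I have a nested for loop that varies the model parameters as well as the input data. The loop runs fine for hours and then fails at some random point with the same error. I would like to know whether something wrong with my file names is causing this. I would also like a workaround so that when it fails I can keep running, skip the failed file, and move on to the next one. Some kind of try/except, but I don't know h5py well enough to know how to code it.

I am running on Windows 10 (conda env) with tensorflow-gpu 1.6.0, Keras 2.1.5, h5py 2.7.1, and tensorboard 1.6.0. I have also set up Windows 10 to handle long file names. The error seems to come directly from h5py (h5py\h5f.pyx). Also, the file is actually created and written: I can load it with h5py.File(), it is the right size, and it has the same groups and objects.

Update: I have included the os.makedirs() line, which I had not shown in the code before. I also added a check that the directory was created and ran the code again. It still fails the same way, and the isdir() check never triggers.

Update 2: I want to point out that I am using multiprocessing because of a memory leak when using Keras with TensorFlow. This happens with or without K.clear_session() and tf.reset_default_graph(). I now believe this random error is related to multiprocessing, because I have not observed it since I removed the pool process.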

import multiprocessing
import os

import tensorflow as tf
from keras import backend as K, callbacks

def main():
    for input_data in input_data_list:
        for model_parameters in model_parameters_list:
            # run model with different parameters on all data
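            # one fresh worker process per run
            pool = multiprocessing.Pool(1)
            # pseudocode: run_parameters... stands for arguments not shown
            pool.apply(run_function,
                       args=(run_parameters..., model_func_name,
                             model_func_dict))
            pool.close()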

def run_function(run_parameters...,model_func_name,model_func_dict,...):
    # code to extract x_train,y_train, x_val, y_val etc not shown
    # model_def = long string representing model parameters example below
    # model_def =
    # 'basic_ff_nn4_mse_dr50_Nadam_LeakyReLU_kr_l2_ar_off_ns_0_BCtoA_all_2_2'
    # build and compile model
    model = model_func_name(**model_func_dict)
    # set up callbacks
    # build the run directory path once and reuse it
    tmp_path = models_dir + "{}_{}_{}_{}/".format(model_def, set_name, fold,
                                                   set_num)
    os.makedirs(tmp_path, exist_ok=True)
    best_weights_file = tmp_path + "best_weights.hdf5"
    best_model_weights = callbacks.ModelCheckpoint(best_weights_file,
                                                   save_best_only=True,
                                                   save_weights_only=True)
    log_dir = 'output/{}_{}/tf_logs/{}/{}/{}'.format(model_type, cur_time,
                                                     model_def, set_name,
                                                     'f' + str(fold))
    tensorboard = callbacks.TensorBoard(log_dir=log_dir,
                                        histogram_freq=0, write_graph=False,
                                        write_images=False,
                                        write_grads=False)
    if not os.path.isdir(tmp_path):
        print('path not created = ',tmp_path)
    model_history = model.fit(x=x_train, y=y_train,
                              verbose=0,
                              batch_size=size_batches,
                              epochs=num_epochs,
                              validation_data=[x_val, y_val],
                              callbacks=[best_model_weights, tensorboard],
                              )
    K.clear_session()
    tf.reset_default_graph()
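
The skip-and-continue behavior asked about above could look roughly like the following. This is only a minimal sketch, not a fix for the underlying error: `run_all` and `flaky` are hypothetical names introduced for illustration, and `flaky` merely stands in for one training run. It relies on the fact that the failure in the traceback surfaces as a plain `OSError`.

```python
import logging


def run_all(jobs, run_one):
    """Call run_one(job) for every job, skipping jobs that raise OSError.

    Returns (completed, failed); failed holds (job, exception) pairs.
    """
    completed, failed = [], []
    for job in jobs:
        try:
            run_one(job)
        except OSError as exc:
            # The failed h5py create is reported as OSError, so this clause
            # catches it. It also catches unrelated file errors, so log each one.
            logging.warning("skipping %r: %s", job, exc)
            failed.append((job, exc))
            continue
        completed.append(job)
    return completed, failed


def flaky(job):
    # Hypothetical stand-in for one training run: fails for job "b" only.
    if job == "b":
        raise OSError(22, "Invalid argument")


completed, failed = run_all(["a", "b", "c"], flaky)
print(completed)  # ['a', 'c']
```

In the loop from the question, the `try` would wrap the `pool.apply(...)` call in `main()`. `pool.apply` re-raises an exception from the worker in the parent process, so the same `except OSError` clause should apply there.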

multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "C:\ProgramData\Miniconda3\envs\tflow_g\lib\multiprocessing\pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "C:\Users\xxxxx\Dropbox (test)\Lab\VLL models\zakworkspace\cps\cps_main.py", line 1042, in run_joint_ff
    callbacks=[best_model_weights, tensorboard],
  File "C:\ProgramData\Miniconda3\envs\tflow_g\lib\site-packages\keras\models.py", line 963, in fit
    validation_steps=validation_steps)
  File "C:\ProgramData\Miniconda3\envs\tflow_g\lib\site-packages\keras\engine\training.py", line 1705, in fit
    validation_steps=validation_steps)
  File "C:\ProgramData\Miniconda3\envs\tflow_g\lib\site-packages\keras\engine\training.py", line 1255, in _fit_loop
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "C:\ProgramData\Miniconda3\envs\tflow_g\lib\site-packages\keras\callbacks.py", line 77, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "C:\ProgramData\Miniconda3\envs\tflow_g\lib\site-packages\keras\callbacks.py", line 445, in on_epoch_end
    self.model.save_weights(filepath, overwrite=True)
  File "C:\ProgramData\Miniconda3\envs\tflow_g\lib\site-packages\keras\models.py", line 754, in save_weights
    with h5py.File(filepath, 'w') as f:
  File "C:\ProgramData\Miniconda3\envs\tflow_g\lib\site-packages\h5py\_hl\files.py", line 269, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
  File "C:\ProgramData\Miniconda3\envs\tflow_g\lib\site-packages\h5py\_hl\files.py", line 105, in make_fid
    fid = h5f.create(name, h5f.ACC_TRUNC, fapl=fapl, fcpl=fcpl)
  File "h5py\_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py\_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py\h5f.pyx", line 98, in h5py.h5f.create
OSError: Unable to create file (unable to open file: name = 'output/TB_runs_03122018-031837/dump/models/basic_ff_nn4_mse_dr50_Nadam_LeakyReLU_kr_l2_ar_off_ns_0_BCtoA_all_2_2/best_weights.hdf5', errno = 22, error message = 'Invalid argument', flags = 13, o_flags = 302)
"""

0 answers:

No answers yet