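When running a Keras (with TensorFlow) model in a loop while changing the model parameters and input data, I am finding that creation of the callback model weights file fails randomly. I think it may be related to the length of my directory names, but if so it looks like a bug, since it only happens some of the time. It writes multiple files into directories with names of the same length before hitting the error below. I use long names in the directories so that it is easier to tell runs apart in TensorBoard.

I show the basic code setup as pseudocode, followed by the random error I get. I have a nested for loop that changes the model parameters as well as the input data. The basic loop works fine for hours and then fails at some random point in the loop with the same error. I would like to know whether something is wrong with my file names that causes this. I would also like a workaround so that when it fails, I can keep running, skip the failed file, and move on to the next one. Some type of try/except, but I don't know h5py well enough to know how to code it.

I am running on Windows 10 (conda env), tensorflow-gpu 1.6.0, Keras 2.1.5, h5py 2.7.1, tensorboard 1.6.0. I have also set up Windows 10 to handle long file names. The error seems to come directly from h5py (h5py\h5f.pyx). Also, the file is actually created and written. I can load the file with h5py.File(), and it is the correct size and has the same groups and objects.

Update: I have included the os.makedirs() line, which I had not shown in the code before. I also added a check on directory creation and ran the code again. It still fails in the same way, and the isdir() check never triggered.

Update 2: I want to point out that I am using multiprocessing when using Keras with TensorFlow because of memory leaks. The failure occurs with or without K.clear_session() and tf.reset_default_graph(). I now believe this random error is related to multiprocessing, because I have not yet observed it when I eliminate the pool process.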
def main():
    for input_data in input_data_list:
        for model_parameters in model_parameters_list:
            # run model with different parameters on all data
            pool = multiprocessing.Pool(1)
            pool.apply(run_function, run_parameters..., model_func_name,
                       model_func_dict)
            pool.close()


def run_function(run_parameters..., model_func_name, model_func_dict, ...):
    # code to extract x_train, y_train, x_val, y_val etc not shown
    # model_def = long string representing model parameters, example below
    # model_def =
    # 'basic_ff_nn4_mse_dr50_Nadam_LeakyReLU_kr_l2_ar_off_ns_0_BCtoA_all_2_2'

    # build and compile model
    model = model_func_name(**model_func_dict)

    # set up callbacks
    os.makedirs(models_dir + "{}_{}_{}_{}/".format(model_def, set_name,
                                                   fold, set_num),
                exist_ok=True)
    tmp_path = models_dir + "{}_{}_{}_{}/".format(model_def, set_name, fold,
                                                  set_num)
    best_weights_file = models_dir + "{}_{}_{}_{}/best_weights.hdf5".format(
        model_def, set_name, fold, set_num)
    best_model_weights = callbacks.ModelCheckpoint(best_weights_file,
                                                   save_best_only=True,
                                                   save_weights_only=True)
    log_dir = 'output/{}_{}/tf_logs/{}/{}/{}'.format(model_type, cur_time,
                                                     model_def, set_name,
                                                     'f' + str(fold))
    tensorboard = callbacks.TensorBoard(log_dir=log_dir,
                                        histogram_freq=0, write_graph=False,
                                        write_images=False,
                                        write_grads=False)

    # check that the checkpoint directory really exists
    if not os.path.isdir(tmp_path):
        print('path not created = ', tmp_path)

    model_history = model.fit(x=x_train, y=y_train,
                              verbose=0,
                              batch_size=size_batches,
                              epochs=num_epochs,
                              validation_data=[x_val, y_val],
                              callbacks=[best_model_weights, tensorboard],
                              )
    K.clear_session()
    tf.reset_default_graph()

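
Because the directory names are long, one quick way to rule path length in or out is to log the absolute length of each checkpoint path before training starts. The following is only a sketch and not part of my original code: `describe_path` is a hypothetical helper, and 260 is the classic Windows `MAX_PATH` limit, which may not apply on a machine where long-path support is enabled.

```python
import os

# Classic Windows MAX_PATH limit. This is an assumption: long-path
# support may lift the limit on a given machine.
WINDOWS_MAX_PATH = 260


def describe_path(path, limit=WINDOWS_MAX_PATH):
    """Return the absolute form of `path`, its length, and whether it fits."""
    abs_path = os.path.abspath(path)
    return {
        "abs_path": abs_path,
        "length": len(abs_path),
        "within_limit": len(abs_path) < limit,
    }


if __name__ == "__main__":
    # Relative path copied from the error message.
    example = ("output/TB_runs_03122018-031837/dump/models/"
               "basic_ff_nn4_mse_dr50_Nadam_LeakyReLU_kr_l2_ar_off_ns_0_"
               "BCtoA_all_2_2/best_weights.hdf5")
    info = describe_path(example)
    print(info["length"], info["within_limit"])
```

Calling this on `best_weights_file` right after the `os.makedirs()` call would show whether the runs that fail have longer absolute paths than the ones that succeed.
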
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "C:\ProgramData\Miniconda3\envs\tflow_g\lib\multiprocessing\pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "C:\Users\xxxxx\Dropbox (test)\Lab\VLL models\zakworkspace\cps\cps_main.py", line 1042, in run_joint_ff
callbacks=[best_model_weights, tensorboard],
File "C:\ProgramData\Miniconda3\envs\tflow_g\lib\site-packages\keras\models.py", line 963, in fit
validation_steps=validation_steps)
File "C:\ProgramData\Miniconda3\envs\tflow_g\lib\site-packages\keras\engine\training.py", line 1705, in fit
validation_steps=validation_steps)
File "C:\ProgramData\Miniconda3\envs\tflow_g\lib\site-packages\keras\engine\training.py", line 1255, in _fit_loop
callbacks.on_epoch_end(epoch, epoch_logs)
File "C:\ProgramData\Miniconda3\envs\tflow_g\lib\site-packages\keras\callbacks.py", line 77, in on_epoch_end
callback.on_epoch_end(epoch, logs)
File "C:\ProgramData\Miniconda3\envs\tflow_g\lib\site-packages\keras\callbacks.py", line 445, in on_epoch_end
self.model.save_weights(filepath, overwrite=True)
File "C:\ProgramData\Miniconda3\envs\tflow_g\lib\site-packages\keras\models.py", line 754, in save_weights
with h5py.File(filepath, 'w') as f:
File "C:\ProgramData\Miniconda3\envs\tflow_g\lib\site-packages\h5py\_hl\files.py", line 269, in __init__
fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
File "C:\ProgramData\Miniconda3\envs\tflow_g\lib\site-packages\h5py\_hl\files.py", line 105, in make_fid
fid = h5f.create(name, h5f.ACC_TRUNC, fapl=fapl, fcpl=fcpl)
File "h5py\_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py\_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py\h5f.pyx", line 98, in h5py.h5f.create
OSError: Unable to create file (unable to open file: name = 'output/TB_runs_03122018-031837/dump/models/basic_ff_nn4_mse_dr50_Nadam_LeakyReLU_kr_l2_ar_off_ns_0_BCtoA_all_2_2/best_weights.hdf5', errno = 22, error message = 'Invalid argument', flags = 13, o_flags = 302)
"""
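
For the skip-and-continue behaviour asked about, h5py-specific handling may not be needed: the traceback ends in a plain `OSError`, and `pool.apply` re-raises a worker's exception in the parent process, so it can be caught around the call in the loop. A minimal sketch under that assumption follows; `run_all`, `run_one`, and `flaky` are hypothetical names, with `run_one` standing in for whatever launches a single training run.

```python
def run_all(tasks, run_one):
    """Run every task, skipping the ones that raise OSError.

    Returns (completed, failed), where failed holds (task, error) pairs.
    """
    completed, failed = [], []
    for task in tasks:
        try:
            run_one(task)
        except OSError as err:
            # Record the failure and move on to the next combination.
            print("skipping {!r}: {}".format(task, err))
            failed.append((task, err))
        else:
            completed.append(task)
    return completed, failed


def flaky(task):
    # Stand-in for a training run that sometimes cannot create its file.
    if task == "bad":
        raise OSError(22, "Invalid argument")


done, skipped = run_all(["a", "bad", "b"], flaky)
print(done)                      # ['a', 'b']
print([t for t, _ in skipped])   # ['bad']
```

In the nested loop, `run_one` could wrap the `multiprocessing.Pool(1)` creation and the `pool.apply(...)` call; closing the pool in a `finally` block would keep a failed run from leaving a worker process behind. Keeping the `failed` list also makes it possible to rerun only the skipped combinations later.
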