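I have several callbacks that run after each epoch (shown below).
However, after the first epoch of training I get the traceback further down. I assume it is related to the checkpoint callback.
Is this normal behaviour?
My callbacks.py, where all callbacks are created in create_callbacks():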
import glob
import os

import tensorflow as tf
from tensorflow.python.lib.io import file_io


def create_callbacks(job_dir, logs_path):
    checkpoint_path = 'checkpoint.{epoch:04d}-{val_loss:.9f}.hdf5'
    if not job_dir.startswith("gs://"):  # then local
        # Join path components explicitly so a job_dir without a trailing
        # slash still resolves to <job_dir>/checkpoints/.
        checkpoint_path = os.path.join(job_dir, 'checkpoints', checkpoint_path)

    checkpoint = tf.keras.callbacks.ModelCheckpoint(
        checkpoint_path, monitor='val_loss', verbose=0, save_best_only=True,
        save_weights_only=False, mode='auto', period=1)

    tb = tf.keras.callbacks.TensorBoard(
        log_dir=logs_path, batch_size=None, histogram_freq=0, write_graph=False)

    # Continuous eval callback
    export = ContinuousExport(eval_frequency=1, job_dir=job_dir)

    return [checkpoint, tb, export]


class ContinuousExport(tf.keras.callbacks.Callback):
    """Continuous eval callback to evaluate the checkpoint once every so many epochs."""

    def __init__(self, eval_frequency, job_dir):
        super().__init__()
        self.eval_frequency = eval_frequency
        self.job_dir = job_dir

    def on_epoch_end(self, epoch, logs=None):
        print('Epoch number is {}'.format(epoch))
        print('Frequency is {}'.format(self.eval_frequency))
        if epoch > 0 and epoch % self.eval_frequency == 0:
            # Unhappy hack to work around h5py not being able to write to GCS.
            # Force snapshots and saves to local filesystem, then copy them over to GCS.
            model_path_glob = 'checkpoint.*'
            if not self.job_dir.startswith("gs://"):
                model_path_glob = os.path.join(self.job_dir, 'checkpoints', model_path_glob)
            # Sort by modification time so the newest checkpoint is last.
            checkpoints = sorted(glob.glob(model_path_glob), key=os.path.getmtime)
            print('Path is {}'.format(model_path_glob))
            print('Length of cp is {}'.format(len(checkpoints)))
            if len(checkpoints) > 0:
                print(checkpoints[-1])
                if self.job_dir.startswith("gs://"):
                    print('Copying the model to {}'.format(self.job_dir + '/checkpoints/'))
                    copy_file_to_gcs(self.job_dir + '/checkpoints/', checkpoints[-1])
                else:
                    print('Using local storage, not saving to GCS')
            else:
                print('\nEvaluation epoch[{}] (no checkpoints found)'.format(epoch))


def copy_file_to_gcs(job_dir, file_path):
    # Read the local checkpoint as bytes and write it to the target path as bytes.
    with file_io.FileIO(file_path, mode='rb') as input_f:
        with file_io.FileIO(os.path.join(job_dir, file_path), mode='wb') as output_f:
            output_f.write(input_f.read())
INFO 2018-10-08 12:17:30 +0100 master-replica-0 Module completed; cleaning up.
INFO 2018-10-08 12:17:30 +0100 master-replica-0 Clean up finished.
ERROR 2018-10-08 12:18:23 +0100 service The replica master 0 exited with a non-zero status of 1.
ERROR 2018-10-08 12:18:23 +0100 service Traceback (most recent call last):
ERROR 2018-10-08 12:18:23 +0100 service   File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
ERROR 2018-10-08 12:18:23 +0100 service     "__main__", mod_spec)
ERROR 2018-10-08 12:18:23 +0100 service   File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
ERROR 2018-10-08 12:18:23 +0100 service     exec(code, run_globals)
ERROR 2018-10-08 12:18:23 +0100 service   File "/root/.local/lib/python3.5/site-packages/trainer/model.py", line 167, in <module>
ERROR 2018-10-08 12:18:23 +0100 service     train_model(train_file=train_file, test_file=test_file, job_dir=job_dir, **arguments)
ERROR 2018-10-08 12:18:23 +0100 service   File "/root/.local/lib/python3.5/site-packages/trainer/model.py", line 59, in train_model
ERROR 2018-10-08 12:18:23 +0100 service     model = fit_model(model, train_g, test_g, callbacks)
ERROR 2018-10-08 12:18:23 +0100 service   File "/root/.local/lib/python3.5/site-packages/trainer/model.py", line 124, in fit_model
ERROR 2018-10-08 12:18:23 +0100 service     model.fit_generator(**params)
ERROR 2018-10-08 12:18:23 +0100 service   File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/_impl/keras/engine/training.py", line 1598, in fit_generator
ERROR 2018-10-08 12:18:23 +0100 service     initial_epoch=initial_epoch)
ERROR 2018-10-08 12:18:23 +0100 service   File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/_impl/keras/engine/training_generator.py", line 231, in fit_generator
ERROR 2018-10-08 12:18:23 +0100 service     callbacks.on_epoch_end(epoch, epoch_logs)
ERROR 2018-10-08 12:18:23 +0100 service   File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/_impl/keras/callbacks.py", line 95, in on_epoch_end
ERROR 2018-10-08 12:18:23 +0100 service     callback.on_epoch_end(epoch, logs)
ERROR 2018-10-08 12:18:23 +0100 service   File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/_impl/keras/callbacks.py", line 468, in on_epoch_end
ERROR 2018-10-08 12:18:23 +0100 service     self.model.save(filepath, overwrite=True)
ERROR 2018-10-08 12:18:23 +0100 service   File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/_impl/keras/engine/network.py", line 1126, in save
ERROR 2018-10-08 12:18:23 +0100 service     save_model(self, filepath, overwrite, include_optimizer)
ERROR 2018-10-08 12:18:23 +0100 service   File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/_impl/keras/engine/saving.py", line 75, in save_model
ERROR 2018-10-08 12:18:23 +0100 service     raise ImportError('save_model requires h5py.')
ERROR 2018-10-08 12:18:23 +0100 service ImportError: save_model requires h5py.
Answer 0 (score: 0)
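Yes, you need to install the h5py package.
HDF5 files, which Keras reads and writes through h5py, are the container used to store the trained model. Without h5py installed the model cannot be saved, which is why the ModelCheckpoint callback fails at the end of the first epoch.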
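If you would rather have the job fail before training starts than at the end of the first epoch, you could check for h5py up front. The following is only a minimal sketch using the standard library; `h5py_available` and `require_h5py` are hypothetical helper names, not part of Keras or of the code in the question.

```python
import importlib.util


def h5py_available():
    """Return True if the h5py package can be found in this environment."""
    # find_spec looks the module up without fully importing it.
    return importlib.util.find_spec("h5py") is not None


def require_h5py():
    """Raise ImportError early if h5py is missing (hypothetical helper)."""
    if not h5py_available():
        raise ImportError(
            "h5py is not installed; .hdf5 checkpoints cannot be written. "
            "Install it with `pip install h5py`."
        )


if __name__ == "__main__":
    print("h5py available:", h5py_available())
```

Calling `require_h5py()` at the top of `create_callbacks()` would surface the missing package before `model.fit_generator` runs.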
Pre-built h5py wheels can be installed from PyPI with pip:
$ pip install h5py
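The `master-replica-0` log lines and the `/root/.local/.../trainer/model.py` path suggest the job runs on Cloud ML Engine, where you cannot run pip on the workers by hand. Assuming that is the case, and assuming your code is packaged as a `trainer` package with its own setup.py, one option is to declare the dependency there so it is installed together with your package. This is a sketch only; the package name and version below are placeholders, not taken from the question.

```python
# setup.py (sketch; name and version are placeholder assumptions)
from setuptools import find_packages, setup

# Packages the trainer needs on top of what the runtime already provides.
REQUIRED_PACKAGES = ["h5py"]

setup(
    name="trainer",       # assumed from the trainer/model.py path in the traceback
    version="0.1",        # placeholder
    packages=find_packages(),
    install_requires=REQUIRED_PACKAGES,
)
```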