Unable to save Keras checkpoints on CloudML (1.8): ImportError: `save_model` requires h5py

Date: 2018-10-08 11:31:03

Tags: python tensorflow keras

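I have callbacks that do the following after each epoch:

  1. Write TensorBoard logs.
  2. Save a model checkpoint.

However, after the first epoch of training I get the traceback shown after the code. I assume it is related to the checkpoint callback.

Is this normal behaviour?

My callbacks.py, where all callbacks are created in create_callbacks():
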
import glob
import os

import tensorflow as tf
from tensorflow.python.lib.io import file_io


def create_callbacks(job_dir, logs_path):

    checkpoint_path = 'checkpoint.{epoch:04d}-{val_loss:.9f}.hdf5'

    if not job_dir.startswith("gs://"):  # then local
        checkpoint_path = os.path.join(job_dir + 'checkpoints', checkpoint_path)

    checkpoint = tf.keras.callbacks.ModelCheckpoint(checkpoint_path, monitor='val_loss', verbose=0, save_best_only=True,
                                 save_weights_only=False,
                                 mode='auto', period=1)

    tb = tf.keras.callbacks.TensorBoard(log_dir=logs_path, batch_size=None, histogram_freq=0, write_graph=False)

    # Continuous eval callback
    export = ContinuousExport(eval_frequency=1, job_dir=job_dir)

    return [checkpoint, tb, export]


class ContinuousExport(tf.keras.callbacks.Callback):
    """Continuous eval callback to evaluate the checkpoint once every so many epochs."""

    def __init__(self, eval_frequency, job_dir,):
        self.eval_frequency = eval_frequency
        self.job_dir = job_dir

    def on_epoch_end(self, epoch, logs={}):
        print('Epoch number is {}'.format(epoch))
        print('Frequency is {}'.format(self.eval_frequency))
        if epoch > 0 and epoch % self.eval_frequency == 0:
            # Unhappy hack to work around h5py not being able to write to GCS.
            # Force snapshots and saves to local filesystem, then copy them over to GCS.
            model_path_glob = 'checkpoint.*'
            if not self.job_dir.startswith("gs://"):
                model_path_glob = os.path.join(self.job_dir + 'checkpoints', model_path_glob)
            checkpoints = sorted(glob.glob(model_path_glob), key=os.path.getmtime)
            print('Path is {}'.format(model_path_glob))
            print('Length of cp is {}'.format(len(checkpoints)))
            if len(checkpoints) > 0:
                print(checkpoints[-1])
                if self.job_dir.startswith("gs://"):
                    print('Copying the model to {}'.format(self.job_dir + '/checkpoints/'))
                    copy_file_to_gcs(self.job_dir + '/checkpoints/', checkpoints[-1])
                else:
                    print('Using local storage, not saving to GCS')
        else:
            print('\nEvaluation epoch[{}] (no checkpoints found)'.format(epoch))


def copy_file_to_gcs(job_dir, file_path):
    with file_io.FileIO(file_path, mode='rb') as input_f:
        with file_io.FileIO(os.path.join(job_dir, file_path), mode='w+') as output_f:
            output_f.write(input_f.read())
  

INFO 2018-10-08 12:17:30 +0100 master-replica-0 Module completed; cleaning up.
INFO 2018-10-08 12:17:30 +0100 master-replica-0 Clean up finished.
ERROR 2018-10-08 12:18:23 +0100 service The replica master 0 exited with a non-zero status of 1.
ERROR 2018-10-08 12:18:23 +0100 service Traceback (most recent call last):
ERROR 2018-10-08 12:18:23 +0100 service   File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
ERROR 2018-10-08 12:18:23 +0100 service     "__main__", mod_spec)
ERROR 2018-10-08 12:18:23 +0100 service   File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
ERROR 2018-10-08 12:18:23 +0100 service     exec(code, run_globals)
ERROR 2018-10-08 12:18:23 +0100 service   File "/root/.local/lib/python3.5/site-packages/trainer/model.py", line 167, in <module>
ERROR 2018-10-08 12:18:23 +0100 service     train_model(train_file=train_file, test_file=test_file, job_dir=job_dir, **arguments)
ERROR 2018-10-08 12:18:23 +0100 service   File "/root/.local/lib/python3.5/site-packages/trainer/model.py", line 59, in train_model
ERROR 2018-10-08 12:18:23 +0100 service     model = fit_model(model, train_g, test_g, callbacks)
ERROR 2018-10-08 12:18:23 +0100 service   File "/root/.local/lib/python3.5/site-packages/trainer/model.py", line 124, in fit_model
ERROR 2018-10-08 12:18:23 +0100 service     model.fit_generator(**params)
ERROR 2018-10-08 12:18:23 +0100 service   File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/_impl/keras/engine/training.py", line 1598, in fit_generator
ERROR 2018-10-08 12:18:23 +0100 service     initial_epoch=initial_epoch)
ERROR 2018-10-08 12:18:23 +0100 service   File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/_impl/keras/engine/training_generator.py", line 231, in fit_generator
ERROR 2018-10-08 12:18:23 +0100 service     callbacks.on_epoch_end(epoch, epoch_logs)
ERROR 2018-10-08 12:18:23 +0100 service   File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/_impl/keras/callbacks.py", line 95, in on_epoch_end
ERROR 2018-10-08 12:18:23 +0100 service     callback.on_epoch_end(epoch, logs)
ERROR 2018-10-08 12:18:23 +0100 service   File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/_impl/keras/callbacks.py", line 468, in on_epoch_end
ERROR 2018-10-08 12:18:23 +0100 service     self.model.save(filepath, overwrite=True)
ERROR 2018-10-08 12:18:23 +0100 service   File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/_impl/keras/engine/network.py", line 1126, in save
ERROR 2018-10-08 12:18:23 +0100 service     save_model(self, filepath, overwrite, include_optimizer)
ERROR 2018-10-08 12:18:23 +0100 service   File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/_impl/keras/engine/saving.py", line 75, in save_model
ERROR 2018-10-08 12:18:23 +0100 service     raise ImportError('`save_model` requires h5py.')
ERROR 2018-10-08 12:18:23 +0100 service ImportError: `save_model` requires h5py.

1 Answer:

Answer 0 (score: 0)

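Yes, you need to install the h5py package.

Keras stores the trained model in an HDF5 file (the .hdf5 checkpoint in your code), and h5py is the Python library that writes that format. If h5py is not installed, the model cannot be saved.

Pre-built h5py wheels can be installed from PyPI with pip: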

$ pip install h5py
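
Because the error in the question only surfaces when the first checkpoint is written at the end of epoch one, it may help to check for h5py before training starts. The following is a minimal sketch, not part of the original code: the helper names `is_installed` and `require_h5py` are hypothetical, and it only assumes the standard-library `importlib.util.find_spec`.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the import system can locate `module_name`."""
    return importlib.util.find_spec(module_name) is not None


def require_h5py():
    """Fail fast, before training starts, if h5py is missing.

    Hypothetical helper: call it before model.fit_generator(...) so the
    job stops immediately instead of after the first epoch.
    """
    if not is_installed("h5py"):
        raise ImportError(
            "h5py is not installed; ModelCheckpoint cannot write .hdf5 "
            "files. Install it with `pip install h5py`."
        )
```

Calling `require_h5py()` at the top of `train_model` would be one place to put the check. Note that on CloudML a local `pip install` does not reach the training workers; the package would presumably also have to be declared as a dependency of the trainer package (for example under `install_requires` in its `setup.py`) so that it is installed there.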