Question

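After each epoch, I have the following callbacks:

Create a TensorBoard log.
Save a model checkpoint.

However, after the first epoch of training I get the traceback below. I assume it is related to the checkpoint callback.

Is this normal behavior?

My callbacks.py, where all callbacks are created in create_callbacks():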

import glob
import os

import tensorflow as tf
from tensorflow.python.lib.io import file_io


def create_callbacks(job_dir, logs_path):

    checkpoint_path = 'checkpoint.{epoch:04d}-{val_loss:.9f}.hdf5'

    if not job_dir.startswith("gs://"):  # then local
        checkpoint_path = os.path.join(job_dir + 'checkpoints', checkpoint_path)

    checkpoint = tf.keras.callbacks.ModelCheckpoint(checkpoint_path, monitor='val_loss', verbose=0, save_best_only=True,
                                 save_weights_only=False,
                                 mode='auto', period=1)

    tb = tf.keras.callbacks.TensorBoard(log_dir=logs_path, batch_size=None, histogram_freq=0, write_graph=False)

    # Continuous eval callback
    export = ContinuousExport(eval_frequency=1, job_dir=job_dir)

    return [checkpoint, tb, export]


class ContinuousExport(tf.keras.callbacks.Callback):
    """Continuous eval callback to evaluate the checkpoint once every so many epochs."""

    def __init__(self, eval_frequency, job_dir):
        super().__init__()
        self.eval_frequency = eval_frequency
        self.job_dir = job_dir

    def on_epoch_end(self, epoch, logs=None):
        print('Epoch number is {}'.format(epoch))
        print('Frequency is {}'.format(self.eval_frequency))
        if epoch > 0 and epoch % self.eval_frequency == 0:
            # Unhappy hack to work around h5py not being able to write to GCS.
            # Force snapshots and saves to local filesystem, then copy them over to GCS.
            model_path_glob = 'checkpoint.*'
            if not self.job_dir.startswith("gs://"):
                model_path_glob = os.path.join(self.job_dir + 'checkpoints', model_path_glob)
            checkpoints = sorted(glob.glob(model_path_glob), key=os.path.getmtime)
            print('Path is {}'.format(model_path_glob))
            print('Length of cp is {}'.format(len(checkpoints)))
            if len(checkpoints) > 0:
                print(checkpoints[-1])
                if self.job_dir.startswith("gs://"):
                    print('Copying the model to {}'.format(self.job_dir + '/checkpoints/'))
                    copy_file_to_gcs(self.job_dir + '/checkpoints/', checkpoints[-1])
                else:
                    print('Using local storage, not saving to GCS')
            else:
                print('\nEvaluation epoch[{}] (no checkpoints found)'.format(epoch))


def copy_file_to_gcs(job_dir, file_path):
    with file_io.FileIO(file_path, mode='rb') as input_f:
        with file_io.FileIO(os.path.join(job_dir, file_path), mode='w+') as output_f:
            output_f.write(input_f.read())

INFO 2018-10-08 12:17:30 +0100 master-replica-0 Module completed; cleaning up.
INFO 2018-10-08 12:17:30 +0100 master-replica-0 Clean up finished.
ERROR 2018-10-08 12:18:23 +0100 service The replica master 0 exited with a non-zero status of 1.
ERROR 2018-10-08 12:18:23 +0100 service Traceback (most recent call last):
ERROR 2018-10-08 12:18:23 +0100 service   File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
ERROR 2018-10-08 12:18:23 +0100 service     "__main__", mod_spec)
ERROR 2018-10-08 12:18:23 +0100 service   File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
ERROR 2018-10-08 12:18:23 +0100 service     exec(code, run_globals)
ERROR 2018-10-08 12:18:23 +0100 service   File "/root/.local/lib/python3.5/site-packages/trainer/model.py", line 167, in <module>
ERROR 2018-10-08 12:18:23 +0100 service     train_model(train_file=train_file, test_file=test_file, job_dir=job_dir, **arguments)
ERROR 2018-10-08 12:18:23 +0100 service   File "/root/.local/lib/python3.5/site-packages/trainer/model.py", line 59, in train_model
ERROR 2018-10-08 12:18:23 +0100 service     model = fit_model(model, train_g, test_g, callbacks)
ERROR 2018-10-08 12:18:23 +0100 service   File "/root/.local/lib/python3.5/site-packages/trainer/model.py", line 124, in fit_model
ERROR 2018-10-08 12:18:23 +0100 service     model.fit_generator(**params)
ERROR 2018-10-08 12:18:23 +0100 service   File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/_impl/keras/engine/training.py", line 1598, in fit_generator
ERROR 2018-10-08 12:18:23 +0100 service     initial_epoch=initial_epoch)
ERROR 2018-10-08 12:18:23 +0100 service   File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/_impl/keras/engine/training_generator.py", line 231, in fit_generator
ERROR 2018-10-08 12:18:23 +0100 service     callbacks.on_epoch_end(epoch, epoch_logs)
ERROR 2018-10-08 12:18:23 +0100 service   File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/_impl/keras/callbacks.py", line 95, in on_epoch_end
ERROR 2018-10-08 12:18:23 +0100 service     callback.on_epoch_end(epoch, logs)
ERROR 2018-10-08 12:18:23 +0100 service   File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/_impl/keras/callbacks.py", line 468, in on_epoch_end
ERROR 2018-10-08 12:18:23 +0100 service     self.model.save(filepath, overwrite=True)
ERROR 2018-10-08 12:18:23 +0100 service   File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/_impl/keras/engine/network.py", line 1126, in save
ERROR 2018-10-08 12:18:23 +0100 service     save_model(self, filepath, overwrite, include_optimizer)
ERROR 2018-10-08 12:18:23 +0100 service   File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/_impl/keras/engine/saving.py", line 75, in save_model
ERROR 2018-10-08 12:18:23 +0100 service     raise ImportError('`save_model` requires h5py.')
ERROR 2018-10-08 12:18:23 +0100 service ImportError: `save_model` requires h5py.

Answer 1

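Yes, you need to install the h5py package.

HDF5 files, which h5py reads and writes, are the container Keras uses to store the trained model. If h5py is not installed, the model cannot be saved.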
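Because the error only surfaces when the checkpoint callback first tries to save, at the end of the first epoch, it can help to check for h5py before training starts. Here is a minimal sketch; the helper names `h5py_is_installed` and `require_h5py` are made up for illustration and are not part of Keras or TensorFlow:

```python
import importlib.util


def h5py_is_installed():
    """Return True if the h5py package can be found in this environment."""
    return importlib.util.find_spec("h5py") is not None


def require_h5py():
    """Raise before training starts if h5py is missing (hypothetical helper)."""
    if not h5py_is_installed():
        raise ImportError(
            "h5py is not installed, so Keras cannot write .hdf5 checkpoints. "
            "Install it with `pip install h5py`."
        )
```

Calling `require_h5py()` at the top of the training script, before `fit_generator`, would make the job fail immediately rather than after a full epoch.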

Pre-built h5py wheels can be installed from PyPI with pip:

$ pip install h5py
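Note that the traceback comes from a Cloud ML Engine job, so running pip on your own machine does not change what is installed on the training workers. One common way to handle this is to declare h5py as a dependency of the trainer package so that it gets installed when the package is deployed. This sketch assumes a setup.py sits next to your trainer/ directory; the package name and version are placeholders:

```python
# setup.py (assumed to live next to the trainer/ package directory)
from setuptools import find_packages, setup

# Extra packages the training code needs on the workers.
REQUIRED_PACKAGES = ["h5py"]

setup(
    name="trainer",      # placeholder package name
    version="0.1",       # placeholder version
    packages=find_packages(),
    install_requires=REQUIRED_PACKAGES,
    include_package_data=True,
)
```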

Cannot save Keras checkpoint on CloudML (1.8). Error: ImportError: `save_model` requires h5py
