Question

我通过REST API提交培训工作。该过程能够训练，但当它到达保存部分时，它会导致错误The replica master 0 exited with a non-zero status of 1.错误。我已检查过服务帐户的IAM权限，并且具有以下权限：

Logs Writer
ML Engine Admin
Storage Admin
存储对象管理

这里是对实际错误的更深入追溯。

Traceback (most recent call last): 
File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main "__main__", mod_spec) 
File "/usr/lib/python3.5/runpy.py", line 85, in _run_code exec(code, run_globals) 
File "/root/.local/lib/python3.5/site-packages/trainer/task.py", line 223, in <module> dispatch(**parse_args.__dict__) 
File "/root/.local/lib/python3.5/site-packages/trainer/task.py", line 133, in dispatch callbacks=callbacks) 
File "/root/.local/lib/python3.5/site-packages/keras/legacy/interfaces.py", line 88, in wrapper return func(*args, **kwargs) 
File "/root/.local/lib/python3.5/site-packages/keras/models.py", line 1110, in fit_generator initial_epoch=initial_epoch) 
File "/root/.local/lib/python3.5/site-packages/keras/legacy/interfaces.py", line 88, in wrapper return func(*args, **kwargs) 
File "/root/.local/lib/python3.5/site-packages/keras/engine/training.py", line 1849, in fit_generator callbacks.on_epoch_begin(epoch) 
File "/root/.local/lib/python3.5/site-packages/keras/callbacks.py", line 63, in on_epoch_begin callback.on_epoch_begin(epoch, logs) 
File "/root/.local/lib/python3.5/site-packages/trainer/task.py", line 74, in on_epoch_begin copy_file_to_gcs(self.job_dir, checkpoints[-1]) 
File "/root/.local/lib/python3.5/site-packages/trainer/task.py", line 150, in copy_file_to_gcs output_f.write(input_f.read()) 
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/lib/io/file_io.py", line 126, in read pywrap_tensorflow.ReadFromStream(self._read_buf, length, status)) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/lib/io/file_io.py", line 94, in _prepare_value return compat.as_str_any(val) 
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/compat.py", line 106, in as_str_any return as_str(value) 
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/compat.py", line 84, in as_text return bytes_or_text.decode(encoding) UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte

我不完全确定为什么会发生这种情况。代码取自Googles git页面上的示例项目。没有任何改变。这是我的REST电话：

{
    "jobId": "training_20",
    "trainingInput": {
        "scaleTier": "BASIC",
        "packageUris": ["gs://MY_BUCKET/census.tar.gz"],
        "pythonModule": "trainer.task",
        "args": [
          "--train-files", 
          "gs://MY_BUCKET/adult.data.csv", 
          "--eval-files", 
          "gs://MY_BUCKET/adult.test.csv", 
          "--job-dir", 
          "gs://MY_BUCKET/models", 
          "--train-steps",
          "100",
          "--eval-steps",
          "10"],
        "region": "europe-west1",
        "jobDir": "gs://MY_BUCKET/models",
        "runtimeVersion": "1.4",
        "pythonVersion": "3.5"
    }
}

这是保存代码部分：

# Unhappy hack to work around h5py not being able to write to GCS.
# Force snapshots and saves to local filesystem, then copy them over to GCS.
  if job_dir.startswith("gs://"):
    census_model.save(CENSUS_MODEL)
    copy_file_to_gcs(job_dir, CENSUS_MODEL)
  else:
    census_model.save(os.path.join(job_dir, CENSUS_MODEL))

  # Convert the Keras model to TensorFlow SavedModel
  model.to_savedmodel(census_model, os.path.join(job_dir, 'export'))

# h5py workaround: copy local models over to GCS if the job_dir is GCS.
def copy_file_to_gcs(job_dir, file_path):
  with file_io.FileIO(file_path, mode='r') as input_f:
    with file_io.FileIO(os.path.join(job_dir, file_path), mode='w+') as output_f:
        output_f.write(input_f.read())

Answer 1

经过一些进一步的研究，谷歌决定如何保存文件似乎是一个问题。最初，它表示类型为r：如此处所示...... with file_io.FileIO(file_path, mode='r') as input_f:。通过将模式更改为rb（二进制），可以解决问题。

当模式设置为r时，python尝试将此字节数组（假设它的utf-8）转换为unicode字符串。虽然，当它遇到字节序列0x89 in position 0: invalid start byte时，它不遵循utf8约定，因此崩溃。 Alfe发布了一个更深入的回复：

error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

Google ML Engine：无法保存模型

1 个答案: