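I am submitting a training job through the REST API. The job trains, but when it reaches the save step it fails with the error "The replica master 0 exited with a non-zero status of 1."
I have checked the IAM permissions for the service account, and it has the following permissions:
Here is a more detailed traceback of the actual error.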
Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/.local/lib/python3.5/site-packages/trainer/task.py", line 223, in <module>
    dispatch(**parse_args.__dict__)
  File "/root/.local/lib/python3.5/site-packages/trainer/task.py", line 133, in dispatch
    callbacks=callbacks)
  File "/root/.local/lib/python3.5/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "/root/.local/lib/python3.5/site-packages/keras/models.py", line 1110, in fit_generator
    initial_epoch=initial_epoch)
  File "/root/.local/lib/python3.5/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "/root/.local/lib/python3.5/site-packages/keras/engine/training.py", line 1849, in fit_generator
    callbacks.on_epoch_begin(epoch)
  File "/root/.local/lib/python3.5/site-packages/keras/callbacks.py", line 63, in on_epoch_begin
    callback.on_epoch_begin(epoch, logs)
  File "/root/.local/lib/python3.5/site-packages/trainer/task.py", line 74, in on_epoch_begin
    copy_file_to_gcs(self.job_dir, checkpoints[-1])
  File "/root/.local/lib/python3.5/site-packages/trainer/task.py", line 150, in copy_file_to_gcs
    output_f.write(input_f.read())
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/lib/io/file_io.py", line 126, in read
    pywrap_tensorflow.ReadFromStream(self._read_buf, length, status))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/lib/io/file_io.py", line 94, in _prepare_value
    return compat.as_str_any(val)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/compat.py", line 106, in as_str_any
    return as_str(value)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/compat.py", line 84, in as_text
    return bytes_or_text.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte
I'm not entirely sure why this happens. The code is taken from the sample project on Google's git page, and nothing was changed. Here is my REST call:
{
  "jobId": "training_20",
  "trainingInput": {
    "scaleTier": "BASIC",
    "packageUris": ["gs://MY_BUCKET/census.tar.gz"],
    "pythonModule": "trainer.task",
    "args": [
      "--train-files",
      "gs://MY_BUCKET/adult.data.csv",
      "--eval-files",
      "gs://MY_BUCKET/adult.test.csv",
      "--job-dir",
      "gs://MY_BUCKET/models",
      "--train-steps",
      "100",
      "--eval-steps",
      "10"
    ],
    "region": "europe-west1",
    "jobDir": "gs://MY_BUCKET/models",
    "runtimeVersion": "1.4",
    "pythonVersion": "3.5"
  }
}
This is the part of the code that saves the model:
# Unhappy hack to work around h5py not being able to write to GCS.
# Force snapshots and saves to local filesystem, then copy them over to GCS.
if job_dir.startswith("gs://"):
    census_model.save(CENSUS_MODEL)
    copy_file_to_gcs(job_dir, CENSUS_MODEL)
else:
    census_model.save(os.path.join(job_dir, CENSUS_MODEL))

# Convert the Keras model to TensorFlow SavedModel
model.to_savedmodel(census_model, os.path.join(job_dir, 'export'))

# h5py workaround: copy local models over to GCS if the job_dir is GCS.
def copy_file_to_gcs(job_dir, file_path):
    with file_io.FileIO(file_path, mode='r') as input_f:
        with file_io.FileIO(os.path.join(job_dir, file_path), mode='w+') as output_f:
            output_f.write(input_f.read())
Answer 0 (score: 4)
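After some further research, this appears to be a problem with how Google's sample opens the file when saving it. Originally the mode is given as r, as seen here: with file_io.FileIO(file_path, mode='r') as input_f:. Changing the mode to rb (binary) fixes the problem.
When the mode is set to r, Python tries to convert the file's bytes into a unicode string, assuming they are UTF-8. The first byte it reads, 0x89 at position 0, is not a valid UTF-8 start byte, so decoding fails and the job crashes. (0x89 is the first byte of the HDF5 file signature, which is consistent with the file being a binary model saved through h5py.) Alfe posted a more in-depth answer: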
error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
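As a rough illustration of the difference between the two modes, here is a minimal sketch that uses only Python's built-in open rather than TensorFlow's file_io.FileIO, so it is an analogy under that assumption and not the sample's actual code. The file names model.hdf5 and copy.hdf5, the helper names read_text, read_binary and copy_binary, and the padding bytes are made up for the example; only the eight-byte HDF5 signature is real.

```python
import os
import tempfile

# First eight bytes of every HDF5 file (the format signature).
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def read_text(path):
    """Read a file in text mode, decoding its bytes as UTF-8."""
    with open(path, mode="r", encoding="utf-8") as f:
        return f.read()


def read_binary(path):
    """Read a file in binary mode; no decoding is attempted."""
    with open(path, mode="rb") as f:
        return f.read()


def copy_binary(src, dst):
    """Copy a file byte for byte, analogous to using mode='rb' when reading."""
    with open(src, mode="rb") as input_f, open(dst, mode="wb") as output_f:
        output_f.write(input_f.read())


# Hypothetical stand-in for a saved model: the signature plus padding.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "model.hdf5")
with open(src_path, mode="wb") as f:
    f.write(HDF5_SIGNATURE + b"\x00" * 8)

# Text mode fails on the very first byte, as in the traceback.
try:
    read_text(src_path)
    text_mode_error = None
except UnicodeDecodeError as exc:
    text_mode_error = str(exc)
print(text_mode_error)

# Binary mode copies the same file without any decoding step.
dst_path = os.path.join(tmp_dir, "copy.hdf5")
copy_binary(src_path, dst_path)
print(read_binary(dst_path) == read_binary(src_path))
```

In the sample itself, the corresponding change is the one described in the answer: mode='rb' on the input file in copy_file_to_gcs.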