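I am manually building my trainer package with setuptools and uploading the distribution (.tar.gz) to a storage bucket in GCP (following the instructions here). I then created a service account and granted it the ML Engine Developer and Storage Object Admin roles.
I am trying to run the training package by calling the API from my Python code. The code follows exactly what is written here. However, I get an error message: HttpError 500 - Internal error encountered. Unfortunately, the error is too generic to debug, and no ML job is created in the console.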
from oauth2client.client import GoogleCredentials
from oauth2client.service_account import ServiceAccountCredentials
from googleapiclient import discovery
training_inputs = {'scaleTier': 'STANDARD_1',
                   'packageUris': ['gs://mybucket/trainerPackage-0.1.tar.gz'],
                   'pythonModule': 'trainer.task',
                   'args': ['--data-dir', 'gs://mybucket/data/evaluate_data.csv',
                            '--job-dir', 'gs://mybucket/output',
                            '--epochs', '10'
                            ],
                   'region': 'asia-east1',
                   'jobDir': 'gs://mybucket/output',
                   'runtimeVersion': '1.2'}
job_spec = {'jobId':'ml_test', 'trainingInput':training_inputs}
project_name = 'test-proj'
project_id = 'projects/{}'.format(project_name)
credentials_dict = {my_cred_from_json}
credentials = ServiceAccountCredentials.from_json_keyfile_dict(credentials_dict)
cloudml = discovery.build('ml', 'v1', credentials=credentials)
request = cloudml.projects().jobs().create(body=job_spec,
                                           parent=project_id)
try:
    response = request.execute()
except Exception as err:
    print('There was an error creating the training job.'
          ' Check the details:')
    print(err._get_reason())
    print(str(err))
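Since the 500 response gives no detail, one way to narrow things down is to sanity-check job_spec locally before sending it. The following is only a sketch: check_job_spec is a hypothetical helper, and both the job ID pattern (starts with a letter, then letters, digits, and underscores) and the list of required trainingInput fields are assumptions about what the service expects, not confirmed causes of this error.

```python
import re

# Assumption: job IDs start with a letter and contain only letters,
# digits, and underscores.
JOB_ID_PATTERN = re.compile(r'^[A-Za-z][A-Za-z0-9_]*$')

# Assumption: the fields treated as mandatory in this sketch.
REQUIRED_TRAINING_FIELDS = ('scaleTier', 'packageUris', 'pythonModule', 'region')


def check_job_spec(job_spec):
    """Return a list of problems found in job_spec (hypothetical helper)."""
    problems = []

    job_id = job_spec.get('jobId', '')
    if not JOB_ID_PATTERN.match(job_id):
        problems.append('jobId %r has an unexpected format' % job_id)

    training_input = job_spec.get('trainingInput', {})
    for field in REQUIRED_TRAINING_FIELDS:
        if not training_input.get(field):
            problems.append('trainingInput.%s is missing or empty' % field)

    # Package URIs are expected to point at Cloud Storage objects.
    for uri in training_input.get('packageUris', []):
        if not uri.startswith('gs://'):
            problems.append('package URI %r is not a gs:// path' % uri)

    return problems
```

Calling check_job_spec(job_spec) on the spec above returns an empty list, which at least suggests the request body is not malformed in any of these simple ways.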
Here is the structure of my trainer package:
trainerPackage
  - trainer
    - __init__.py
    - task.py
  - setup.py
Here is my setup.py:
from setuptools import setup, find_packages

REQUIRED_PACKAGES = [
    'tensorflow >= 1.2',
    'pandas >= 0.20.3',
    'numpy >= 1.13.3'
]

setup(
    name='trainerPackage',
    version='0.1',
    install_requires=REQUIRED_PACKAGES,
    packages=find_packages(),
    include_package_data=True,
    description='This is an example trainer package.'
)
I have confirmed that my trainer code works correctly. It runs successfully both on my local machine and when submitted with gcloud ml-engine jobs submit training.
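Because the gcloud path works and the API path does not, it may help to compare the two submissions field by field. The sketch below builds the gcloud argument list that I would expect to correspond to a given trainingInput dict. build_gcloud_command is a hypothetical helper, and the mapping from API fields to CLI flags is my own reading of the two interfaces, so treat it as an assumption to verify against the command that actually succeeded.

```python
def build_gcloud_command(job_id, training_input):
    """Build a gcloud argument list mirroring training_input (hypothetical helper)."""
    command = [
        'gcloud', 'ml-engine', 'jobs', 'submit', 'training', job_id,
        # Assumed mapping of API fields to CLI flags.
        '--packages', ','.join(training_input['packageUris']),
        '--module-name', training_input['pythonModule'],
        '--region', training_input['region'],
        '--scale-tier', training_input['scaleTier'],
        '--runtime-version', training_input['runtimeVersion'],
        '--job-dir', training_input['jobDir'],
    ]

    # Arguments after a bare "--" are passed through to the trainer module.
    user_args = training_input.get('args', [])
    if user_args:
        command.append('--')
        command.extend(user_args)
    return command
```

Printing ' '.join(build_gcloud_command('ml_test', training_inputs)) and comparing it with the command that succeeded should show whether any field differs between the two submissions, for example the region, the runtime version, or the package path.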