Training and deploying a SageMaker ML model using AWS Lambda (NodeJS)

Time: 2019-07-11 02:38:14

Tags: amazon-web-services aws-lambda amazon-sagemaker aws-sdk-js

I am using AWS Lambda (NodeJS) to create a SageMaker training job and then deploy the model, using the SageMaker JavaScript SDK.

I am following this AWS JavaScript SDK documentation:

https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/SageMaker.html

I create the training job with the following script.

Create Training Job:
=====================

    let TrainingJobName = 'Training-' + curr_date_time
    let TrainingImage   = 'XXXXXX.dkr.ecr.us-east-1.amazonaws.com/xxxx:latest'
    let S3Uri           = 's3://xxx.xxxx.sagemaker/csv'

    console.log(`TrainingJobName: ${TrainingJobName}`);

    let params = {
        AlgorithmSpecification: { /* required */
            TrainingInputMode: 'File', /* required */
            TrainingImage: TrainingImage
        },
        OutputDataConfig: { /* required */
            S3OutputPath: 's3://xxx.xxxx.sagemaker/xxxx/output', /* required */
        },
        ResourceConfig: { /* required */
            InstanceCount: 1, /* required */
            InstanceType: 'ml.m4.xlarge', /* required */
            VolumeSizeInGB: 1, /* required */
        },
        RoleArn: 'arn:aws:iam::xxxxx:role/service-role/AmazonSageMaker-ExecutionRole-xxxx', /* required */
        StoppingCondition: { /* required */
            MaxRuntimeInSeconds: 86400
        },
        TrainingJobName: TrainingJobName, /* required */
        InputDataConfig: [
            {
                ChannelName: 'training', /* required */
                DataSource: { /* required */
                    S3DataSource: {
                        S3DataType: 'S3Prefix', /* required */
                        S3Uri: S3Uri, /* required */
                        S3DataDistributionType: 'FullyReplicated'
                    }
                },
                CompressionType: null,
                ContentType: '',
                RecordWrapperType: null,
            }
        ]
    };

    return await sagemaker.createTrainingJob(params).promise();

After creating the training job, I query its status with the SageMaker describeTrainingJob function. The status I get back is "InProgress".

After that, I call the SageMaker waitFor function with the following waiter to wait for the training job to finish:

https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/SageMaker.html#trainingJobCompletedOrStopped-waiter

let waitFor_result = await sagemaker.waitFor('trainingJobCompletedOrStopped', {TrainingJobName: training_job_name}).promise();
console.log(`waitFor_result : ${JSON.stringify(waitFor_result)}`);

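I found that sagemaker waitFor creates a second training job before the first one has finished, and then keeps creating further training jobs with the same job name.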


I think this is caused by the StoppingCondition parameter (MaxRuntimeInSeconds: 86400) passed to createTrainingJob.

Is there a way to create a single training job and return the result once that job has completed?
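One way to keep the wait bounded is to poll describeTrainingJob directly instead of relying on the waiter's defaults. As far as the SDK v2 documentation describes it, the trainingJobCompletedOrStopped waiter checks every 120 seconds for up to 180 attempts, which is far longer than Lambda's 15-minute maximum timeout. If the function times out while waiting and the invocation is retried, the handler would call createTrainingJob again; whether that is what happens here is an assumption, not something the question confirms. The sketch below is illustrative only: `pollTrainingJob`, `delayMs` and `maxAttempts` are hypothetical names, and `client` is assumed to be an `AWS.SageMaker` instance (or anything exposing the same `describeTrainingJob(params).promise()` shape).

```javascript
// Terminal values of TrainingJobStatus as returned by describeTrainingJob.
const TERMINAL_STATUSES = new Set(['Completed', 'Failed', 'Stopped']);

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

/**
 * Poll describeTrainingJob until the job reaches a terminal status or the
 * attempt budget is exhausted. This function never creates a job; it only reads.
 *
 * client          - hypothetical: an AWS.SageMaker (SDK v2) instance or a compatible stub
 * trainingJobName - name of an already created training job
 * delayMs         - pause between polls (assumed default, tune to your timeout)
 * maxAttempts     - upper bound on describeTrainingJob calls
 */
async function pollTrainingJob(client, trainingJobName, { delayMs = 30000, maxAttempts = 20 } = {}) {
    let lastStatus;
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        const description = await client
            .describeTrainingJob({ TrainingJobName: trainingJobName })
            .promise();
        lastStatus = description.TrainingJobStatus;
        if (TERMINAL_STATUSES.has(lastStatus)) {
            return { done: true, status: lastStatus, attempts: attempt, description };
        }
        // Do not sleep after the final attempt; return to the caller instead.
        if (attempt < maxAttempts) {
            await sleep(delayMs);
        }
    }
    // Budget exhausted: report the last seen status so the caller can decide what to do.
    return { done: false, status: lastStatus, attempts: maxAttempts };
}
```

Choosing `delayMs * maxAttempts` below the function's configured timeout lets the handler return normally with `done: false` rather than being cut off mid-wait. For training runs longer than a single invocation can cover, the status check would have to happen in a later invocation.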

Update:
=======

I am following "Scheduling the Training of a SageMaker Model with a Lambda Function": https://www.youtube.com/watch?v=FJaykbAtGTM

If I use the following code in the Lambda function, the training job is created successfully.

let training_job_result = await start_model_training();
console.log(`Sagemaker training result : ${JSON.stringify(training_job_result)}`);

let training_job_arn = training_job_result["TrainingJobArn"];
let training_job_name = training_job_arn.split("/")[1];


let desc_training_job = await sagemaker.describeTrainingJob({TrainingJobName: training_job_name}).promise();
let desc_status = desc_training_job["TrainingJobStatus"];
console.log(`Training job desc_status 1 : ${JSON.stringify(desc_status)}`);

But I need to wait until the training job has completed before calling the SageMaker deployment methods to create or update the endpoint.

If I use the following code, it keeps creating multiple training jobs and the Lambda function never terminates.

let waitFor_result = await sagemaker.waitFor('trainingJobCompletedOrStopped', {TrainingJobName: training_job_name}).promise();
console.log(`waitFor_result : ${JSON.stringify(waitFor_result)}`);


desc_training_job = await sagemaker.describeTrainingJob({TrainingJobName: training_job_name}).promise();
desc_status = desc_training_job["TrainingJobStatus"];
console.log(`Training job desc_status 2 : ${JSON.stringify(desc_status)}`);

Once training has finished, I want to deploy or update the endpoint.
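For the deploy step itself, a rough sketch using the SDK v2 calls createModel, createEndpointConfig, createEndpoint and updateEndpoint might look like the following. Several things here are assumptions rather than anything stated in the question: the function name `deployTrainedModel`, the `-model` / `-config` naming scheme, the `AllTraffic` variant name, reusing the training image as the inference image (which only works if the custom container also implements serving), and treating a `ValidationException` from describeEndpoint as "endpoint does not exist yet".

```javascript
/**
 * Create a model from a completed training job and create or update an endpoint.
 *
 * client - hypothetical: an AWS.SageMaker (SDK v2) instance or a compatible stub
 */
async function deployTrainedModel(client, { trainingJobName, endpointName, instanceType = 'ml.m4.xlarge' }) {
    const job = await client.describeTrainingJob({ TrainingJobName: trainingJobName }).promise();
    if (job.TrainingJobStatus !== 'Completed') {
        throw new Error(`Training job ${trainingJobName} is ${job.TrainingJobStatus}, not Completed`);
    }

    // Illustrative naming scheme, derived from the job name so reruns are traceable.
    const modelName = `${trainingJobName}-model`;
    const configName = `${trainingJobName}-config`;

    await client.createModel({
        ModelName: modelName,
        ExecutionRoleArn: job.RoleArn,
        PrimaryContainer: {
            // Assumption: the training image can also serve inference requests.
            Image: job.AlgorithmSpecification.TrainingImage,
            ModelDataUrl: job.ModelArtifacts.S3ModelArtifacts,
        },
    }).promise();

    await client.createEndpointConfig({
        EndpointConfigName: configName,
        ProductionVariants: [{
            VariantName: 'AllTraffic',
            ModelName: modelName,
            InitialInstanceCount: 1,
            InstanceType: instanceType,
        }],
    }).promise();

    // Assumption: describeEndpoint rejects with code 'ValidationException'
    // when the endpoint does not exist; any other error is rethrown.
    let exists = true;
    try {
        await client.describeEndpoint({ EndpointName: endpointName }).promise();
    } catch (err) {
        if (err.code !== 'ValidationException') throw err;
        exists = false;
    }

    const params = { EndpointName: endpointName, EndpointConfigName: configName };
    if (exists) {
        await client.updateEndpoint(params).promise();
    } else {
        await client.createEndpoint(params).promise();
    }
    return { modelName, configName, action: exists ? 'updated' : 'created' };
}
```

Both createEndpoint and updateEndpoint return before the endpoint is in service, so the handler does not need to wait for the endpoint either.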

1 answer:

Answer 0 (score: 0)

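I don't know JS very well, but shouldn't you be passing the ID of the training job you are waiting on? According to the documentation, this is the correct usage: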

var params = {
  TrainingJobName: 'STRING_VALUE' /* required */
};
sagemaker.waitFor('trainingJobCompletedOrStopped', params, function(err, data) {
  if (err) console.log(err, err.stack); // an error occurred
  else     console.log(data);           // successful response
});