I want to launch Dataproc jobs in response to log files arriving in a GCS bucket. I also don't want to keep a persistent cluster running, since new log files only arrive a few times a day and the cluster would sit idle most of the time.
Answer 0 (score: 2)
I can use the WorkflowTemplate API to manage the cluster lifecycle for me. With Dataproc Workflows, I don't have to poll for the cluster to be created, poll for the jobs, or do any error handling.
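For reference, the same kind of inline workflow can also be instantiated from the command line without a Cloud Function. This is a sketch under assumptions: the bucket, Pig script, cluster name, and region below are illustrative, and on gcloud releases of that era the command lived under the beta component (gcloud beta dataproc ...):

```shell
# Write an inline workflow template equivalent to the request body used below
# (a managed cluster plus one Pig step), then instantiate it.
cat > workflow.yaml <<'EOF'
placement:
  managedCluster:
    clusterName: my-cluster
    config:
      gceClusterConfig:
        zoneUri: us-central1-a
jobs:
- stepId: step1
  pigJob:
    queryFileUri: gs://my-bucket/my-script.pig
EOF

# Creates the cluster, runs the step, and deletes the cluster when done.
gcloud dataproc workflow-templates instantiate-from-file \
  --file=workflow.yaml \
  --region=us-central1
```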
Here is my Cloud Function, set to trigger on a Cloud Storage bucket's Finalize/Create event:

index.js:
    exports.startWorkflow = (event, callback) => {
      const {google} = require('googleapis');

      const region = 'global';
      const zone = 'us-central1-a';
      const clusterName = 'my-cluster';

      const file = event.data;
      console.log("Event: ", file);

      if (!file.name) {
        throw "Skipped processing file!";
      }

      const queryFileUri = "gs://" + file.bucket + "/" + file.name;

      console.log("Creating auth client: ");
      google.auth.getApplicationDefault(
        (err, authClient, projectId) => {
          if (authClient.createScopedRequired && authClient.createScopedRequired()) {
            authClient = authClient.createScoped([
              'https://www.googleapis.com/auth/cloud-platform',
              'https://www.googleapis.com/auth/userinfo.email'
            ]);
          }

          const request = {
            parent: "projects/" + projectId + "/regions/" + region,
            resource: {
              "placement": {
                "managedCluster": {
                  "clusterName": clusterName,
                  "config": {
                    "gceClusterConfig": {
                      "zoneUri": zone // Can be omitted if using a regional endpoint (like us-central1, not global)
                    }
                  }
                }
              },
              "jobs": [
                {
                  "stepId": "step1",
                  "pigJob": {
                    "queryFileUri": queryFileUri
                  },
                  "prerequisiteStepIds": []
                }
              ]
            }
          };

          const dataproc = google.dataproc({version: 'v1beta2', auth: authClient});
          dataproc.projects.regions.workflowTemplates.instantiateInline(
            request, (err, result) => {
              if (err) {
                throw err;
              }
              console.log(result);
              callback();
            });
        });
    };
Make sure to set the Function to execute to startWorkflow.
package.json
    {
      "name": "dataproc-workflow",
      "version": "1.0.0",
      "dependencies": {
        "googleapis": "30.0.0"
      }
    }
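The answer does not show how the function gets wired to the bucket, so for completeness, a rough sketch of a deploy command (bucket name, region, and runtime version are my assumptions, not from the answer; nodejs8 matches gcloud releases of that era):

```shell
# Deploy the function so it fires on object finalize/create in the bucket.
gcloud functions deploy startWorkflow \
  --entry-point=startWorkflow \
  --runtime=nodejs8 \
  --trigger-resource=my-log-bucket \
  --trigger-event=google.storage.object.finalize \
  --region=us-central1
```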
Answer 1 (score: 0)
You can put the gcloud commands in a shell script or a Docker RUN command to create the Dataproc cluster, submit the job, and delete the Dataproc cluster (note the --quiet or -q option on delete):

    gcloud dataproc clusters create devops-poc-dataproc-cluster --subnet default --zone us-central1-a --master-machine-type n1-standard-1 --master-boot-disk-size 200 --num-workers 2 --worker-machine-type n1-standard-2 --worker-boot-disk-size 200 --image-version 1.3-deb9 --project gcp-project-212501 --service-account=service-id1@gcp-project-212501.iam.gserviceaccount.com

    sleep 60 && gcloud dataproc jobs submit pyspark /dev_app/spark_poc/wordCountSpark.py --cluster=devops-poc-dataproc-cluster -- gs://gcp-project-212501-docker_bucket/input/ gs://gcp-project-212501-docker_bucket/output/

    gcloud dataproc clusters delete -q devops-poc-dataproc-cluster
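One caveat with the script approach: if the job submission fails, the delete step never runs and the idle cluster keeps billing. A minimal sketch (same illustrative cluster and bucket names as above) that uses a shell trap so the cluster is torn down however the job exits:

```shell
#!/usr/bin/env bash
set -euo pipefail

CLUSTER=devops-poc-dataproc-cluster

# Always delete the cluster on exit, even if the job submission fails.
trap 'gcloud dataproc clusters delete -q "$CLUSTER"' EXIT

gcloud dataproc clusters create "$CLUSTER" \
  --subnet default --zone us-central1-a \
  --num-workers 2 --image-version 1.3-deb9

gcloud dataproc jobs submit pyspark /dev_app/spark_poc/wordCountSpark.py \
  --cluster="$CLUSTER" \
  -- gs://gcp-project-212501-docker_bucket/input/ gs://gcp-project-212501-docker_bucket/output/
```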