How to create a Dataproc cluster, run a job, and delete the cluster from a Cloud Function

Asked: 2018-05-15 21:14:49

Tags: google-cloud-dataproc

I want to launch Dataproc jobs in response to log files arriving in a GCS bucket. I also don't want to keep a persistent cluster running, because new log files arrive only a few times per day and the cluster would sit idle most of the time.

2 Answers:

Answer 0 (score: 2):

You can use the WorkflowTemplate API to manage the cluster lifecycle for you. With Dataproc Workflows you don't have to poll for the cluster to be created, poll for the jobs to finish, or do any error handling.
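For reference, the same inline instantiation is also exposed through the gcloud CLI (a sketch; template.yaml is a hypothetical file describing the same managed cluster and Pig job, and the workflow-templates command group was in beta at the time):

    gcloud beta dataproc workflow-templates instantiate-from-file --file template.yaml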

Here's my Cloud Function. It is set to trigger on the Cloud Storage bucket's Finalize/Create event:

index.js

exports.startWorkflow = (event, callback) => {

  const {google} = require('googleapis');

  // Dataproc region and zone, plus a name for the short-lived managed cluster
  const region = 'global'
  const zone = 'us-central1-a'
  const clusterName = 'my-cluster'

  const file = event.data;
  console.log("Event: ", file);

  // Nothing to process if the event carries no object name
  if (!file.name) {
    throw "Skipped processing file!";
  }

  // The newly arrived file is used as the Pig query to run
  const queryFileUri = "gs://" + file.bucket + "/" + file.name

  console.log("Creating auth client: ");
  google.auth.getApplicationDefault(
    (err, authClient, projectId) => {
      if (err) {
        throw err;
      }
      if (authClient.createScopedRequired && authClient.createScopedRequired()) {
        authClient = authClient.createScoped([
          'https://www.googleapis.com/auth/cloud-platform',
          'https://www.googleapis.com/auth/userinfo.email'
        ]);
      }

      // Inline workflow template: an ephemeral managed cluster plus the job to run on it
      const request = {
        parent: "projects/" + projectId + "/regions/" + region,
        resource: {
          "placement": {
            "managedCluster": {
              "clusterName": clusterName,
              "config": {
                "gceClusterConfig": {
                  "zoneUri" : zone, // Can be omitted if using regional endpoint (like us-central1-a, not global)
                }
              }
            }
          },
          "jobs": [
            {
              "stepId": "step1",
              "pigJob": {
                "queryFileUri": queryFileUri,
              },
              "prerequisiteStepIds": [],
            }
          ]
        }
      };

      // instantiateInline creates the cluster, runs the jobs, and deletes the cluster when done
      const dataproc = google.dataproc({ version: 'v1beta2', auth: authClient});
      dataproc.projects.regions.workflowTemplates.instantiateInline(
        request, (err, result) => {
          if (err) {
            throw err;
          }
          console.log(result);
          callback();
        });
    });
};

Make sure to set the "Function to execute" field to startWorkflow.

package.json

{
  "name": "dataproc-workflow",
  "version": "1.0.0",
  "dependencies":{ "googleapis": "30.0.0"}
}
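
For completeness, a function like this can also be deployed from the CLI. A minimal sketch, assuming the nodejs8 runtime available at the time and a placeholder bucket name my-log-bucket:

    gcloud functions deploy startWorkflow \
      --runtime nodejs8 \
      --trigger-resource my-log-bucket \
      --trigger-event google.storage.object.finalize

The google.storage.object.finalize event fires once a new object (or a new generation of an existing object) is written to the bucket, which matches the Finalize/Create trigger described above.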

Answer 1 (score: 0):

You can put the GCLOUD commands in a shell script or a Docker RUN command (a combined script sketch follows the individual commands below) to:

  1. Provision the Dataproc cluster
  2. Execute the Spark job
  3. Delete the Dataproc cluster (note the --quiet or -q option needed to delete without a confirmation prompt)

    Provision the Dataproc cluster (takes 5+ minutes):

    gcloud dataproc clusters create devops-poc-dataproc-cluster --subnet default --zone us-central1-a --master-machine-type n1-standard-1 --master-boot-disk-size 200 --num-workers 2 --worker-machine-type n1-standard-2 --worker-boot-disk-size 200 --image-version 1.3-deb9 --project gcp-project-212501 --service-account=service-id1@gcp-project-212501.iam.gserviceaccount.com

    Submit the Spark job:

    sleep 60 && gcloud dataproc jobs submit pyspark /dev_app/spark_poc/wordCountSpark.py --cluster=devops-poc-dataproc-cluster -- gs://gcp-project-212501-docker_bucket/input/ gs://gcp-project-212501-docker_bucket/output/

    Delete the Dataproc cluster:

    gcloud dataproc clusters delete -q devops-poc-dataproc-cluster
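
Tying the three steps together, a minimal shell-script sketch (the cluster, bucket, and project names are the placeholders from the commands above; gcloud dataproc clusters create blocks until the cluster is ready):

    #!/bin/bash
    set -e

    CLUSTER=devops-poc-dataproc-cluster

    # 1. Provision the ephemeral cluster (5+ minutes); blocks until ready
    gcloud dataproc clusters create "$CLUSTER" --subnet default --zone us-central1-a \
      --master-machine-type n1-standard-1 --master-boot-disk-size 200 \
      --num-workers 2 --worker-machine-type n1-standard-2 --worker-boot-disk-size 200 \
      --image-version 1.3-deb9 --project gcp-project-212501

    # 2. Submit the job; gcloud waits for it to finish
    gcloud dataproc jobs submit pyspark /dev_app/spark_poc/wordCountSpark.py \
      --cluster="$CLUSTER" -- \
      gs://gcp-project-212501-docker_bucket/input/ gs://gcp-project-212501-docker_bucket/output/

    # 3. Tear the cluster down without a confirmation prompt
    gcloud dataproc clusters delete -q "$CLUSTER"

Note that with set -e a failing job leaves the cluster running; wrapping the delete in a trap ... EXIT handler would guarantee cleanup.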