How to run a shell script on a Google Dataproc cluster with an Airflow DataprocOperator

Asked: 2019-02-01 13:37:24

Tags: google-cloud-platform airflow google-cloud-dataproc

After the cluster has been set up, I am trying to run a shell script on the Dataproc cluster. I am not sure which operator to use, or which arguments to pass to it, in order to trigger the .sh file once the cluster is up and running.

Sample Airflow code used to create the cluster:

from airflow.contrib.operators.dataproc_operator import DataprocClusterCreateOperator

create_cluster = DataprocClusterCreateOperator(
    task_id='create_dataproc_cluster',
    cluster_name=DAG_CONFIG['DATAPROC']['cluster_name'],
    project_id=DAG_CONFIG['PROJECT_ID'],
    num_workers=DAG_CONFIG['DATAPROC']['num_workers'],
    zone=DAG_CONFIG['DATAPROC']['zone'],
    subnetwork_uri=DAG_CONFIG['DATAPROC']['subnetwork_uri'],
    master_machine_type='n1-standard-1',
    master_disk_type='pd-standard',
    master_disk_size=50,
    worker_machine_type='n1-standard-1',
    worker_disk_type='pd-standard',
    worker_disk_size=50,
    auto_delete_ttl=DAG_CONFIG['DATAPROC']['auto_delete_ttl'],
    storage_bucket=DAG_CONFIG['GCS_STAGING']['bucket_name'],
    dag=DAG_ID)

This is where I need to submit the shell script, via DataProcHadoopOperator or whatever operator is appropriate:

Shell_Task = DataProcHadoopOperator(
    task_id='shell_Submit',
    main_jar='???',
    project_id='xxx',
    arguments= [??],
    job_name='{{task.task_id}}_{{ds_nodash}}',
    cluster_name=DAG_CONFIG['DATAPROC']['cluster_name'],
    gcp_conn_id='google_cloud_default',
    region=DAG_CONFIG['DATAPROC']['zone'],
    dag=DAG_ID)

Any help would be greatly appreciated.

1 Answer:

Answer 0 (score: 1):

To run a shell script on every Dataproc VM during cluster creation, you should use Dataproc Initialization actions.

You can specify them via the DataprocClusterCreateOperator:

DataprocClusterCreateOperator(
    # ...
    init_actions_uris = ['gs://<BUCKET>/path/to/init/action.sh'],
    # ...
)
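
Putting this together with the cluster-creation task from the question, a minimal sketch might look as follows. The gs:// path is a placeholder for wherever you upload your script, and the remaining arguments are simply the ones from the question:

from airflow.contrib.operators.dataproc_operator import DataprocClusterCreateOperator

create_cluster = DataprocClusterCreateOperator(
    task_id='create_dataproc_cluster',
    cluster_name=DAG_CONFIG['DATAPROC']['cluster_name'],
    project_id=DAG_CONFIG['PROJECT_ID'],
    num_workers=DAG_CONFIG['DATAPROC']['num_workers'],
    zone=DAG_CONFIG['DATAPROC']['zone'],
    # The initialization action is a plain shell script uploaded to GCS;
    # Dataproc runs it on each node while the cluster is being created.
    # Placeholder URI -- replace with the actual location of your script.
    init_actions_uris=['gs://<BUCKET>/path/to/init/action.sh'],
    storage_bucket=DAG_CONFIG['GCS_STAGING']['bucket_name'],
    dag=DAG_ID)

Note that initialization actions run only during cluster creation; they are not a way to submit a script to a cluster that is already up and running.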