I am using the EMR CreateJobFlow, AddSteps, StepSensor and TerminateJobFlow operators in a DAG to spin up an EMR cluster, add steps (two Spark applications and an s3-dist-cp), and terminate the cluster once all steps are complete or any one of them fails. This works when the DAG has two steps (a Spark application first, then the dist-cp); however, with a three-step DAG the cluster runs the first step successfully and then terminates without ever moving on to the second and third steps.
With some digging I can see that Airflow "pokes" the steps to check whether they are still running, and in this case it appears to consider the job flow "successful" after only one step has completed.
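For reference, that poke is essentially a DescribeStep call against the EMR API. A rough way to reproduce the check by hand with boto3 (the region, cluster ID and step ID below are placeholders, not values from my setup):

import boto3

# manually check the state of a single EMR step, as the sensor does on each poke
emr = boto3.client('emr', region_name='us-east-1')
response = emr.describe_step(ClusterId='j-XXXXXXXXXXXX', StepId='s-XXXXXXXXXXXX')
print(response['Step']['Status']['State'])  # PENDING, RUNNING, COMPLETED, FAILED, ...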
My Spark applications are very simple. The first creates a DataFrame and writes it to local HDFS. The second reads that data back from HDFS, joins it with another dataset, and writes the result to HDFS again. The third step is an s3-dist-cp that copies the data from HDFS to S3. All three steps run successfully in spark-shell as well as when submitted as spark-submit jobs. I have also cloned the EMR cluster by hand (without Airflow) and watched every step complete without any errors, so this is not an EMR or Spark problem.
The DAG is below:
from datetime import timedelta

import airflow
from airflow import DAG
from airflow.contrib.operators.emr_create_job_flow_operator \
    import EmrCreateJobFlowOperator
from airflow.contrib.operators.emr_add_steps_operator \
    import EmrAddStepsOperator
from airflow.contrib.sensors.emr_step_sensor import EmrStepSensor
from airflow.contrib.operators.emr_terminate_job_flow_operator \
    import EmrTerminateJobFlowOperator

DEFAULT_ARGS = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': airflow.utils.dates.days_ago(2)
}
SPARK_TEST_STEPS = [
    {
        'Name': 'monthly_agg',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['spark-submit',
                     '--deploy-mode',
                     'cluster',
                     '--class',
                     'AggApp',
                     's3://jar1.jar']
        }
    },
    {
        'Name': 'monthly_agg2',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['spark-submit',
                     '--deploy-mode',
                     'cluster',
                     '--class',
                     'SimpleApp',
                     's3://jar2.jar']
        }
    },
    {
        'Name': 'copy-data',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['s3-dist-cp',
                     '--src',
                     '/tempo',
                     '--dest',
                     's3://mydata/']
        }
    }
]
JOB_FLOW_OVERRIDES = {
    'Instances': {
        'Ec2SubnetId': 'subnet-mysubnetid',
        'InstanceGroups': [
            {
                'Name': 'Master nodes',
                'Market': 'ON_DEMAND',
                'InstanceRole': 'MASTER',
                'InstanceType': 'r4.2xlarge',
                'InstanceCount': 1
            },
            {
                'Name': 'Slave nodes',
                'Market': 'ON_DEMAND',
                'InstanceRole': 'CORE',
                'InstanceType': 'r4.2xlarge',
                'InstanceCount': 8,
                'EbsConfiguration': {
                    'EbsBlockDeviceConfigs': [{
                        'VolumeSpecification': {'SizeInGB': 128, 'VolumeType': 'gp2'},
                        'VolumesPerInstance': 1
                    }],
                    'EbsOptimized': True
                }
            }
        ]
    },
    'Name': 'airflow-monthly_agg_custom',
    'Configurations': [
        {
            'Classification': 'spark-defaults',
            'Properties': {
                'spark.sql.crossJoin.enabled': 'true',
                'spark.serializer': 'org.apache.spark.serializer.KryoSerializer',
                'spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version': '2',
                'maximizeResourceAllocation': 'true'
            },
            'Configurations': []
        },
        {
            'Classification': 'spark-hive-site',
            'Properties': {
                'hive.metastore.client.factory.class': 'com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory'
            },
            'Configurations': []
        }
    ]
}
dag = DAG(
    'monthly_agg_custom',
    default_args=DEFAULT_ARGS,
    dagrun_timeout=timedelta(hours=4),
    schedule_interval='@once'
)

cluster_creator = EmrCreateJobFlowOperator(
    task_id='create_job_flow',
    job_flow_overrides=JOB_FLOW_OVERRIDES,
    aws_conn_id='aws_default',
    emr_conn_id='emr_default',
    dag=dag
)

step_adder = EmrAddStepsOperator(
    task_id='add_steps',
    job_flow_id="{{ task_instance.xcom_pull('create_job_flow', key='return_value') }}",
    aws_conn_id='aws_default',
    steps=SPARK_TEST_STEPS,
    dag=dag
)

step_checker = EmrStepSensor(
    task_id='watch_step',
    job_flow_id="{{ task_instance.xcom_pull('create_job_flow', key='return_value') }}",
    step_id="{{ task_instance.xcom_pull('add_steps', key='return_value')[0] }}",
    aws_conn_id='aws_default',
    dag=dag
)

cluster_remover = EmrTerminateJobFlowOperator(
    task_id='remove_cluster',
    job_flow_id="{{ task_instance.xcom_pull('create_job_flow', key='return_value') }}",
    aws_conn_id='aws_default',
    dag=dag
)

cluster_creator.set_downstream(step_adder)
step_adder.set_downstream(step_checker)
step_checker.set_downstream(cluster_remover)
Answer (score: 0)
The problem is that you are handing all of the steps to a single step adder, while the EmrStepSensor watches only one of them, the first step ID it pulls from XCom, so as soon as that first step completes the sensor succeeds and the cluster is terminated.
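The add_steps task's return value is the list of IDs of the steps it submitted, so the [0] in the sensor's template always picks the first one. Schematically (these IDs are invented for illustration):

# what xcom_pull('add_steps', key='return_value') returns, roughly
step_ids = ['s-AAAAAAAAAAAAA', 's-BBBBBBBBBBBBB', 's-CCCCCCCCCCCCC']
step_ids[0]   # the only step the question's sensor ever watches
step_ids[-1]  # the last step, which is what actually needs to finish before termination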
The solution is to separate the steps and give the ID of the last step to the EmrStepSensor. Alternatively, you can split only the last step out from the other steps into its own step adder (step_adder_actual_step) and give that one to the EmrStepSensor:
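The pre_step and actual_step lists used below are not defined in this answer; assuming the SPARK_TEST_STEPS list from the question, a minimal split could look like this:

# split the question's step list: everything but the last step is added first,
# the final s3-dist-cp step is added separately so the sensor can watch it
pre_step = SPARK_TEST_STEPS[:-1]
actual_step = SPARK_TEST_STEPS[-1:]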
step_adder_pre_step = EmrAddStepsOperator(
    task_id='pre_step',
    job_flow_id="{{ task_instance.xcom_pull('create_job_flow', key='return_value') }}",
    aws_conn_id='aws_default',
    steps=pre_step,
    dag=dag
)

step_adder_actual_step = EmrAddStepsOperator(
    task_id='actual_step',
    job_flow_id="{{ task_instance.xcom_pull('create_job_flow', key='return_value') }}",
    aws_conn_id='aws_default',
    steps=actual_step,
    dag=dag
)

step_checker = EmrStepSensor(
    task_id='watch_step',
    job_flow_id="{{ task_instance.xcom_pull('create_job_flow', key='return_value') }}",
    step_id="{{ task_instance.xcom_pull('actual_step', key='return_value')[0] }}",
    aws_conn_id='aws_default',
    dag=dag
)

cluster_remover = EmrTerminateJobFlowOperator(
    task_id='remove_cluster',
    job_flow_id="{{ task_instance.xcom_pull('create_job_flow', key='return_value') }}",
    aws_conn_id='aws_default',
    dag=dag
)

cluster_creator >> step_adder_pre_step >> step_adder_actual_step >> step_checker >> cluster_remover
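If you prefer to keep a single EmrAddStepsOperator, another option (a sketch of my own, not something the operators require) is to chain one EmrStepSensor per submitted step instead of the single watch_step sensor, so the cluster is only removed after every step has finished. This assumes the SPARK_TEST_STEPS, step_adder, cluster_remover and dag objects from the question:

# one sensor per submitted step, chained in order before the terminate task
previous = step_adder
for i in range(len(SPARK_TEST_STEPS)):
    watcher = EmrStepSensor(
        task_id='watch_step_{}'.format(i),
        job_flow_id="{{ task_instance.xcom_pull('create_job_flow', key='return_value') }}",
        step_id="{{ task_instance.xcom_pull('add_steps', key='return_value')[" + str(i) + "] }}",
        aws_conn_id='aws_default',
        dag=dag
    )
    previous.set_downstream(watcher)
    previous = watcher
previous.set_downstream(cluster_remover)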