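I am creating a DAG that runs a set of tasks predefined in a database. After the tasks run, I update their execution times so that they are not picked up again until they are next due. Each task is essentially a SQL unit test.
The DAG currently fails after the first run, and the UI shows this error:
Broken DAG: [/usr/local/airflow/src/dags/d06-query_validations/d06-query_validations_daily.py] list index out of range
Please help me find the problem.
What I have tried so far: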
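# Import paths assume Airflow 1.x; ReadTextFile, QueryValidationFlow and email_list are project-specific.
import logging
from datetime import datetime, timedelta
from airflow import DAG
from airflow.hooks.mysql_hook import MySqlHook
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.mysql_operator import MySqlOperator
from airflow.operators.subdag_operator import SubDagOperator
from airflow.utils import helpers
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2018, 11, 25, 8, 15),
    'wait_for_downstream': True,
    'email': email_list,
    'email_on_failure': True,
    'email_on_retry': False
}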
def getValidationsToRun():
    start_time = datetime.now()
    conn = MySqlHook(mysql_conn_id='mysql_main', kwargs={"charset": "utf8"})
    query = ReadTextFile('/d06-query_validations/get_validations.sql')
    logging.log(logging.INFO, "Extract Query={}".format(query))
    records = conn.get_pandas_df(query)
    logging.log(logging.INFO, "Extract completed. it took: {}".format(str(datetime.now() - start_time)))
    return records
def create_subdag(parent_dag_name, child_dag_name, validation):
    inner_dag = DAG(
        '%s.%s' % (parent_dag_name, child_dag_name),
        default_args=default_args.copy(),
        schedule_interval='@once'
    )
    QueryValidationFlow(
        dag=inner_dag,
        validation_name=validation.validationName,
        title=validation.messageTemplate,
        query=validation.query,
        expected_result=validation.expectedResult,
        source_db=validation.source,
        emails=validation.emailRecipients.split(',')
    )
    return inner_dag
def create_subdag_operator(parent_dag, validation):
    child_dag_name = 'subdag_{}'.format(validation.validationName)
    parent_dag_name = parent_dag.dag_id
    subdag = SubDagOperator(
        task_id=child_dag_name,
        dag=parent_dag,
        subdag=create_subdag(parent_dag_name, child_dag_name, validation)
    )
    return subdag
def create_subdag_operators(parent_dag, validations):
    subdag_list = [create_subdag_operator(parent_dag, row) for index, row in validations.iterrows()]
    # chain subdag operators together
    helpers.chain(*subdag_list)
    return subdag_list
# (top-level) DAG & operators
dag = DAG(dag_id='d06-query_validations', schedule_interval='0 * * * *',
          default_args=default_args, catchup=False)
curr_validations = getValidationsToRun()
curr_validation_ids = ",".join(["'%s'" % str(validationId) for validationId in curr_validations["validationId"]])
dummy_op_start = DummyOperator(task_id='d06-op_start', dag=dag)
subdag_ops = create_subdag_operators(dag, curr_validations)
update_execution_time = MySqlOperator(
    task_id='d06-update_execution_time',
    sql=ReadTextFile('/d06-query_validations/update_validations.sql').format(curr_validation_ids),
    mysql_conn_id='mysql_main',
    retries=5,
    execution_timeout=timedelta(minutes=2),
    retry_delay=60,
    dag=dag
)
dummy_op_start >> subdag_ops[0]
subdag_ops[-1] >> update_execution_time
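Answer 0 (score: 0)
FYI, the Airflow webserver and the Airflow scheduler repeatedly execute everything at the top level of each DAG file in order to work out what the DAG contains. This happens even for Python files in the DAG folder that do not produce a DAG, and for DAG files whose DAG has no schedule or has been disabled in the UI or the DB. Airflow does this because any Python file might generate new DAGs dynamically.
So this runs very often: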
def getValidationsToRun():
    start_time = datetime.now()
    conn = MySqlHook(mysql_conn_id='mysql_main', kwargs={"charset": "utf8"})
    query = ReadTextFile('/d06-query_validations/get_validations.sql')
    logging.log(logging.INFO, "Extract Query={}".format(query))
    records = conn.get_pandas_df(query)
    logging.log(logging.INFO, "Extract completed. it took: {}".format(str(datetime.now() - start_time)))
    return records
I trust you are checking the scheduler's logs, where these messages will show up.
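To make the point about top-level code concrete, here is a minimal standard-library sketch. It is not Airflow's actual loader: `load_file`, `count_top_level_runs` and the generated `fake_dag_file.py` are made up for illustration, and a log write stands in for the database query. It only shows that each fresh load of a file re-executes its module-level statements.

```python
import importlib.util
import os
import tempfile


def load_file(path):
    """Execute a Python file from scratch, the way a fresh parse of it would."""
    spec = importlib.util.spec_from_file_location("fake_dag_file", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def count_top_level_runs(times):
    """Load a stand-in DAG file `times` times and count its top-level runs."""
    with tempfile.TemporaryDirectory() as tmp:
        log_path = os.path.join(tmp, "queries.log")
        dag_path = os.path.join(tmp, "fake_dag_file.py")
        with open(dag_path, "w") as f:
            # Module-level statement standing in for getValidationsToRun().
            f.write("with open({!r}, 'a') as log:\n".format(log_path))
            f.write("    log.write('query executed\\n')\n")
        for _ in range(times):
            load_file(dag_path)
        with open(log_path) as log:
            return len(log.readlines())


runs = count_top_level_runs(3)
print(runs)
```

In the question's DAG file, the module-level call to `getValidationsToRun()` plays the role of the log write, so the query is issued on every parse and its result can differ from one parse to the next.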
I suspect the query sometimes returns no rows, in which case subdag_ops is an empty list and subdag_ops[0] is out of range.
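One way to guard against the empty case, sketched in plain Python so that it runs without Airflow. `link_tasks` is a hypothetical helper, and strings stand in for the operators; it builds the list of upstream/downstream pairs without ever indexing into a possibly empty list.

```python
def link_tasks(start, middle, end):
    """Return (upstream, downstream) pairs for start -> middle... -> end.

    `middle` may be empty, in which case start is linked straight to end.
    """
    sequence = [start] + list(middle) + [end]
    return list(zip(sequence, sequence[1:]))


edges_when_empty = link_tasks("d06-op_start", [], "d06-update_execution_time")
edges_with_two = link_tasks(
    "d06-op_start", ["subdag_a", "subdag_b"], "d06-update_execution_time"
)
print(edges_when_empty)
print(edges_with_two)
```

In the DAG file itself, the equivalent would be to test `if subdag_ops:` before using `subdag_ops[0]` and `subdag_ops[-1]`, and otherwise to set `dummy_op_start >> update_execution_time` directly.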
Also, this:
sql=ReadTextFile(
'/d06-query_validations/update_validations.sql').format(curr_validation_ids),
suggests you have not yet read up on templated fields and params. It should probably look more like:
sql='./d06-query_validations/update_validations.sql',
params={'val_ids': curr_validation_ids},
with {{ params.val_ids }} placed inside the SQL file where the IDs belong.