I have a file that defines a DAG object:
dags/my_dag.py
from airflow import DAG
from datetime import datetime
default_args = {
    'owner': 'pilota',
    'depends_on_past': False,
    'start_date': datetime(2019, 10, 1),
    'email': ['some@email.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 0,
}

bts_dag = DAG(
    'hist_data_etl', default_args=default_args, schedule_interval='@once')
Then, in another file, I import the DAG I created and define my tasks:
from ingestion.airflow_home.dags.my_dag import bts_dag
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from ingestion.datatransformer import fetch_and_transform_bts_data_col
NUM_ENGINES = 4
template_command = '''
ipcluster start -n {{ params.cluster }}
sleep 5
'''

start_iparallel_cluster = BashOperator(
    task_id='start_cluster',
    bash_command=template_command,
    retries=3,
    # the key here must match the name referenced in the Jinja template above
    params={'cluster': NUM_ENGINES},
    dag=bts_dag)

import_hist_bts_data_task = PythonOperator(
    task_id='fetch_transform_hist_col',
    python_callable=fetch_and_transform_bts_data_col,
    op_kwargs={
        'bucket': 'some-bucket', 'path': 'hello/', 'num_files': 1
    },
    dag=bts_dag)
start_iparallel_cluster >> import_hist_bts_data_task
Sanity check:
$ airflow list_dags
yields:
-------------------------------------------------------------------
DAGS
-------------------------------------------------------------------
hist_data_etl
But
$ airflow list_tasks hist_data_etl
outputs none of my tasks. Somehow Airflow is not registering the tasks that belong to the DAG I defined in the other file.
Please help :)
Answer (score: 2)
Airflow only registers the tasks that are attached to a DAG at the moment it parses the file defining that DAG; when it parses my_dag.py it sees bts_dag before any tasks have been added, and the file that attaches your tasks is never executed during that parse. With a few changes to how the files are structured, however, you can make this work.
First file
# dag_object_creator.py
from airflow import DAG
from datetime import datetime
default_args = {
    'owner': 'pilota',
    'depends_on_past': False,
    'start_date': datetime(2019, 10, 1),
    'email': ['some@email.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 0,
}

def create_dag_object():
    bts_dag = DAG(dag_id='hist_data_etl',
                  default_args=default_args,
                  schedule_interval='@once')
    return bts_dag
Second file
# tasks_creator.py
# this import statement is problematic
# from ingestion.airflow_home.dags.my_dag import bts_dag
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from ingestion.datatransformer import fetch_and_transform_bts_data_col
NUM_ENGINES = 4
template_command = '''
ipcluster start -n {{ params.cluster }}
sleep 5
'''

def create_bash_task(bts_dag):
    start_iparallel_cluster = BashOperator(
        task_id='start_cluster',
        bash_command=template_command,
        retries=3,
        # the key here must match the name referenced in the Jinja template above
        params={'cluster': NUM_ENGINES},
        dag=bts_dag)
    return start_iparallel_cluster

def create_python_task(bts_dag):
    import_hist_bts_data_task = PythonOperator(
        task_id='fetch_transform_hist_col',
        python_callable=fetch_and_transform_bts_data_col,
        op_kwargs={
            'bucket': 'pilota-ml-raw-store', 'path': 'flights/', 'num_files': 1
        },
        dag=bts_dag)
    return import_hist_bts_data_task
Third file
# dag_definition_file.py
import dag_object_creator
import tasks_creator
# create dag object
# the contents of 'dag_object_creator.py' could be put here directly;
# I just split things into separate files for clarity
bts_dag = dag_object_creator.create_dag_object()
# create tasks
start_iparallel_cluster = tasks_creator.create_bash_task(bts_dag)
import_hist_bts_data_task = tasks_creator.create_python_task(bts_dag)
# chaining tasks
start_iparallel_cluster >> import_hist_bts_data_task
The layout above enforces the following behaviour:
- the dag-parsing process begins by parsing only dag_definition_file.py (the other two files are skipped, since they create no DAG at global scope)
- those files are parsed when the import statements execute
- the DAG and task objects are created at global scope when the dag-/task-creation statements execute
So everything lines up, and this implementation should work (not tested, but based on anecdotal knowledge).
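As a quick sanity check, a minimal DagBag sketch (assuming dag_definition_file.py sits inside the configured dags folder):

# ask Airflow's DagBag which tasks it sees for the DAG
from airflow.models import DagBag

dag_bag = DagBag()                      # parses every file in the dags folder
dag = dag_bag.get_dag('hist_data_etl')  # dag_id defined in create_dag_object()
print(dag.task_ids)                     # should list 'start_cluster' and 'fetch_transform_hist_col'

Re-running airflow list_tasks hist_data_etl should now show both task ids as well.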
Suggested reading: