Airflow - Broken DAG - Timeout

Time: 2018-05-24 09:17:26

Tags: airflow airflow-scheduler

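I have a DAG that executes a function which connects to a Postgres DB, deletes the contents of a table, and then inserts a new data set.

I am trying this locally, and I see that when I try to run it, the web server takes a long time to connect and in most cases does not succeed. However, as part of the connection process it seems to be executing the queries from the back end. Since I have a delete function, I see the data getting deleted from the table (basically one of the functions gets executed) even though I have not scheduled the script or started it manually. Could someone advise what I am doing wrong here?

One error that pops up in the UI is

Broken DAG: [/Users/user/airflow/dags/dwh_sample23.py] Timeout

There is also an "i" next to the dag id in the UI that says: This DAG isn't available in the webserver DagBag object. Here is the code I am using: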

## Third party Library Imports
import pandas as pd
import psycopg2
import airflow
from airflow import DAG
from airflow.operators import BashOperator
from datetime import datetime, timedelta
from sqlalchemy import create_engine
import io


# Following are defaults which can be overridden later on
default_args = {
    'owner': 'admin',
    'depends_on_past': False,
    'start_date': datetime(2018, 5, 21),
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}

dag = DAG('dwh_sample23', default_args=default_args)


#######################
## Login to DB

def db_login():
    ''' This function connects to the Data Warehouse and returns the cursor to execute queries '''
    global dwh_connection
    try:
        dwh_connection = psycopg2.connect(" dbname = 'dbname' user = 'user' password = 'password' host = 'hostname' port = '5439' sslmode = 'require' ")
    except:
        print("I am unable to connect to the database.")
    print('Success')
    return(dwh_connection)

def tbl1_del():
    ''' This function takes clears all rows from tbl1 '''
    cur = dwh_connection.cursor()
    cur.execute("""DELETE FROM tbl1;""")
    dwh_connection.commit()


def pop_tbl1():
    ''' This function populates all rows in tbl1 '''
    cur = dwh_connection.cursor()
    cur.execute(""" INSERT INTO tbl1
    select id,name,price from tbl2;""")
    dwh_connection.commit()



db_login()
tbl1_del()
pop_tbl1()
dwh_connection.close()

##########################################


t1 = BashOperator(
    task_id='DB_Connect',
    python_callable=db_login(),
    bash_command='python3 ~/airflow/dags/dwh_sample23.py',
    dag=dag)

t2 = BashOperator(
    task_id='del',
    python_callable=tbl1_del(),
    bash_command='python3 ~/airflow/dags/dwh_sample23.py',
    dag=dag)


t3 = BashOperator(
    task_id='populate',
    python_callable=pop_tbl1(),
    bash_command='python3 ~/airflow/dags/dwh_sample23.py',
    dag=dag)


t1.set_downstream(t2)
t2.set_downstream(t3)

Can anyone help? Thanks.

2 Answers:

Answer 0: (score: 1)

Instead of BashOperator, you can use PythonOperator and call db_login(), tbl1_del(), and pop_tbl1() from PythonOperator:

## Third party Library Imports
import pandas as pd
import psycopg2
import airflow
from airflow import DAG
# Airflow 1.x import path; in Airflow 2.x the class lives in airflow.operators.python
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
from sqlalchemy import create_engine
import io


# Following are defaults which can be overridden later on
default_args = {
    'owner': 'admin',
    'depends_on_past': False,
    'start_date': datetime(2018, 5, 21),
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}

dag = DAG('dwh_sample23', default_args=default_args)


#######################
## Login to DB

def db_login():
    ''' This function connects to the Data Warehouse and returns the connection '''
    try:
        dwh_connection = psycopg2.connect(" dbname = 'dbname' user = 'user' password = 'password' host = 'hostname' port = '5439' sslmode = 'require' ")
    except psycopg2.Error:
        print("I am unable to connect to the database.")
        raise  # fail the task instead of continuing without a connection
    print('Success')
    return dwh_connection

def check_login():
    ''' Task callable (helper added here): checks that the Data Warehouse is reachable '''
    # Tasks may run in separate processes, so a connection cannot be shared between them
    db_login().close()

def tbl1_del():
    ''' This function clears all rows from tbl1 '''
    dwh_connection = db_login()
    try:
        cur = dwh_connection.cursor()
        cur.execute("""DELETE FROM tbl1;""")
        dwh_connection.commit()
    finally:
        dwh_connection.close()


def pop_tbl1():
    ''' This function populates all rows in tbl1 '''
    dwh_connection = db_login()
    try:
        cur = dwh_connection.cursor()
        cur.execute(""" INSERT INTO tbl1
        select id,name,price from tbl2;""")
        dwh_connection.commit()
    finally:
        dwh_connection.close()


# No calls at module level: anything here would run every time the file is loaded

##########################################


t1 = PythonOperator(
    task_id='DB_Connect',
    python_callable=check_login,  # the function itself, without parentheses
    dag=dag)

t2 = PythonOperator(
    task_id='del',
    python_callable=tbl1_del,
    dag=dag)


t3 = PythonOperator(
    task_id='populate',
    python_callable=pop_tbl1,
    dag=dag)


t1.set_downstream(t2)
t2.set_downstream(t3)
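
Whether `python_callable` receives `db_login` or `db_login()` matters: with parentheses, the function body runs while the task is being defined, and the operator is handed the return value rather than the function. The following is a minimal sketch of that difference in plain Python. `FakeOperator` is a hypothetical stand-in written for this illustration, not an Airflow class, and `tbl1_del` here only records that it ran.

```python
class FakeOperator:
    """Hypothetical stand-in for an operator; it only stores the callable."""

    def __init__(self, task_id, python_callable):
        self.task_id = task_id
        self.python_callable = python_callable

    def execute(self):
        # Runs the stored callable, as would happen at task run time
        return self.python_callable()


calls = []  # records every time the function body actually runs


def tbl1_del():
    calls.append("tbl1_del")
    return "deleted"


# Passing the function object: nothing runs while the task is being defined
good = FakeOperator(task_id="del", python_callable=tbl1_del)
calls_after_good_definition = list(calls)

# Calling the function: its body runs right here, and its return value is stored
bad = FakeOperator(task_id="del_called_early", python_callable=tbl1_del())
calls_after_bad_definition = list(calls)

print(calls_after_good_definition)  # []
print(calls_after_bad_definition)   # ['tbl1_del']
print(callable(good.python_callable), callable(bad.python_callable))  # True False
```

In a DAG file, the second form would mean the delete happens each time the file is loaded, whether or not the DAG is scheduled.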

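Answer 1: (score: 0)

This is quite old, but we hit this error in prod, I found this question, and I thought it would be good for it to have an answer.

Some of the code is executed during DAG load, i.e. you are actually running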

db_login()
tbl1_del()
pop_tbl1()
dwh_connection.close()
##########################################

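in the webserver and scheduler loops, when they load the dag definition from the file. I believe you did not mean for that to happen. Removing these 4 lines should take care of it. Note that arguments written as python_callable=db_login(), with parentheses, also call the function while the file is loading, so pass the function names without parentheses as well.

In general, do not place calls to functions that you want the executor to run at file/module level, because they get invoked whenever the interpreter of the scheduler/webserver loads the file to get the dag definition.

Try putting the following in your dag file, then check the webserver logs to see what happens.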
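
To see why module-level calls behave this way, here is a small sketch using only the standard library. It writes a hypothetical DAG-like file to a temporary directory and loads it by path with `importlib`, which is assumed here to be roughly comparable to how a DAG file gets loaded. The file name, the `events` list, and the function names are made up for the illustration.

```python
import importlib.util
import os
import tempfile

# Contents of a hypothetical DAG-like file: two functions are defined,
# but only one of them is also called at module level
SOURCE = '''
events = []

def delete_rows():
    events.append("delete_rows ran")

def insert_rows():
    events.append("insert_rows ran")

delete_rows()  # module level: runs on every load of the file
'''


def load_module(path, name="sample_dag_file"):
    """Load a Python file by path and execute its module-level code."""
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_dag_file.py")
    with open(path, "w") as handle:
        handle.write(SOURCE)
    module = load_module(path)

events_after_load = list(module.events)
print(events_after_load)  # ['delete_rows ran']
```

Only the function that was called at module level ran during the load; `insert_rows` was merely defined and stays idle until something calls it.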

from time import sleep
def do_some_printing():
    print(1111111)
    sleep(60)

do_some_printing()
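
A rough way to check whether a file is slow to load, without going through the webserver, is to time the load directly. The sketch below is an illustration under assumptions: `time_module_load` is a hypothetical helper, the two sample files are generated on the fly, and the sleep is shortened from 60 seconds to 0.3 seconds so it finishes quickly. As far as I know, the limit behind the "Timeout" message is the `dagbag_import_timeout` setting in airflow.cfg, so a load time near that value would be a warning sign.

```python
import importlib.util
import os
import tempfile
import time


def time_module_load(path, name):
    """Return the seconds spent executing a file's module-level code."""
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    started = time.perf_counter()
    spec.loader.exec_module(module)
    return time.perf_counter() - started


# Slow variant: the function is called at module level
SLOW_SOURCE = '''
from time import sleep

def do_some_printing():
    sleep(0.3)  # shortened stand-in for sleep(60)

do_some_printing()
'''

# Fast variant: the function is only defined, never called during load
FAST_SOURCE = '''
from time import sleep

def do_some_printing():
    sleep(0.3)
'''

timings = {}
with tempfile.TemporaryDirectory() as tmp:
    for label, source in (("slow", SLOW_SOURCE), ("fast", FAST_SOURCE)):
        path = os.path.join(tmp, label + "_dag.py")
        with open(path, "w") as handle:
            handle.write(source)
        timings[label] = time_module_load(path, name=label + "_dag")

for label, seconds in timings.items():
    print(label, round(seconds, 3))
```

The slow variant takes at least as long as its module-level sleep, while the variant that only defines the function loads almost instantly.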