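I have a DAG that runs a function which connects to a Postgres DB, deletes the contents of a table, and then inserts a new data set.
I am trying this locally, and when I try to run it the web server takes a long time to come up and in most cases does not succeed. However, as part of the startup process it seems to execute the queries from the backend. Since I have a delete function, I can see the data being deleted from the table (basically one of the functions gets executed) even though I have not scheduled the script or triggered it manually. Could someone tell me what I am doing wrong here?
One error that pops up in the UI is:
Broken DAG: [/Users/user/airflow/dags/dwh_sample23.py] Timeout
There is also an "i" next to the DAG id in the UI that says "This DAG isn't available in the web server's DAG object." Below is the code I am using: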
## Third party Library Imports
import pandas as pd
import psycopg2
import airflow
from airflow import DAG
from airflow.operators import BashOperator
from datetime import datetime, timedelta
from sqlalchemy import create_engine
import io
# Following are defaults which can be overridden later on
default_args = {
    'owner': 'admin',
    'depends_on_past': False,
    'start_date': datetime(2018, 5, 21),
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}
dag = DAG('dwh_sample23', default_args=default_args)
#######################
## Login to DB
def db_login():
    ''' This function connects to the Data Warehouse and returns the cursor to execute queries '''
    global dwh_connection
    try:
        dwh_connection = psycopg2.connect(" dbname = 'dbname' user = 'user' password = 'password' host = 'hostname' port = '5439' sslmode = 'require' ")
    except:
        print("I am unable to connect to the database.")
    print('Success')
    return(dwh_connection)
def tbl1_del():
    ''' This function takes clears all rows from tbl1 '''
    cur = dwh_connection.cursor()
    cur.execute("""DELETE FROM tbl1;""")
    dwh_connection.commit()
def pop_tbl1():
    ''' This function populates all rows in tbl1 '''
    cur = dwh_connection.cursor()
    cur.execute(""" INSERT INTO tbl1
    select id,name,price from tbl2;""")
    dwh_connection.commit()
db_login()
tbl1_del()
pop_tbl1()
dwh_connection.close()
##########################################
t1 = BashOperator(
    task_id='DB_Connect',
    python_callable=db_login(),
    bash_command='python3 ~/airflow/dags/dwh_sample23.py',
    dag=dag)
t2 = BashOperator(
    task_id='del',
    python_callable=tbl1_del(),
    bash_command='python3 ~/airflow/dags/dwh_sample23.py',
    dag=dag)
t3 = BashOperator(
    task_id='populate',
    python_callable=pop_tbl1(),
    bash_command='python3 ~/airflow/dags/dwh_sample23.py',
    dag=dag)
t1.set_downstream(t2)
t2.set_downstream(t3)
Can anyone help? Thanks.
Answer 0 (score: 1)
Instead of using BashOperator, you can use PythonOperator and run db_login(), tbl1_del(), and pop_tbl1() as PythonOperator tasks:
## Third party Library Imports
import pandas as pd
import psycopg2
import airflow
from airflow import DAG
# Airflow 1.x import path; on Airflow 2+ this is airflow.operators.python
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
from sqlalchemy import create_engine
import io
# Following are defaults which can be overridden later on
default_args = {
    'owner': 'admin',
    'depends_on_past': False,
    'start_date': datetime(2018, 5, 21),
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}
dag = DAG('dwh_sample23', default_args=default_args)
#######################
## Login to DB
def connect():
    ''' Opens and returns a new connection to the Data Warehouse '''
    return psycopg2.connect(" dbname = 'dbname' user = 'user' password = 'password' host = 'hostname' port = '5439' sslmode = 'require' ")
def db_login():
    ''' This function checks that the Data Warehouse is reachable '''
    # Each task runs in its own process, so a connection opened here cannot be
    # shared with the other tasks through a global; every task opens its own.
    conn = connect()  # raises, failing the task, if the database is unreachable
    conn.close()
    print('Success')
def tbl1_del():
    ''' This function clears all rows from tbl1 '''
    conn = connect()
    try:
        cur = conn.cursor()
        cur.execute("""DELETE FROM tbl1;""")
        conn.commit()
    finally:
        conn.close()
def pop_tbl1():
    ''' This function populates all rows in tbl1 '''
    conn = connect()
    try:
        cur = conn.cursor()
        cur.execute(""" INSERT INTO tbl1
        select id,name,price from tbl2;""")
        conn.commit()
    finally:
        conn.close()
# No module-level calls to db_login(), tbl1_del() or pop_tbl1() here: anything at
# module level runs every time the scheduler or web server loads this file.
##########################################
# Pass the function objects (no parentheses); the operator calls them at run time.
t1 = PythonOperator(
    task_id='DB_Connect',
    python_callable=db_login,
    dag=dag)
t2 = PythonOperator(
    task_id='del',
    python_callable=tbl1_del,
    dag=dag)
t3 = PythonOperator(
    task_id='populate',
    python_callable=pop_tbl1,
    dag=dag)
t1.set_downstream(t2)
t2.set_downstream(t3)
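Answer 1 (score: 0)
This is quite old, but we hit this error in prod, I found this question, and I think it would be nice for it to have an answer.
Some of the code is executed while the DAG is being loaded, i.e. you actually run
db_login()
tbl1_del()
pop_tbl1()
dwh_connection.close()
inside the web server and scheduler loops, whenever they load the DAG definition from the file. I believe you didn't mean for that to happen. If you just remove these 4 lines, everything should work fine.
In general, do not put calls to functions that you want the executor to run at the file/module level, because they are invoked whenever the scheduler's or web server's interpreter loads the file to get the DAG definition.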
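The same load-time effect applies to arguments such as python_callable=db_login() in the question: the parentheses call the function while the file is being loaded, and the operator only receives its return value. Here is a minimal sketch of the difference in plain Python, with no Airflow required. FakeOperator and delete_rows are hypothetical stand-ins invented for this illustration; they are not Airflow classes or functions.

```python
calls = []  # records every time the task function actually runs


def delete_rows():
    """Stand-in for a task body such as tbl1_del()."""
    calls.append("delete_rows")


class FakeOperator:
    """Hypothetical stand-in for an operator: stores the callable, runs it later."""

    def __init__(self, task_id, python_callable):
        self.task_id = task_id
        self.python_callable = python_callable

    def execute(self):
        return self.python_callable()


# Passing the function object: nothing runs while the file is being loaded.
deferred = FakeOperator("del", python_callable=delete_rows)
ran_after_deferred = list(calls)

# Calling the function: it runs right here, at load time, and the operator
# receives its return value (None) instead of something it can call later.
eager = FakeOperator("del_eager", python_callable=delete_rows())
ran_after_eager = list(calls)

# The deferred task body runs only when the operator is executed.
deferred.execute()
```

After the two operators are built, ran_after_deferred is still empty while ran_after_eager already holds one entry, which mirrors rows being deleted without anyone having triggered the DAG.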
Try putting this in your DAG file and check the web server logs to see what happens.
from time import sleep
def do_some_printing():
    print(1111111)
    sleep(60)
do_some_printing()
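If you still want to be able to run the file by hand (for example with python3 from a shell), one common option is to put such calls behind an `if __name__ == "__main__":` guard. The sketch below assumes that the DAG loader imports the file as an ordinary module rather than running it as a script; the file name guarded_dag.py and the executed list are made up for the illustration.

```python
import importlib.util
import os
import runpy
import tempfile
import textwrap

# A tiny stand-in for a DAG file whose only module-level call is guarded.
source = textwrap.dedent('''
    executed = []

    def do_some_printing():
        executed.append("ran")

    if __name__ == "__main__":
        do_some_printing()
''')

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "guarded_dag.py")
    with open(path, "w") as fh:
        fh.write(source)

    # Loaded the way an import would load it: __name__ is "guarded_dag",
    # so the guarded call is skipped.
    spec = importlib.util.spec_from_file_location("guarded_dag", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    imported_result = list(module.executed)

    # Run as a script: __name__ is "__main__", so the guarded call runs.
    script_globals = runpy.run_path(path, run_name="__main__")
    script_result = list(script_globals["executed"])

print(imported_result, script_result)
```

With the guard in place, importing the file leaves executed empty, whereas running it as a script records one call.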