I'm trying to set up Airflow to manage our ETL processes. I launched an EC2 instance from the Amazon Linux 2 AMI, created a user named airflow, and moved my code to /home/airflow/airflow (so the DAGs live in ~/airflow/dags, and so on). It is set up as follows (credentials and sensitive information removed):
Airflow environment file for systemd, /etc/sysconfig/airflow:

SCHEDULER_RUNS=5
#Airflow specific settings
AIRFLOW_HOME="/home/airflow/airflow/"
AIRFLOW_CONN_REDSHIFT_CONNECTION=""
AIRFLOW_CONN_S3_CONNECTION=""
AIRFLOW_CONN_S3_LOGS_CONNECTION=""
AIRFLOW__CORE__FERNET_KEY=""
Airflow systemd service configuration files, in /usr/lib/systemd/system/ (symlinked from /home/airflow/.airflow_config/):

airflow-scheduler.service:

-rw-r--r-- 1 root root 1.3K Feb 21 16:18 airflow-scheduler.service
[Unit]
Description=Airflow scheduler daemon
After=network.target postgresql.service mysql.service redis.service rabbitmq-server.service
Wants=postgresql.service mysql.service redis.service rabbitmq-server.service
[Service]
EnvironmentFile=/etc/sysconfig/airflow
User=airflow
Group=airflow
Type=simple
ExecStart=/usr/bin/bash -c ' source /home/airflow/.env/bin/activate ; source /home/airflow/.bashrc ; airflow scheduler'
Restart=always
RestartSec=5s
[Install]
WantedBy=multi-user.target
airflow-webserver.service (symlinked from /home/airflow/.airflow_config/):

-rw-r--r-- 1 root root 1.4K Feb 20 14:38 airflow-webserver.service
[Unit]
Description=Airflow webserver daemon
After=network.target postgresql.service mysql.service redis.service rabbitmq-server.service
Wants=postgresql.service mysql.service redis.service rabbitmq-server.service
[Service]
EnvironmentFile=/etc/sysconfig/airflow
User=airflow
Group=airflow
Type=simple
ExecStart=/usr/bin/bash -c 'source /home/airflow/.env/bin/activate ; source /home/airflow/.bashrc ; airflow webserver -p 8080 --pid /run/airflow/webserver.pid'
Restart=on-failure
RestartSec=5s
PrivateTmp=true
[Install]
WantedBy=multi-user.target
Airflow user .bashrc file, /home/airflow/.bashrc:
# .bashrc
# Source global definitions
if [ -f /etc/bashrc ]; then
. /etc/bashrc
fi
# User specific aliases and functions
# Aliases
alias python="python3"
alias pip="pip3"
alias airflow_venv="source $HOME/.env/bin/activate"
#Airflow specific settings
export AIRFLOW_HOME="/home/airflow/airflow/"
export AIRFLOW_CONN_REDSHIFT_CONNECTION=""
export AIRFLOW_CONN_S3_CONNECTION=""
export AIRFLOW_CONN_S3_LOGS_CONNECTION=""
export AIRFLOW__CORE__FERNET_KEY=""
# Credentials
export EXTERNAL_SERVICE_CREDENTIAL=""
export EXTERNAL_SERVICE_PASSWORD=""
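For context, the credential variables above are read inside the task callables at run time, roughly like this (a simplified sketch, not my actual code; the function name and the service call are placeholders):

import os

def call_external_service(**context):
    # Placeholder for what the real task does: build a client for the
    # external service using credentials taken from the process environment.
    credential = os.environ["EXTERNAL_SERVICE_CREDENTIAL"]
    password = os.environ["EXTERNAL_SERVICE_PASSWORD"]
    # ... use credential/password to call the external service ...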
Airflow config file, airflow.cfg (linked from /home/airflow/.airflow_config/):

-rw-r--r-- 1 airflow airflow 5.3K Feb 12 17:45 airflow.cfg
[core]
airflow_home = /home/airflow/airflow
dags_folder = /home/airflow/airflow/dags
base_log_folder = /home/airflow/airflow/logs
plugins_folder = /home/airflow/airflow/plugins
sql_alchemy_conn =
child_process_log_directory = /home/airflow/airflow/logs/scheduler
executor = LocalExecutor
remote_logging = True
remote_log_conn_id = s3_logs_connection
remote_base_log_folder = s3://my-bucket-here
encrypt_s3_logs = False
DAG default args:

default_args = {
    'owner': 'airflow',
    'depends_on_past': True,
    'retry_on_failure': True,
    'task_concurrency': 1,
    'start_date': datetime(2019, 2, 19),
    'max_active_runs': 1}

dag_name_here = DAG(
    "dag_name_here", default_args=default_args, schedule_interval=timedelta(days=1))
Now, the problem I'm running into is this: in a DAG with several tasks, each task uses different OS environment variables, i.e. credentials (defined only in .bashrc) or connections (defined in both /etc/sysconfig/airflow and .bashrc), and sometimes it's the first task that fails, sometimes the second, sometimes the third, and so on. That means that during a backfill, with 3 DagRuns of the DAG running in parallel, I might see the third task run fine while the second task fails to pick up the environment variables.

For example, one task might be CreateStagingRedshiftTable, which succeeds, and then the next one, PopulateStagingTable, might return a Connection does not exist error, even though they both use the same connection.

I've tried no quotes, single quotes, and double quotes in the env file, and exporting and not exporting the vars in .bashrc and .bash_profile, and I keep running into the same error.
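In case it's relevant, one thing I can do to narrow this down is to log, from inside a task, which of the expected variables the worker process actually sees, e.g. (sketch only; the variable names are the ones from my env file and .bashrc above):

import os

def dump_airflow_env(**context):
    # Print whether each expected variable is visible to the process that
    # actually executes the task.
    expected = [
        "AIRFLOW_HOME",
        "AIRFLOW_CONN_REDSHIFT_CONNECTION",
        "AIRFLOW_CONN_S3_CONNECTION",
        "AIRFLOW_CONN_S3_LOGS_CONNECTION",
        "EXTERNAL_SERVICE_CREDENTIAL",
        "EXTERNAL_SERVICE_PASSWORD",
    ]
    for name in expected:
        print("%s is %s" % (name, "set" if os.environ.get(name) else "NOT set"))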
Any ideas or help would be much appreciated.