How do you access the output file of an Airflow task and use it as input for the next task?

Time: 2019-09-17 01:21:39

Tags: etl airflow

I have 4 tasks in my DAG, where t1, t2, and t3 are BashOperators and t4 is a PythonOperator. t1's command downloads a protein structure from the NCBI database; t2 takes that structure, runs a job, and outputs another structure; t3 takes t2's output, runs another job, and outputs a csv file; and t4 cleans and analyzes that csv file. My question is: where is the default location of the file downloaded by t1, and of the outputs of t2 and t3? When I run t1's command outside of Airflow, the file is downloaded into the directory the command is run from, but I can't seem to find the file when it runs under Airflow. Also, where does t2 look for its input file by default, and can we change where it looks?

# Here are t1 & t2:

t1 = BashOperator(
    task_id='get_pdb_1',
    bash_command='$SCHRODINGER/utilities/getpdb -r 3hfm',
    dag=dag)

# $SCHRODINGER/utilities/getpdb -r 3hfm
# SCHRODINGER points to a software installation and is set in my .bashrc; normally the above command downloads a structure 3hfm.pdb into the directory it's run from.

t2 = BashOperator(
    task_id='prepare_pdb_1',
    bash_command='$SCHRODINGER/utilities/prepwizard 3hfm.pdb test1.pdb',
    retries=3,
    dag=dag)

# $SCHRODINGER/utilities/prepwizard 3hfm.pdb test1.pdb
# This command takes the structure 3hfm.pdb as input and writes test1.pdb to the directory it's run from.
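What I think is happening can be reproduced outside Airflow with plain subprocess calls (a minimal sketch; no Schrodinger or Airflow needed, and the temp-dir-per-task behavior is my assumption based on the "Temporary script location" lines in the logs): if each command runs from its own freshly created temporary directory, then a file written by the first command is invisible to the second.

```python
import os
import subprocess
import tempfile

# "t1": writes a file into its own freshly created working directory.
with tempfile.TemporaryDirectory() as workdir_1:
    subprocess.run("echo DUMMY > 3hfm.pdb", shell=True, cwd=workdir_1,
                   check=True)
    files_after_t1 = os.listdir(workdir_1)  # the file exists here...

# "t2": runs from a *different* temporary directory, so the relative
# path 3hfm.pdb no longer resolves to the file "t1" created.
with tempfile.TemporaryDirectory() as workdir_2:
    result = subprocess.run("cat 3hfm.pdb", shell=True, cwd=workdir_2,
                            capture_output=True)

print(files_after_t1)     # ['3hfm.pdb']
print(result.returncode)  # non-zero: 3hfm.pdb is not in workdir_2
```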

Here t1 succeeds, and its log says it saved the file successfully, but I cannot find where the file was saved; t2 then fails because it cannot find the input file 3hfm.pdb that t1's command was supposed to download.

Output of t1:

[2019-09-16 12:47:16,767] {bash_operator.py:91} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_ID=schro_run
AIRFLOW_CTX_TASK_ID=get_pdb_1
AIRFLOW_CTX_EXECUTION_DATE=2019-09-16T19:46:28.978931+00:00
AIRFLOW_CTX_DAG_RUN_ID=manual__2019-09-16T19:46:28.978931+00:00
[2019-09-16 12:47:16,768] {bash_operator.py:105} INFO - Temporary script location: /var/folders/j0/gtzmrlh13v1660j7yq3zdt6r000991/T/airflowtmp0h31csx4/get_pdb_1kj71allt
[2019-09-16 12:47:16,768] {bash_operator.py:115} INFO - Running command: $SCHRODINGER/utilities/getpdb -r 3hfm
[2019-09-16 12:47:16,777] {bash_operator.py:124} INFO - Output:
[2019-09-16 12:47:18,707] {bash_operator.py:128} INFO - Downloading 3hfm...
[2019-09-16 12:47:19,001] {bash_operator.py:128} INFO - saved data to file: 3hfm.pdb
[2019-09-16 12:47:19,084] {bash_operator.py:132} INFO - Command exited with return code 0
[2019-09-16 12:47:20,754] {logging_mixin.py:95} INFO - [2019-09-16 12:47:20,754] {local_task_job.py:105} INFO - Task exited with return code 0

Output of t2:

[2019-09-16 13:04:13,867] {bash_operator.py:91} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_ID=schro_run
AIRFLOW_CTX_TASK_ID=prepare_pdb_1
AIRFLOW_CTX_EXECUTION_DATE=2019-09-16T19:46:28.978931+00:00
AIRFLOW_CTX_DAG_RUN_ID=manual__2019-09-16T19:46:28.978931+00:00
[2019-09-16 13:04:13,868] {bash_operator.py:105} INFO - Temporary script location: /var/folders/j0/gtzmrlh13v1660j7yq3zdt6r000991/T/airflowtmpgc8jzd_v/prepare_pdb_1wrbbup5b
[2019-09-16 13:04:13,868] {bash_operator.py:115} INFO - Running command: $SCHRODINGER/utilities/prepwizard 3hfm.pdb test1.pdb
[2019-09-16 13:04:13,876] {bash_operator.py:124} INFO - Output:
[2019-09-16 13:04:15,725] {bash_operator.py:128} INFO - Usage: $SCHRODINGER/utilities/prepwizard [options] inputfile outputfile
prepwizard_startup.py: error: Error: input file not found: 3hfm.pdb
[2019-09-16 13:04:15,832] {bash_operator.py:132} INFO - Command exited with return code 2
[2019-09-16 13:04:15,839] {taskinstance.py:1051} ERROR - Bash command failed
Traceback (most recent call last):
  File "/Users/chamiso/Documents/Random/randomRepos/airflow_practice/practice-airflow.env/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 926, in _run_raw_task
    result = task_copy.execute(context=context)
  File "/Users/chamiso/Documents/Random/randomRepos/airflow_practice/practice-airflow.env/lib/python3.7/site-packages/airflow/operators/bash_operator.py", line 136, in execute
    raise AirflowException("Bash command failed")
airflow.exceptions.AirflowException: Bash command failed
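One workaround I am considering (just a sketch; WORK_DIR is a directory name I made up, not anything Airflow provides) is to prefix every bash_command with a cd into a shared, fixed directory, so that all tasks read and write their files in the same place regardless of where Airflow launches them:

```python
# Hypothetical shared working directory for all tasks in this DAG.
WORK_DIR = "/tmp/schro_run"

def in_workdir(command):
    """Prefix a shell command so it always runs inside WORK_DIR."""
    return f"mkdir -p {WORK_DIR} && cd {WORK_DIR} && {command}"

t1_command = in_workdir("$SCHRODINGER/utilities/getpdb -r 3hfm")
t2_command = in_workdir("$SCHRODINGER/utilities/prepwizard 3hfm.pdb test1.pdb")

# These strings would then be passed as bash_command, e.g.
# BashOperator(task_id='get_pdb_1', bash_command=t1_command, dag=dag)

print(t1_command)
```

Would this be the right approach, or is there an Airflow-native way to share files between tasks?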

0 Answers:

There are no answers.