When I execute the following Python script with Cloud Composer, I get *** Task instance did not exist in the DB in the Airflow log for the gcs2bq task.

Code:
import datetime
import os
import csv
import pandas as pd
import pip
from airflow import models
#from airflow.contrib.operators import dataproc_operator
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from airflow.utils import trigger_rule
from airflow.contrib.operators import gcs_to_bq
from airflow.contrib.operators import bigquery_operator
print('''/-------/--------/------/
-------/--------/------/''')
yesterday = datetime.datetime.combine(
    datetime.datetime.today() - datetime.timedelta(1),
    datetime.datetime.min.time())
default_dag_args = {
    # Setting start date as yesterday starts the DAG immediately when it is
    # detected in the Cloud Storage bucket.
    'start_date': yesterday,
    # To email on failure or retry set 'email' arg to your email and enable
    # emailing here.
    'email_on_failure': False,
    'email_on_retry': False,
    # If a task fails, retry it once after waiting at least 5 minutes
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
    'project_id': 'data-rubrics'
    # models.Variable.get('gcp_project')
}
# [START composer_quickstart_schedule]
with models.DAG(
        'composer_agg_quickstart',
        # Continue to run DAG once per day
        schedule_interval=datetime.timedelta(days=1),
        default_args=default_dag_args) as dag:
    # [END composer_quickstart_schedule]
    op_start = BashOperator(task_id='Initializing', bash_command='echo Initialized')
    #op_readwrite = PythonOperator(task_id='ReadAggWriteFile', python_callable=read_data)
    op_load = gcs_to_bq.GoogleCloudStorageToBigQueryOperator(
        task_id='gcs2bq',
        bucket='dr-mockup-data',
        source_objects=['sample.csv'],
        destination_project_dataset_table='data-rubrics.sample_bqtable',
        schema_fields=[{'name': 'a', 'type': 'STRING', 'mode': 'NULLABLE'},
                       {'name': 'b', 'type': 'FLOAT', 'mode': 'NULLABLE'}],
        write_disposition='WRITE_TRUNCATE',
        dag=dag)
    #op_write = PythonOperator(task_id='AggregateAndWriteFile', python_callable=write_data)
    op_start >> op_load
Answer 0 (score: 0):
Update: can you remove dag=dag from the gcs2bq operator and run the DAG again? Since the operator is created inside the with models.DAG block, it is already attached to that DAG, so passing dag=dag is redundant.
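For illustration, a minimal sketch of that change, reusing the values from the question (untested; assumes the same imports and default_dag_args as in the question's script):

with models.DAG(
        'composer_agg_quickstart',
        schedule_interval=datetime.timedelta(days=1),
        default_args=default_dag_args) as dag:
    # Inside the context manager the operator is attached to this DAG
    # automatically, so no explicit dag=dag argument is needed.
    op_load = gcs_to_bq.GoogleCloudStorageToBigQueryOperator(
        task_id='gcs2bq',
        bucket='dr-mockup-data',
        source_objects=['sample.csv'],
        destination_project_dataset_table='data-rubrics.sample_bqtable',
        schema_fields=[{'name': 'a', 'type': 'STRING', 'mode': 'NULLABLE'},
                       {'name': 'b', 'type': 'FLOAT', 'mode': 'NULLABLE'}],
        write_disposition='WRITE_TRUNCATE')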
This could also be because you have a dynamic start date. Your start_date should never be dynamic. Read this FAQ: https://airflow.apache.org/faq.html#what-s-the-deal-with-start-date

"We recommend against using dynamic values as start_date, especially datetime.now(), as it can be quite confusing. The task is triggered once the period closes, and in theory an @hourly DAG would never get to an hour after now, as now() moves along."
Make your start_date static, or use Airflow utils/macros:
import airflow

args = {
    'owner': 'airflow',
    'start_date': airflow.utils.dates.days_ago(2),
}
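The fully static alternative is a fixed date. A minimal sketch (the date below is only a placeholder; any fixed date in the past works):

import datetime

args = {
    'owner': 'airflow',
    # A fixed, non-dynamic start date; the value shown is just an example.
    'start_date': datetime.datetime(2018, 11, 1),
}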
Answer 1 (score: 0):
OK, this was a silly issue on my part, and apologies to everyone who spent time on it. I had already run a DAG, and because of that it kept firing. Also, I was not writing a correct value in destination_project_dataset_table. Thanks to everyone who took the time.
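For reference, GoogleCloudStorageToBigQueryOperator expects destination_project_dataset_table in the form <project>.<dataset>.<table> (the project part is optional), and the value in the question lacks a dataset component. A hypothetical corrected value (sample_dataset is an invented name) might look like:

# Hypothetical fix: include the dataset component between project and table.
destination_project_dataset_table='data-rubrics.sample_dataset.sample_bqtable'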