Getting "*** Task instance did not exist in the DB" error when running gcs_to_bq in Composer

Date: 2018-11-19 10:47:44

Tags: airflow google-cloud-composer

While executing the following Python script with Cloud Composer, I am getting a "*** Task instance did not exist in the DB" error under the gcs2bq task in the Airflow log. Code:

import datetime
import os
import csv
import pandas as pd
import pip
from airflow import models
#from airflow.contrib.operators import dataproc_operator
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from airflow.utils import trigger_rule
from airflow.contrib.operators import gcs_to_bq
from airflow.contrib.operators import bigquery_operator

print('''/-------/--------/------/
-------/--------/------/''')
yesterday = datetime.datetime.combine(
    datetime.datetime.today() - datetime.timedelta(1),
    datetime.datetime.min.time())
default_dag_args = {
    # Setting start date as yesterday starts the DAG immediately when it is
    # detected in the Cloud Storage bucket.
    'start_date': yesterday,
    # To email on failure or retry set 'email' arg to your email and enable
    # emailing here.
    'email_on_failure': False,
    'email_on_retry': False,
    # If a task fails, retry it once after waiting at least 5 minutes
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
    'project_id': 'data-rubrics'
    #models.Variable.get('gcp_project')
}
try:
  # [START composer_quickstart_schedule]
  with models.DAG(
        'composer_agg_quickstart',
        # Continue to run DAG once per day
        schedule_interval=datetime.timedelta(days=1),
        default_args=default_dag_args) as dag:
    # [END composer_quickstart_schedule]
      op_start = BashOperator(task_id='Initializing', bash_command='echo Initialized')
      #op_readwrite = PythonOperator(task_id = 'ReadAggWriteFile', python_callable=read_data)
      op_load = gcs_to_bq.GoogleCloudStorageToBigQueryOperator(
          task_id='gcs2bq',
          bucket='dr-mockup-data',
          source_objects=['sample.csv'],
          destination_project_dataset_table='data-rubrics.sample_bqtable',
          schema_fields=[{'name': 'a', 'type': 'STRING', 'mode': 'NULLABLE'},
                         {'name': 'b', 'type': 'FLOAT', 'mode': 'NULLABLE'}],
          write_disposition='WRITE_TRUNCATE',
          dag=dag)
      #op_write = PythonOperator(task_id = 'AggregateAndWriteFile', python_callable=write_data)
      op_start >> op_load
except Exception as e:
  # Assumed handler: an except (or finally) clause is required for the
  # try block above to parse.
  print(e)

2 Answers:

Answer 0 (score: 0):

Update:

Can you remove dag=dag from the gcs2bq task, since you are already using with models.DAG, and run the DAG again?
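As an illustration, a minimal sketch of that change, with values copied from the question (inside a with models.DAG(...) block the operator picks up the DAG from context, so the explicit dag=dag keyword is redundant):

op_load = gcs_to_bq.GoogleCloudStorageToBigQueryOperator(
    task_id='gcs2bq',
    bucket='dr-mockup-data',
    source_objects=['sample.csv'],
    destination_project_dataset_table='data-rubrics.sample_bqtable',
    schema_fields=[{'name': 'a', 'type': 'STRING', 'mode': 'NULLABLE'},
                   {'name': 'b', 'type': 'FLOAT', 'mode': 'NULLABLE'}],
    # No dag=dag here: the enclosing `with models.DAG(...)` context
    # assigns the DAG automatically.
    write_disposition='WRITE_TRUNCATE')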


This may be because you have a dynamic start date. Your start_date should never be dynamic. Read this FAQ: https://airflow.apache.org/faq.html#what-s-the-deal-with-start-date

We recommend against using dynamic values as start_date, especially datetime.now(), as it can be quite confusing. The task is triggered once the period closes, and in theory an @hourly DAG would never get to an hour after now as now() moves along.

Make your start_date static, or use Airflow utils/macros:

import airflow
args = {
    'owner': 'airflow',
    'start_date': airflow.utils.dates.days_ago(2),
}
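Alternatively, a fixed calendar date also satisfies the "static start_date" requirement; the date below is only an illustrative placeholder:

import datetime

args = {
    'owner': 'airflow',
    # A fixed, static start date; the scheduler backfills runs from here.
    'start_date': datetime.datetime(2018, 11, 1),
}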

Answer 1 (score: 0):

OK, this was a silly issue on my part, and apologies to everyone who spent time here. I had a DAG already up, and because of that it kept firing. Also, I was not writing the correct value in destination_project_dataset_table. Thanks to everyone who took the time.
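For reference, GoogleCloudStorageToBigQueryOperator expects destination_project_dataset_table in the dotted (<project>.)<dataset>.<table> form, so the fix presumably looked something like the sketch below (sample_dataset is a hypothetical dataset name, not taken from the question):

op_load = gcs_to_bq.GoogleCloudStorageToBigQueryOperator(
    task_id='gcs2bq',
    bucket='dr-mockup-data',
    source_objects=['sample.csv'],
    # <project>.<dataset>.<table>; 'sample_dataset' is a made-up placeholder
    destination_project_dataset_table='data-rubrics.sample_dataset.sample_bqtable',
    schema_fields=[{'name': 'a', 'type': 'STRING', 'mode': 'NULLABLE'},
                   {'name': 'b', 'type': 'FLOAT', 'mode': 'NULLABLE'}],
    write_disposition='WRITE_TRUNCATE')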