Question

I need to copy tables from MySQL to BigQuery every day. My workflow is:

MySqlToGoogleCloudStorageOperator
GoogleCloudStorageToBigQueryOperator

This works well for a single table (for example, Categories).

Example:

BQ_TABLE_NAME_CATEGORIES = Variable.get("tables_categories")
...

import_categories_op = MySqlToGoogleCloudStorageOperator(
    task_id='import_categories',
    mysql_conn_id='c_mysql',
    google_cloud_storage_conn_id='gcp_a',
    approx_max_file_size_bytes=100000000,  # 100 MB per file
    sql='import_categories.sql',
    bucket=GCS_BUCKET_ID,
    filename=file_name_categories,
    dag=dag)

gcs_to_bigquery_categories_op = GoogleCloudStorageToBigQueryOperator(
    dag=dag,
    task_id='load_categories_to_BigQuery',
    bucket=GCS_BUCKET_ID,
    destination_project_dataset_table=table_name_template_categories,
    source_format='NEWLINE_DELIMITED_JSON',
    source_objects=[uri_template_categories_read_from],
    schema_fields=Categories(),
    src_fmt_configs={'ignoreUnknownValues': True},
    create_disposition='CREATE_IF_NEEDED',
    write_disposition='WRITE_TRUNCATE',
    skip_leading_rows=1,
    google_cloud_storage_conn_id=CONNECTION_ID,
    bigquery_conn_id=CONNECTION_ID)


import_categories_op >> gcs_to_bigquery_categories_op

Now, say I want to scale this up to 20 or more tables. Is there a way to do that without writing the same code 20 times? I am looking for something like:

BQ_TABLE_NAME_CATEGORIES = Variable.get("tables_categories")
BQ_TABLE_NAME_PRODUCTS = Variable.get("tables_products")
....
BQ_TABLE_NAME_ORDERS = Variable.get("tables_orders")
tables = [BQ_TABLE_NAME_CATEGORIES, BQ_TABLE_NAME_PRODUCTS, BQ_TABLE_NAME_ORDERS]
for item in tables:
    ...  # GENERATE THE OPERATORS PER TABLE

This would create import_categories_op, import_products_op, import_orders_op, and so on.

Answer 1

Yes, and it is exactly what you describe. Just instantiate your operators in a for loop. Make sure your task IDs are unique and you are all set:

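# Import paths as in the Airflow 1.x contrib package; adjust to your version.
from airflow.models import Variable
from airflow.contrib.operators.mysql_to_gcs import MySqlToGoogleCloudStorageOperator
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator

BQ_TABLE_NAME_CATEGORIES = Variable.get("tables_categories")
BQ_TABLE_NAME_PRODUCTS = Variable.get("tables_products")

# Named `tables` rather than `list` so the built-in is not shadowed.
tables = [BQ_TABLE_NAME_CATEGORIES, BQ_TABLE_NAME_PRODUCTS]

for table in tables:
    # file_name, table_name_template, uri_template_read_from and the schema
    # are carried over from the question; they need to be set per table.
    import_op = MySqlToGoogleCloudStorageOperator(
        task_id=f'import_{table}',  # must be unique within the DAG
        mysql_conn_id='c_mysql',
        google_cloud_storage_conn_id='gcp_a',
        approx_max_file_size_bytes=100000000,  # 100 MB per file
        sql=f'import_{table}.sql',
        bucket=GCS_BUCKET_ID,
        filename=file_name,
        dag=dag)
    gcs_to_bigquery_op = GoogleCloudStorageToBigQueryOperator(
        dag=dag,
        task_id=f'load_{table}_to_BigQuery',
        bucket=GCS_BUCKET_ID,
        destination_project_dataset_table=table_name_template,
        source_format='NEWLINE_DELIMITED_JSON',
        source_objects=[uri_template_read_from],
        schema_fields=Categories(),
        src_fmt_configs={'ignoreUnknownValues': True},
        create_disposition='CREATE_IF_NEEDED',
        write_disposition='WRITE_TRUNCATE',
        skip_leading_rows=1,
        google_cloud_storage_conn_id=CONNECTION_ID,
        bigquery_conn_id=CONNECTION_ID)

    # The import task runs before the load task for the same table.
    import_op >> gcs_to_bigquery_op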

If you store all the tables in a single variable, you can simplify this:

# bq_tables = "table_products,table_orders"
BQ_TABLES = Variable.get("bq_tables").split(',')

for table in BQ_TABLES:
    ...
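
Since every task ID in a DAG has to be unique, it may help to normalize the variable's value before looping over it. The following is a minimal sketch in plain Python, assuming the variable holds a comma-separated string as above. The helper name parse_tables and the sample value are made up for illustration and are not part of Airflow:

```python
def parse_tables(raw):
    """Split a comma-separated string into unique, non-empty table names.

    Order is preserved; surrounding whitespace and repeated names are dropped.
    """
    seen = set()
    tables = []
    for name in raw.split(','):
        name = name.strip()
        if name and name not in seen:
            seen.add(name)
            tables.append(name)
    return tables


# Hypothetical value; in the DAG file it would come from Variable.get("bq_tables").
raw_value = "table_products, table_orders,table_products,"

for table in parse_tables(raw_value):
    # The same strings the loop above would use for task_id and sql.
    print(f"import_{table}", f"import_{table}.sql", f"load_{table}_to_BigQuery")
```

With the names cleaned up this way, a stray space or a repeated entry in the variable should not lead to two tasks with the same ID.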

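Edit: task references vs. IDs

Luis asked about only needing to change the task ID (and not the reference to the task). In fact, you do not need a reference to any task at all, other than to add details to it after creation (such as upstream and downstream dependencies). Tasks are stored in the DAG object when they are created, and the DAG object is what the DAG parser looks for. Once the DAG parser finds a DAG object in the global scope, it uses it. It does not know what names the tasks have in the global scope; it only knows that the tasks are listed on the DAG object and that they list each other as upstream or downstream.

I would have commented on this answer, but I wanted to show the following code to explain more clearly what I mean. Here I use with DAG to avoid assigning each task to the dag, the bitshift operators for upstream/downstream assignment so that the tasks never need to be referred to by name, and Python 3 f-strings: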

# bq_tables = "table_products,table_orders"
BQ_TABLES = Variable.get("bq_tables").split(',')

with DAG('…dag_id…', …) as dag:
    for table in BQ_TABLES:
        MySqlToGoogleCloudStorageOperator(
            task_id=f'import_{table}',
            sql=f'import_{table}.sql',
            …  # all params except notably there's no `dag=dag` in here.
        ) >> GoogleCloudStorageToBigQueryOperator(  # Yup, …
            task_id=f'load_{table}_to_BigQuery',
            …  # again all but `dag=dag` in here.
        )

Of course you can write t1=…; t2=…; t1>>t2; …, but why refer to them by name?
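
The point that tasks live on the DAG object rather than in named variables can be mimicked with a toy model. This is only a sketch in plain Python and not how Airflow is implemented internally; ToyDag and ToyTask are hypothetical stand-ins that copy just two behaviours, registration inside a with block and chaining with >>:

```python
class ToyDag:
    """Stand-in for a DAG: it just remembers the tasks created inside it."""

    _current = None  # the DAG whose `with` block is active, if any

    def __init__(self, dag_id):
        self.dag_id = dag_id
        self.tasks = {}

    def __enter__(self):
        ToyDag._current = self
        return self

    def __exit__(self, exc_type, exc, tb):
        ToyDag._current = None
        return False


class ToyTask:
    """Stand-in for an operator: registers itself on the active DAG."""

    def __init__(self, task_id):
        dag = ToyDag._current
        if dag is None:
            raise RuntimeError("create tasks inside a `with ToyDag(...)` block")
        if task_id in dag.tasks:
            raise ValueError(f"duplicate task_id: {task_id}")
        self.task_id = task_id
        self.downstream = []
        dag.tasks[task_id] = self

    def __rshift__(self, other):
        # a >> b records b as downstream of a and returns b so chains work.
        self.downstream.append(other.task_id)
        return other


with ToyDag("toy_copy") as dag:
    for table in ["table_products", "table_orders"]:
        # No variable ever holds these tasks; the DAG still knows them.
        ToyTask(f"import_{table}") >> ToyTask(f"load_{table}_to_BigQuery")

for task_id, task in dag.tasks.items():
    print(task_id, "->", task.downstream)
```

In this toy model, reusing a task ID raises an error, which is the reason the loop only has to vary the ID and never the Python name.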

How do I create operators from a list in Airflow?
