How to automate a BigQuery SQL pipeline

Asked: 2020-04-21 13:59:06

Tags: google-cloud-platform google-bigquery google-cloud-dataflow

I have built a data pipeline using BigQuery SQL. It first imports CSV files from Cloud Storage and then performs different analyses, including predictive modelling with BigQuery ML, geospatial computations with geography functions, and KPI calculations with analytic functions.
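
For reference, the CSV import step looks roughly like this; this is just a minimal sketch with the BigQuery Python client, and the bucket, dataset, and table names are placeholders:

    from google.cloud import bigquery

    client = bigquery.Client(project="<your_project_id>")

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # skip the CSV header row
        autodetect=True,      # let BigQuery infer the schema
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )

    load_job = client.load_table_from_uri(
        "gs://<your_bucket>/input.csv",            # placeholder GCS path
        "<your_project>.<your_dataset>.raw_data",  # placeholder destination table
        job_config=job_config,
    )
    load_job.result()  # wait for the load job to finish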

I was able to run the different queries manually with success, and now I would like to automate the data pipeline.

My first choice was Dataflow SQL, but it turns out that the Dataflow SQL query syntax does not support geography functions.

Dataflow in Python is less of an option, because the complete analysis is done in SQL and I would like to keep it that way.

My question is: what other GCP options are available to automate this data pipeline?

2 Answers:

Answer 0: (score: 1)

As I mentioned in the comments, if you need to orchestrate your queries you can use Cloud Composer, a fully managed Airflow cluster.

I created the code below to show you, more or less, how to orchestrate your queries with this tool. Please note that this is basic code and could be improved in terms of coding standards. The code basically orchestrates 3 queries:

  1. The first one reads from a public table and writes the result to another table in your project.
  2. The second one reads the table created by the first query and selects the 10000 newest rows based on a date column. Afterwards, it saves the result to a table in your project.
  3. The third one reads the table created in step 2 and calculates some aggregations. Afterwards, it saves the results to another table in your project.

    import datetime
    from airflow import models
    from airflow.contrib.operators import bigquery_operator
    
    """The configuration presented below will run your DAG every five minutes, as specified in the
    schedule_interval property, starting from the datetime specified in the start_date property"""
    
    default_dag_args = {
        'start_date': datetime.datetime(2020, 4, 22, 15, 40), 
        'email_on_failure': False,
        'email_on_retry': False,
        'retries': 1,
        'retry_delay': datetime.timedelta(minutes=1),
        'project_id': "<your_project_id>",
    }
    
    with models.DAG(
            'composer_airflow_bigquery_orchestration',
            schedule_interval = "*/5 * * * *",
            default_args=default_dag_args) as dag:
    
        run_first_query = bigquery_operator.BigQueryOperator(
            sql = "SELECT * FROM `bigquery-public-data.catalonian_mobile_coverage.mobile_data_2015_2017`",
            destination_dataset_table = "<your_project>.<your_dataset>.orchestration_1",
            task_id = 'xxxxxxxx',
            write_disposition = "WRITE_TRUNCATE",
            #create_disposition = "",
            allow_large_results = True,
            use_legacy_sql = False
        )
    
        run_second_query = bigquery_operator.BigQueryOperator(
            sql = "SELECT * FROM `<your_project>.<your_dataset>.orchestration_1` ORDER BY date DESC LIMIT 10000",
            destination_dataset_table = "<your_project>.<your_dataset>.orchestration_2",
            task_id = 'yyyyyyyy',
            write_disposition = "WRITE_TRUNCATE",
            #create_disposition = "",
            allow_large_results = True,
            use_legacy_sql = False
        )
    
        run_third_query = bigquery_operator.BigQueryOperator(
            sql = "SELECT round(lat) r_lat, round(long) r_long, count(1) total FROM `<your_project>.<your_dataset>.orchestration_2` GROUP BY r_lat, r_long",
            destination_dataset_table = "<your_project>.<your_dataset>.orchestration_3",
            task_id = 'zzzzzzzz',
            write_disposition = "WRITE_TRUNCATE",
            #create_disposition = "",
            allow_large_results = True,
            use_legacy_sql = False
        )
    
    
        # Define DAG dependencies.
        run_first_query >> run_second_query >> run_third_query
    

Step by step:

  • First, it imports some Airflow libraries, such as models and bigquery_operator

    from airflow import models
    from airflow.contrib.operators import bigquery_operator
    
  • Then it defines a dictionary named default_dag_args, which will be used when creating the DAG.

    default_dag_args = {
        'start_date': datetime.datetime(2020, 4, 22, 15, 40), 
        'email_on_failure': False,
        'email_on_retry': False,
        'retries': 1,
        'retry_delay': datetime.timedelta(minutes=1),
        'project_id': "<your_project_id>",
    }
    
  • When creating the DAG, the default_dag_args dictionary is passed as the default arguments, and the schedule_interval parameter is added to define when the DAG should run. You can use this parameter with some preset expressions or with CRON expressions, as you can see here

    with models.DAG(
            'composer_airflow_bigquery_orchestration',
            schedule_interval = "*/5 * * * *",
            default_args=default_dag_args) as dag:
    
  • After that, the operator instances are created. In this case only the BigQueryOperator is used

        run_first_query = bigquery_operator.BigQueryOperator(
            sql = "SELECT * FROM `bigquery-public-data.catalonian_mobile_coverage.mobile_data_2015_2017`",
            destination_dataset_table = "<your_project>.<your_dataset>.orchestration_1",
            task_id = 'xxxxxxxx',
            write_disposition = "WRITE_TRUNCATE",
            #create_disposition = "",
            allow_large_results = True,
            use_legacy_sql = False
        )
    
        run_second_query = bigquery_operator.BigQueryOperator(
            sql = "SELECT * FROM `<your_project>.<your_dataset>.orchestration_1` ORDER BY date DESC LIMIT 10000",
            destination_dataset_table = "<your_project>.<your_dataset>.orchestration_2",
            task_id = 'yyyyyyyy',
            write_disposition = "WRITE_TRUNCATE",
            #create_disposition = "",
            allow_large_results = True,
            use_legacy_sql = False
        )
    
        run_third_query = bigquery_operator.BigQueryOperator(
            sql = "SELECT round(lat) r_lat, round(long) r_long, count(1) total FROM `<your_project>.<your_dataset>.orchestration_2` GROUP BY r_lat, r_long",
            destination_dataset_table = "<your_project>.<your_dataset>.orchestration_3",
            task_id = 'zzzzzzzz',
            write_disposition = "WRITE_TRUNCATE",
            #create_disposition = "",
            allow_large_results = True,
            use_legacy_sql = False
        )
    
  • As the last step, the dependencies of the DAG are defined. This line means that the run_second_query task depends on the completion of run_first_query, and so on. A hypothetical fan-out variant is sketched right after this snippet.

        run_first_query >> run_second_query >> run_third_query
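
If some of the queries did not depend on each other, the bitshift syntax also accepts lists, which lets independent tasks run in parallel. Purely as a hypothetical sketch, reusing the task names from above (and only valid if the tasks do not read each other's output, unlike the example in this answer):

        # hypothetical fan-out: both tasks wait for run_first_query, then run in parallel
        run_first_query >> [run_second_query, run_third_query]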
    

Finally, I would like to add this article, which discusses how to correctly set the start_date and schedule_interval when using CRON expressions.
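
To make that relationship concrete, here is a small sketch (my own illustration, not part of the DAG above) that keeps the fixed start_date from default_dag_args, switches to a daily CRON expression, and disables catch-up so Airflow does not backfill every interval between start_date and now:

    with models.DAG(
            'composer_airflow_bigquery_orchestration',
            # a preset such as '@daily' would also work here instead of a CRON string
            schedule_interval="0 6 * * *",  # run once a day at 06:00 UTC
            catchup=False,                  # do not backfill missed runs since start_date
            default_args=default_dag_args) as dag:
        ...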

Answer 1: (score: 0)

BigQuery has a built-in scheduling mechanism that is currently a beta feature.

To automate a BQ-native SQL pipeline, you can use this utility. Using the CLI:

    $ bq query \
    --use_legacy_sql=false \
    --destination_table=mydataset.mytable \
    --display_name='My Scheduled Query' \
    --replace=true \
    'SELECT
    1
    FROM
    mydataset.test'
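
If you would rather create the scheduled query from Python instead of the CLI, the BigQuery Data Transfer Service client can do the same thing. The following is only a sketch based on the documented client usage; the project, dataset, and table names are placeholders, the recurrence is an example, and the exact API surface may differ between versions of google-cloud-bigquery-datatransfer:

    from google.cloud import bigquery_datatransfer

    transfer_client = bigquery_datatransfer.DataTransferServiceClient()

    transfer_config = bigquery_datatransfer.TransferConfig(
        destination_dataset_id="mydataset",
        display_name="My Scheduled Query",
        data_source_id="scheduled_query",  # identifies a scheduled-query transfer
        params={
            "query": "SELECT 1 FROM mydataset.test",
            "destination_table_name_template": "mytable",
            "write_disposition": "WRITE_TRUNCATE",
        },
        schedule="every 24 hours",  # example recurrence
    )

    transfer_config = transfer_client.create_transfer_config(
        parent=transfer_client.common_project_path("<your_project_id>"),
        transfer_config=transfer_config,
    )
    print("Created scheduled query:", transfer_config.name)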