I built a data pipeline with BigQuery SQL. It first imports CSV files from Cloud Storage and then runs different analyses, including predictive modelling with BigQuery ML, geospatial calculations with the geography functions, and KPI calculations with analytic functions.
I am able to run the different queries successfully by hand, and now I want to automate the data pipeline.
My first choice was Dataflow SQL, but it turns out that the Dataflow SQL query syntax does not support the geography functions.
Dataflow Python is less of an option, because the complete analysis is done in SQL and I would like to keep it that way.
My question is: what other GCP options are available for automating the data pipeline?
Answer 0 (score: 1)
As I mentioned in the comments, if you need to orchestrate your queries you can use Cloud Composer, a fully managed Airflow cluster.
I created the following code to show you, more or less, how to orchestrate your queries with this tool. Please note that this is basic code and could be improved in terms of coding standards. The code basically orchestrates 3 queries:
The first one reads a public table and writes the result to a table in your project.
The second one reads the table created in the first step, orders it by date, keeps 10,000 rows, and saves the result to another table in your project.
The third one reads the table created in step 2, calculates some aggregations, and saves the result to yet another table in your project.
import datetime

from airflow import models
from airflow.contrib.operators import bigquery_operator

"""The configuration presented below will run your DAG every five minutes, as specified in the
schedule_interval property, starting from the datetime specified in the start_date property"""
default_dag_args = {
    'start_date': datetime.datetime(2020, 4, 22, 15, 40),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=1),
    'project_id': "<your_project_id>",
}

with models.DAG(
        'composer_airflow_bigquery_orchestration',
        schedule_interval = "*/5 * * * *",
        default_args=default_dag_args) as dag:

    # First query: read a public table and write the result into your project.
    run_first_query = bigquery_operator.BigQueryOperator(
        sql = "SELECT * FROM `bigquery-public-data.catalonian_mobile_coverage.mobile_data_2015_2017`",
        destination_dataset_table = "<your_project>.<your_dataset>.orchestration_1",
        task_id = 'xxxxxxxx',
        write_disposition = "WRITE_TRUNCATE",
        #create_disposition = "",
        allow_large_results = True,
        use_legacy_sql = False
    )

    # Second query: read the table created above, order by date and keep 10,000 rows.
    run_second_query = bigquery_operator.BigQueryOperator(
        sql = "SELECT * FROM `<your_project>.<your_dataset>.orchestration_1` ORDER BY date LIMIT 10000",
        destination_dataset_table = "<your_project>.<your_dataset>.orchestration_2",
        task_id = 'yyyyyyyy',
        write_disposition = "WRITE_TRUNCATE",
        #create_disposition = "",
        allow_large_results = True,
        use_legacy_sql = False
    )

    # Third query: aggregate the rounded coordinates of the table created above.
    run_third_query = bigquery_operator.BigQueryOperator(
        sql = "SELECT round(lat) r_lat, round(long) r_long, count(1) total FROM `<your_project>.<your_dataset>.orchestration_2` GROUP BY r_lat, r_long",
        destination_dataset_table = "<your_project>.<your_dataset>.orchestration_3",
        task_id = 'zzzzzzzz',
        write_disposition = "WRITE_TRUNCATE",
        #create_disposition = "",
        allow_large_results = True,
        use_legacy_sql = False
    )

    # Define DAG dependencies.
    run_first_query >> run_second_query >> run_third_query
Going step by step:
First, it imports some Airflow libraries such as models and bigquery_operator:
from airflow import models
from airflow.contrib.operators import bigquery_operator
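As a side note, on newer Airflow 2.x environments (recent Cloud Composer images) this operator no longer lives under airflow.contrib. The line below is a sketch of the equivalent import, assuming the apache-airflow-providers-google package is installed:

# Assumption: Airflow 2.x with the Google provider package installed.
# BigQueryExecuteQueryOperator is the successor of the contrib BigQueryOperator.
from airflow.providers.google.cloud.operators.bigquery import BigQueryExecuteQueryOperator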
Then, it defines a dictionary named default_dag_args that will be used when the DAG is created:
default_dag_args = {
    'start_date': datetime.datetime(2020, 4, 22, 15, 40),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=1),
    'project_id': "<your_project_id>",
}
When creating the DAG, the default_dag_args dictionary is passed as the default arguments, and the schedule_interval parameter is added to define when the DAG should run. You can set this parameter with some preset expressions or with CRON expressions, as you can see here:
with models.DAG(
        'composer_airflow_bigquery_orchestration',
        schedule_interval = "*/5 * * * *",
        default_args=default_dag_args) as dag:
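Besides CRON strings, schedule_interval also accepts Airflow preset expressions and datetime.timedelta objects. The sketch below, with a hypothetical DAG name, shows a daily schedule as an example:

with models.DAG(
        'composer_airflow_bigquery_orchestration_daily',      # hypothetical DAG name
        schedule_interval = "@daily",                          # preset equivalent to "0 0 * * *"
        # schedule_interval = datetime.timedelta(hours=6),     # a timedelta also works
        default_args=default_dag_args) as dag:
    ...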
After that, you can create the instances of the operators. In this case we only use BigQueryOperator:
run_first_query = bigquery_operator.BigQueryOperator(
    sql = "SELECT * FROM `bigquery-public-data.catalonian_mobile_coverage.mobile_data_2015_2017`",
    destination_dataset_table = "<your_project>.<your_dataset>.orchestration_1",
    task_id = 'xxxxxxxx',
    write_disposition = "WRITE_TRUNCATE",
    #create_disposition = "",
    allow_large_results = True,
    use_legacy_sql = False
)

run_second_query = bigquery_operator.BigQueryOperator(
    sql = "SELECT * FROM `<your_project>.<your_dataset>.orchestration_1` ORDER BY date LIMIT 10000",
    destination_dataset_table = "<your_project>.<your_dataset>.orchestration_2",
    task_id = 'yyyyyyyy',
    write_disposition = "WRITE_TRUNCATE",
    #create_disposition = "",
    allow_large_results = True,
    use_legacy_sql = False
)

run_third_query = bigquery_operator.BigQueryOperator(
    sql = "SELECT round(lat) r_lat, round(long) r_long, count(1) total FROM `<your_project>.<your_dataset>.orchestration_2` GROUP BY r_lat, r_long",
    destination_dataset_table = "<your_project>.<your_dataset>.orchestration_3",
    task_id = 'zzzzzzzz',
    write_disposition = "WRITE_TRUNCATE",
    #create_disposition = "",
    allow_large_results = True,
    use_legacy_sql = False
)
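It is also worth noting that the sql argument of BigQueryOperator is a templated field, so Airflow Jinja macros can be injected into the query. The sketch below uses the {{ ds }} macro (the run's logical date) with a hypothetical task id and destination table, and assumes the source table's date column can be compared against a YYYY-MM-DD string:

run_daily_slice = bigquery_operator.BigQueryOperator(
    # {{ ds }} is rendered by Airflow as the execution date in YYYY-MM-DD format.
    sql = "SELECT * FROM `<your_project>.<your_dataset>.orchestration_1` WHERE date = '{{ ds }}'",
    destination_dataset_table = "<your_project>.<your_dataset>.orchestration_daily",  # hypothetical table
    task_id = 'daily_slice',
    write_disposition = "WRITE_TRUNCATE",
    use_legacy_sql = False
)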
As a last step, we define the dependencies for the DAG. This line means that the run_second_query task depends on the completion of run_first_query, and so on:
run_first_query >> run_second_query >> run_third_query
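The same dependencies can also be expressed with set_downstream calls, and a list on the right-hand side creates a fan-out, as in this short sketch:

# Equivalent to run_first_query >> run_second_query >> run_third_query
run_first_query.set_downstream(run_second_query)
run_second_query.set_downstream(run_third_query)

# Fan-out: one upstream task feeding several downstream tasks.
# run_first_query >> [run_second_query, run_third_query]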
Finally, I would like to add this article, which discusses how to set the start_date and schedule_interval correctly when using CRON expressions.
Answer 1 (score: 0)
BigQuery has a built-in scheduling mechanism that is currently a beta feature.
To automate a BQ native SQL pipeline, you can use this utility. From the CLI:
$ bq query \
--use_legacy_sql=false \
--destination_table=mydataset.mytable \
--display_name='My Scheduled Query' \
--replace=true \
'SELECT
1
FROM
mydataset.test'
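For completeness, a scheduled query can also be created programmatically. Below is a rough sketch using the google-cloud-bigquery-datatransfer Python client; the project ID, dataset, schedule, and query text are placeholders, and the exact client API may differ between library versions, so treat this as an assumption to verify against the official docs:

from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()

# Placeholders: replace with your own project, dataset, table, and query.
transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="mydataset",
    display_name="My Scheduled Query",
    data_source_id="scheduled_query",
    params={
        "query": "SELECT 1 FROM mydataset.test",
        "destination_table_name_template": "mytable",
        "write_disposition": "WRITE_TRUNCATE",
    },
    schedule="every 24 hours",
)

transfer_config = transfer_client.create_transfer_config(
    parent=transfer_client.common_project_path("my-project-id"),
    transfer_config=transfer_config,
)
print("Created scheduled query: {}".format(transfer_config.name))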