Pub/Sub to BigQuery (batch) with Dataflow (Python)

Date: 2021-07-28 07:37:20

Tags: python-3.x google-bigquery google-cloud-dataflow apache-beam google-cloud-pubsub

I have created a streaming Dataflow pipeline in Python and just want to check whether the code below behaves the way I expect. This is what I intend to do:

  1. Consume continuously from Pub/Sub
  2. Batch-load into BigQuery every 1 minute, instead of using streaming inserts, to keep costs down

Here is the code snippet in Python:

import json
from datetime import datetime

import apache_beam as beam
from apache_beam import Map
from apache_beam.io.gcp.pubsub import ReadFromPubSub
from apache_beam.io.gcp.bigquery import WriteToBigQuery, BigQueryDisposition
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.runners import DataflowRunner

options = PipelineOptions(
    subnetwork=SUBNETWORK,
    service_account_email=SERVICE_ACCOUNT_EMAIL,
    use_public_ips=False,
    streaming=True,
    project=project,
    region=REGION,
    staging_location=STAGING_LOCATION,
    temp_location=TEMP_LOCATION,
    job_name=f"pub-sub-to-big-query-xxx-{datetime.now().strftime('%Y%m%d-%H%M%S')}"
)

p = beam.Pipeline(DataflowRunner(), options=options)


pubsub = (
        p
        | "Read Topic" >> ReadFromPubSub(topic=INPUT_TOPIC)
        | "To Dict" >> Map(json.loads)
        | "Write To BigQuery" >> WriteToBigQuery(table=TABLE, schema=schema, method='FILE_LOADS',
                                                 triggering_frequency=60, max_files_per_bundle=1,
                                                 create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
                                                 write_disposition=BigQueryDisposition.WRITE_APPEND))

Could someone confirm whether the code above does what I intend: stream from Pub/Sub, and every 60 seconds batch-load into BigQuery? I deliberately set max_files_per_bundle to 1 to prevent more than one shard from being created, so that only one file is loaded per minute, but I'm not sure I've done this correctly. The Java SDK has a withNumFileShards option, but I couldn't find an equivalent in Python. I referred to the following documentation: https://beam.apache.org/releases/pydoc/2.31.0/apache_beam.io.gcp.bigquery.html#apache_beam.io.gcp.bigquery.WriteToBigQuery

https://cloud.google.com/blog/products/data-analytics/how-to-efficiently-process-both-real-time-and-aggregate-data-with-dataflow
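One way I thought of to sanity-check how many files get staged per trigger is to list the objects FILE_LOADS writes under the pipeline's temp location just before each load job runs. A rough sketch (the bucket name and prefix here are placeholders; the actual layout depends on temp_location and Beam internals):

from google.cloud import storage

# List files staged for BigQuery load jobs under the pipeline temp location.
# "my-temp-bucket" and the prefix are placeholders for this sketch.
storage_client = storage.Client(project=project)
for blob in storage_client.list_blobs("my-temp-bucket", prefix="tmp/"):
    print(blob.name, blob.size, blob.time_created)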

I'm also wondering whether I should use windowing instead to achieve what I intend, like this:

from apache_beam.transforms import window
from apache_beam.transforms.trigger import AfterProcessingTime, AccumulationMode

options = PipelineOptions(
    subnetwork=SUBNETWORK,
    service_account_email=SERVICE_ACCOUNT_EMAIL,
    use_public_ips=False,
    streaming=True,
    project=project,
    region=REGION,
    staging_location=STAGING_LOCATION,
    temp_location=TEMP_LOCATION,
    job_name=f"pub-sub-to-big-query-xxx-{datetime.now().strftime('%Y%m%d-%H%M%S')}"
)

p = beam.Pipeline(DataflowRunner(), options=options)

pubsub = (
        p
        | "Read Topic" >> ReadFromPubSub(topic=INPUT_TOPIC)
        | "To Dict" >> Map(json.loads)
        # Fixed 60-second windows with a processing-time trigger, discarding fired panes
        | 'Window' >> beam.WindowInto(window.FixedWindows(60), trigger=AfterProcessingTime(60),
                                      accumulation_mode=AccumulationMode.DISCARDING)
        | "Write To BigQuery" >> WriteToBigQuery(table=TABLE, schema=schema, method='FILE_LOADS',
                                                 triggering_frequency=60, max_files_per_bundle=1,
                                                 create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
                                                 write_disposition=BigQueryDisposition.WRITE_APPEND))

Is the first approach, without the windowing from the second approach, good enough? I'm using the first approach now, but I'm not sure whether, every minute, it performs multiple loads from multiple files, or whether it actually combines all the Pub/Sub messages into one and does a single batch load.
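Another check I could do is count how many BigQuery load jobs are actually created per minute while the pipeline runs. A minimal sketch with the BigQuery client library (the 10-minute lookback is arbitrary):

from datetime import datetime, timedelta, timezone
from google.cloud import bigquery

# Count recent load jobs to see whether roughly one load job fires per minute.
bq_client = bigquery.Client(project=project)
since = datetime.now(timezone.utc) - timedelta(minutes=10)

load_jobs = [
    job for job in bq_client.list_jobs(min_creation_time=since, all_users=True)
    if job.job_type == "load"
]
for job in load_jobs:
    print(job.job_id, job.created, job.state)
print(f"{len(load_jobs)} load jobs in the last 10 minutes")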

Thanks!

0 answers:

No answers yet