Question

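I have a Dataflow pipeline that reads event data from a PubSub topic. When a message arrives, I run a transform step to fit the event data to my desired BigQuery schema. However, if the input I create does not fit the schema, I run into problems. Apparently, the write to BigQuery is retried indefinitely: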

Count: 76   RuntimeError: Could not successfully insert rows to BigQuery table

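At the moment I do a lot of manual checking that the input really fits the schema, but in cases I have not accounted for, RuntimeErrors pile up. Is there a way to try writing to BigQuery and, if that fails, do something else with the raw input? Alternatively, is there a way to attempt the write a limited number of times and otherwise fail silently, without adding new RuntimeErrors?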

Edit: I am using the Python SDK. Here is my simplified pipeline for further clarification:

import apache_beam as beam

with beam.Pipeline(options=options) as pipeline:
    # Read messages from PubSub
    event = (pipeline
             | 'Read from PubSub' >> beam.io.gcp.pubsub.ReadStringsFromPubSub(topic))

    output = (event
              | 'Create output' >> beam.transforms.core.FlatMap(lambda event: [{'input': event}]))

    # Write to Big Query
    _ = (output
         | 'Write log to BigQuery' >> beam.io.gcp.bigquery.WriteToBigQuery(
             table=table,
             dataset=dataset,
             project=project,
             schema=schema,
             create_disposition=beam.io.gcp.bigquery.BigQueryDisposition.CREATE_NEVER,
             write_disposition=beam.io.gcp.bigquery.BigQueryDisposition.WRITE_APPEND))

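If my table has no column 'input', the job dies. After looking at https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L1279, this appears to be the cause of the behavior. By customizing https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L1187 so that it does not raise a RuntimeError, I could get around my problem, but that feels very cumbersome. Can anyone suggest a simpler approach?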

Answer 1

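If you wrote the pipeline yourself, you should be able to set the failed-insert retry policy on BigQueryIO to InsertRetryPolicy.neverRetry. Note that this is the Java SDK API, where the setter is BigQueryIO.Write's withFailedInsertRetryPolicy(InsertRetryPolicy.neverRetry()).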
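Since the question uses the Python SDK, the idea behind a "never retry" or "retry N times" policy can be sketched in plain Python. This is only an illustration of the policy, not Beam's implementation: `insert_with_retry_limit` and `fake_insert` are hypothetical names, and `fake_insert` merely stands in for a BigQuery insert that rejects unknown columns. Newer releases of the Python SDK appear to offer a comparable `insert_retry_strategy` argument on `WriteToBigQuery`; check the documentation for the version you run.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]


def insert_with_retry_limit(
    rows: List[Row],
    insert_fn: Callable[[Row], None],
    max_attempts: int = 1,
) -> Tuple[List[Row], List[Row]]:
    """Try each row at most `max_attempts` times and never raise on failure.

    Returns (inserted_rows, failed_rows) so the caller decides what to do
    with the rows that could not be written.
    """
    inserted: List[Row] = []
    failed: List[Row] = []
    for row in rows:
        for _attempt in range(max_attempts):
            try:
                insert_fn(row)
            except Exception:
                continue  # try again until the attempts are exhausted
            inserted.append(row)
            break
        else:
            # Every attempt failed: keep the row instead of raising.
            failed.append(row)
    return inserted, failed


def fake_insert(row: Row) -> None:
    """Hypothetical stand-in for a BigQuery insert; rejects unknown columns."""
    if set(row) - {"event"}:
        raise ValueError("no such field")


ok, bad = insert_with_retry_limit(
    [{"event": "a"}, {"input": "b"}], fake_insert, max_attempts=3
)
```

With `max_attempts=1` this behaves like a "never retry" policy: a row that fails once goes straight to the failed list, where it can be logged or written elsewhere.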

Answer 2

The Beam Python SDK's support for streaming is very limited.

https://beam.apache.org/documentation/sdks/python-streaming/

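Python streaming pipeline execution is experimentally available (with some limitations) starting with Beam SDK version 2.5.0.

Python streaming execution does not currently support the following features.

General Beam features: These unsupported Beam features apply to all runners.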

State and Timers APIs
Custom source API
Splittable DoFn API
Handling of late data
User-defined custom WindowFn

DataflowRunner specific features: Additionally, DataflowRunner does not currently support the following Cloud Dataflow specific features with Python streaming execution.

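Streaming autoscaling
Updating existing pipelines
Cloud Dataflow Templates
Some monitoring features, such as msec counters, display data, metrics, and element counts for transforms. However, logging, watermarks, and element counts for sources are supported.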

More information is available here: https://beam.apache.org/documentation/sdks/python-streaming/#unsupported-features

Also take a look at the following release notes in the Dataflow documentation:

Answer 3

Something that may help you (when using the direct runner) is to pull the ['FailedRows'] output from the step where you insert:

final_to_bq = (data
               | 'Write to BQ' >> beam.io.WriteToBigQuery( ... ))

Then:

print_failed_rows = (final_to_bq['FailedRows']
                     | 'print failed' >> beam.ParDo(Printer()))

This works well with the DirectRunner, but it does not work with the DataflowRunner yet.
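Because the 'FailedRows' output is reportedly not usable on the DataflowRunner yet, one runner-independent alternative is to check rows against the table's columns before the write and route the non-conforming ones elsewhere. The following is a minimal sketch of that check in plain Python. `split_by_schema` and the `allowed_fields` set are hypothetical names introduced for illustration, and the check only looks at column names, not types.

```python
from typing import Dict, Iterable, List, Set, Tuple

Row = Dict[str, object]


def split_by_schema(
    rows: Iterable[Row], allowed_fields: Set[str]
) -> Tuple[List[Row], List[dict]]:
    """Separate rows that fit the table's columns from rows that do not."""
    valid: List[Row] = []
    rejected: List[dict] = []
    for row in rows:
        unknown = set(row) - allowed_fields
        if unknown:
            # Keep the raw row and record why it was rejected.
            rejected.append({"raw": row, "unknown_fields": sorted(unknown)})
        else:
            valid.append(row)
    return valid, rejected


valid, rejected = split_by_schema(
    [{"event": "click"}, {"input": "oops"}], allowed_fields={"event"}
)
```

In a Beam pipeline the same check could live in a DoFn that emits rejected rows through `beam.pvalue.TaggedOutput`, so that only the main output reaches `WriteToBigQuery` and the rejected rows go to a separate sink.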

Catching failures when writing to BigQuery in a Dataflow pipeline
