Question

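Apache Beam 2.1.0 had a bug in template pipelines that read from BigQuery, which meant they could only be executed once. More details: https://issues.apache.org/jira/browse/BEAM-2058

This was fixed in the Beam 2.2.0 release. You can now read from BigQuery with the withTemplateCompatibility option, and your template pipeline can be run multiple times.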

  pipeline
    .apply("Read rows from table.",
           BigQueryIO.readTableRows()
                     .withTemplateCompatibility()
                     .from("<your-table>")
                     .withoutValidation());

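This implementation seems to come with a huge performance cost for the BigQueryIO read operation. Batch pipelines that used to run in 8-11 minutes now take 45-50 minutes to complete. The only difference between the two versions of the pipeline is .withTemplateCompatibility().

I am trying to understand the reason for this large drop in performance and whether there is any way to improve it.

Thanks.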

Solution: based on jkff's input.

  pipeline
    .apply("Read rows from table.",
           BigQueryIO.readTableRows()
                     .withTemplateCompatibility()
                     .from("<your-table>")
                     .withoutValidation())
    // Redistribute the rows so downstream steps are not tied to the read's initial splits.
    .apply("Reshuffle", Reshuffle.viaRandomKey());

Answer 1

I suspect this is because withTemplateCompatibility comes at the cost of disabling dynamic rebalancing for this read step.

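I would expect it to have a significant impact only if you are reading a small or moderate amount of data but performing very heavy processing on it. In that case, try adding a Reshuffle.viaRandomKey() right after your BigQueryIO.read(). It will materialize a temporary copy of the data, but it will parallelize the downstream processing much better.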
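To see why a reshuffle can help, here is a rough toy model in plain Java. It is not Beam code and it does not reflect how the Dataflow scheduler actually works. It only assumes that, without rebalancing, each read split is processed end to end by a single worker, while after a random-key reshuffle the elements are spread across all workers. The class name, the shard sizes, and the cost per element are all made up for illustration.

```java
import java.util.Random;

// Toy model (an assumption for illustration, not Beam or Dataflow internals).
class ReshuffleToyModel {

    // Finish time when every shard is read AND heavily processed by one worker,
    // with no rebalancing. Assumes there are at least as many workers as shards,
    // so the slowest (largest) shard determines when the job finishes.
    static long fusedFinishTime(int[] shardSizes, long costPerElement) {
        long worst = 0;
        for (int size : shardSizes) {
            worst = Math.max(worst, size * costPerElement);
        }
        return worst;
    }

    // Finish time when every element is first assigned to a random worker
    // (standing in for a random-key reshuffle) and then processed there.
    static long reshuffledFinishTime(int[] shardSizes, int workers,
                                     long costPerElement, long seed) {
        Random random = new Random(seed);
        long[] load = new long[workers];
        for (int size : shardSizes) {
            for (int i = 0; i < size; i++) {
                load[random.nextInt(workers)] += costPerElement;
            }
        }
        long worst = 0;
        for (long workerLoad : load) {
            worst = Math.max(worst, workerLoad);
        }
        return worst;
    }

    public static void main(String[] args) {
        int[] shardSizes = {9000, 500, 500}; // hypothetical, deliberately uneven splits
        int workers = 10;
        long costPerElement = 1;             // arbitrary time units

        System.out.println("without reshuffle: "
                + fusedFinishTime(shardSizes, costPerElement));
        System.out.println("with reshuffle:    "
                + reshuffledFinishTime(shardSizes, workers, costPerElement, 42L));
    }
}
```

In this model the run without a reshuffle is bounded by the largest split, while the reshuffled run finishes close to the total work divided by the number of workers. The model leaves out the cost of materializing the temporary copy, which is the trade-off mentioned above.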
