Question

I'm struggling to batch up Google Pub/Sub data in Apache Beam before sending it on. Here is my basic code.

    p.begin()
        .apply("Input", PubsubIO.readAvros(CmgData.class).fromTopic("topicname"))
        .apply("Transform", ParDo.of(new TransformData()))
        .apply("Write", BigQueryIO.writeTableRows()
            .to(table)
            .withSchema(schema)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
    p.run().waitUntilFinish();

Apparently Apache Beam treats the data as unbounded because it comes from a subscription, but I want to batch it up and send it. There are a number of different items that mention "bounded":

- PCollection.IsBounded (https://beam.apache.org/documentation/sdks/javadoc/2.4.0/org/apache/beam/sdk/values/PCollection.IsBounded.html): seems to have no effect on the write.
- BoundedReadFromUnboundedSource (https://beam.apache.org/documentation/sdks/javadoc/2.4.0/org/apache/beam/sdk/io/BoundedReadFromUnboundedSource.html): I can't find a way to convert a PCollection into a bounded source, or vice versa.
- BoundedWindow (https://beam.apache.org/documentation/sdks/javadoc/2.4.0/org/apache/beam/sdk/transforms/windowing/BoundedWindow.html): I can't find a working use for it.
- Write.Method (https://beam.apache.org/documentation/sdks/javadoc/2.2.0/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.Write.Method.html): throws an IllegalArgumentException when I try to use it.

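Can someone point me in the right direction on how to declare that an object is bounded data, so that I can batch-process it rather than just stream it?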

Answer 1

For more details, you can see my other question, "BigQuery writeTableRows Always writing to buffer".

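However, adding the following three lines to the BigQueryIO write means the data is written in batches rather than streamed: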

    .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
    .withTriggeringFrequency(org.joda.time.Duration.standardMinutes(2))
    .withNumFileShards(1000)
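
For context, here is a minimal sketch of how those three lines might sit in a complete pipeline. It is an illustration under assumptions, not the asker's actual code: the project, topic, and table names and the single-column "payload" schema are hypothetical, and PubsubIO.readStrings() with a MapElements step stands in for readAvros(CmgData.class) and TransformData so that the example compiles without those classes. Note that the PCollection itself stays unbounded; as far as I understand it, these settings only change how BigQueryIO writes, switching from streaming inserts to load jobs triggered every two minutes.

```java
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.Collections;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptor;
import org.joda.time.Duration;

public class PubsubToBigQueryLoads {
  public static void main(String[] args) {
    StreamingOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(StreamingOptions.class);
    options.setStreaming(true); // the Pub/Sub source is unbounded

    // Hypothetical one-column schema, used only for this illustration.
    TableSchema schema = new TableSchema().setFields(Collections.singletonList(
        new TableFieldSchema().setName("payload").setType("STRING")));

    Pipeline p = Pipeline.create(options);
    p.apply("Input",
            // Hypothetical topic; Pub/Sub topics use the projects/<p>/topics/<t> form.
            PubsubIO.readStrings().fromTopic("projects/my-project/topics/my-topic"))
        .apply("Transform",
            // Stand-in for the asker's TransformData DoFn.
            MapElements.into(TypeDescriptor.of(TableRow.class))
                .via((String s) -> new TableRow().set("payload", s)))
        .apply("Write", BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table") // hypothetical table
            .withSchema(schema)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
            // The three lines from the answer: write via periodic load jobs.
            .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
            .withTriggeringFrequency(Duration.standardMinutes(2))
            .withNumFileShards(1000));
    p.run().waitUntilFinish();
  }
}
```

The IllegalArgumentException mentioned in the question is probably explained by this combination: when FILE_LOADS is used on an unbounded input, BigQueryIO expects a triggering frequency and a number of file shards to be set as well. The values 2 minutes and 1000 are the ones from the answer and may need tuning for other workloads.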

Apache Beam - BigQuery - Google Pub / Sub Batch
