BigQuery writeTableRows always writes to the streaming buffer

Time: 2018-04-19 07:34:34

Tags: google-bigquery apache-beam

We are trying to write to BigQuery using Apache Beam and Avro.

The following seems to work:

p.apply("Input", AvroIO.read(DataStructure.class).from("AvroSampleFile.avro"))
            .apply("Transform", ParDo.of(new CustomTransformFunction()))
            .apply("Load", BigQueryIO.writeTableRows().to(table).withSchema(schema));

We then tried to use the same approach to pull the data from Google Pub/Sub, as follows:

p.begin()
            .apply("Input", PubsubIO.readAvros(DataStructure.class).fromTopic("topicName"))
            .apply("Transform", ParDo.of(new CustomTransformFunction()))
            .apply("Write", BigQueryIO.writeTableRows()
                    .to(table)
                    .withSchema(schema)
                    .withTimePartitioning(timePartitioning)
                    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
        p.run().waitUntilFinish();

When we do this, the rows always land in the streaming buffer, and BigQuery seems to take a very long time to move them out of the buffer. Can anyone tell me why the above doesn't write the records directly to the BigQuery table?

Update: it looks like I need to add the following settings, but doing so throws a java.lang.IllegalArgumentException.

.withMethod(Method.FILE_LOADS)
.withTriggeringFrequency(org.joda.time.Duration.standardMinutes(2))

1 answer:

Answer 0 (score: 1):

The answer is that you need to include "withNumFileShards" (which can be anywhere from 1 to 1000), like this:

        p.begin()
            .apply("Input", PubsubIO.readAvros(DataStructure.class).fromTopic("topicName"))
            .apply("Transform", ParDo.of(new CustomTransformFunction()))
            .apply("Write", BigQueryIO.writeTableRows()
                    .to(table)
                    .withSchema(schema)
                    .withTimePartitioning(timePartitioning)
                    .withMethod(Method.FILE_LOADS)
                    .withTriggeringFrequency(org.joda.time.Duration.standardMinutes(2))
                    .withNumFileShards(1000)
                    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
        p.run().waitUntilFinish();
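For reference, here is a sketch of the imports the snippet above appears to assume. The exact Beam SDK version is not stated in the question, so treat this as an assumption (Beam Java SDK 2.x layout); Method here resolves to the nested enum BigQueryIO.Write.Method, whose FILE_LOADS value selects batch load jobs instead of the default streaming inserts.

```java
// Imports assumed by the pipeline snippet above (Beam Java SDK 2.x;
// package layout is an assumption, not stated in the question).
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.Method; // Method.FILE_LOADS
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.transforms.ParDo;
import org.joda.time.Duration; // used via org.joda.time.Duration.standardMinutes(2)
```

The IllegalArgumentException in the question is consistent with this requirement: with an unbounded source such as Pub/Sub, FILE_LOADS needs both a triggering frequency and a shard count, so withTriggeringFrequency and withNumFileShards have to be set together.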

I couldn't find this documented anywhere; only after fixing it did I come across a Jira ticket about it:

https://issues.apache.org/jira/browse/BEAM-3198