We are trying to write to BigQuery using Apache Beam and Avro.
The following seems to work:
p.apply("Input", AvroIO.read(DataStructure.class).from("AvroSampleFile.avro"))
.apply("Transform", ParDo.of(new CustomTransformFunction()))
.apply("Load", BigQueryIO.writeTableRows().to(table).withSchema(schema));
We then tried to use it to read data from Google Pub/Sub, like so:
p.begin()
.apply("Input", PubsubIO.readAvros(DataStructure.class).fromTopic("topicName"))
.apply("Transform", ParDo.of(new CustomTransformFunction()))
.apply("Write", BigQueryIO.writeTableRows()
.to(table)
.withSchema(schema)
.withTimePartitioning(timePartitioning)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
p.run().waitUntilFinish();
When we run this, it always pushes the records into the streaming buffer, and BigQuery seems to take a long time to read them out of it. Can anyone tell me why the above doesn't write the records directly to the BigQuery table?
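For context: when the input is unbounded, as with Pub/Sub, BigQueryIO defaults to streaming inserts, which is what routes rows through the streaming buffer. Making that default explicit would look something like this:

.apply("Write", BigQueryIO.writeTableRows()
    .to(table)
    .withSchema(schema)
    .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS))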
Update: It looks like I need to add the following settings, but doing so throws a java.lang.IllegalArgumentException.
.withMethod(Method.FILE_LOADS)
.withTriggeringFrequency(org.joda.time.Duration.standardMinutes(2))
Answer (score: 1)
The answer is that you need to include "withNumFileShards" like this (it can be anywhere from 1 to 1000):
p.begin()
.apply("Input", PubsubIO.readAvros(DataStructure.class).fromTopic("topicName"))
.apply("Transform", ParDo.of(new CustomTransformFunction()))
.apply("Write", BigQueryIO.writeTableRows()
.to(table)
.withSchema(schema)
.withTimePartitioning(timePartitioning)
.withMethod(Method.FILE_LOADS)
.withTriggeringFrequency(org.joda.time.Duration.standardMinutes(2))
.withNumFileShards(1000)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
p.run().waitUntilFinish();
I couldn't find this requirement documented anywhere; only after working out the fix did I come across a Jira ticket for it.
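The table, schema, and timePartitioning values referenced above aren't defined in the snippets; a minimal sketch of how they might be constructed (field names and the table path are placeholders):

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.api.services.bigquery.model.TimePartitioning;
import java.util.Arrays;

// Destination table in project:dataset.table form.
String table = "my-project:my_dataset.my_table";

// Schema matching the TableRow fields produced by the transform.
TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("id").setType("STRING"),
        new TableFieldSchema().setName("name").setType("STRING")));

// Day-based partitioning on ingestion time.
TimePartitioning timePartitioning = new TimePartitioning().setType("DAY");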