While loading unbounded input from Kafka into BigQuery, I came across the .withMethod() option. When using Method.FILE_LOADS, I also have to specify a triggering frequency as well as a non-zero numFileShards.
My question is about this method:
/**
 * Control how many file shards are written when using BigQuery load jobs.
 * Applicable only when also setting {@link #withTriggeringFrequency}.
 * The default value is 1000.
 */
@Experimental
public Write<T> withNumFileShards(int numFileShards) {
  checkArgument(numFileShards > 0, "numFileShards must be > 0, but was: %s", numFileShards);
  return toBuilder().setNumFileShards(numFileShards).build();
}
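For reference, here is roughly how my write step is configured. This is a sketch: the table spec and the `rows` collection name are placeholders, not my actual code.

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    // rows: the unbounded PCollection<TableRow> read from Kafka (placeholder)
    rows.apply("WriteToBQ",
        BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table")  // hypothetical table spec
            .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
            // Required for FILE_LOADS on unbounded input: how often load jobs fire.
            .withTriggeringFrequency(Duration.standardMinutes(5))
            // Assuming the table already exists; otherwise supply withSchema(...).
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            // Must be > 0, or the pipeline fails at construction time.
            .withNumFileShards(1000));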
The exception I get when I do not set numFileShards (a checkArgument in BatchLoads.expandTriggered fails):
Exception in thread "main" java.lang.IllegalArgumentException
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:108)
at org.apache.beam.sdk.io.gcp.bigquery.BatchLoads.expandTriggered(BatchLoads.java:212)
at org.apache.beam.sdk.io.gcp.bigquery.BatchLoads.expand(BatchLoads.java:557)
at org.apache.beam.sdk.io.gcp.bigquery.BatchLoads.expand(BatchLoads.java:79)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:471)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:325)
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Write.expandTyped(BigQueryIO.java:1656)
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Write.expand(BigQueryIO.java:1602)
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Write.expand(BigQueryIO.java:1068)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:488)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:338)
at come.geotab.bigdata.streaming.mapenrichedgps.MainApplication.main(MainApplication.java:119)
Answer 0 (score: 1)
Writing data into BigQuery can work in different ways. FILE_LOADS means that Beam will write your windowed data to Avro files, and then trigger BigQuery load jobs to import the contents of those files.

The number of file shards controls how many files your data will be written into, and thus the parallelism of the BigQuery load job.
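Concretely, the two required options map onto those two stages. The values below are illustrative, not recommendations:

    BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.my_table")  // hypothetical table spec
        .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
        // Stage 1: every 5 minutes, the data buffered so far is flushed to
        // files and a BigQuery load job is triggered for them.
        .withTriggeringFrequency(Duration.standardMinutes(5))
        // Stage 2: each flush spreads the data over 100 files, so writing the
        // files (and loading them) can proceed with up to 100-way parallelism.
        .withNumFileShards(100);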
Hope that helps!