Beam.BigQueryIO: what is numFileShards for?

Asked: 2018-09-10 20:21:55

Tags: google-cloud-dataflow apache-beam

While loading unbounded input from Kafka into BigQuery, I came across the .withMethod() option. When using Method.FILE_LOADS, I also have to specify a triggering frequency as well as a non-zero numFileShards.
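For context, this is roughly how such a write is wired up (a minimal sketch; the `rows` collection, table name, frequency, and shard count here are placeholder assumptions, not values from my actual pipeline):

```java
// Sketch of a streaming write to BigQuery via load jobs (placeholder values).
rows.apply(
    BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.my_table")                 // placeholder table
        .withMethod(BigQueryIO.Write.Method.FILE_LOADS)       // batch loads instead of streaming inserts
        .withTriggeringFrequency(Duration.standardMinutes(5)) // how often load jobs are kicked off
        .withNumFileShards(1000));                            // must be > 0 when a triggering frequency is set
```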

My questions are:

  1. What does the number of file shards control? What is it actually for? From what I observe, it is definitely not the count of the temporary files generated in my GCS temp location that are visible to me. So what number should I set here?
  2. According to the source code, quoted below, the default value should be 1000, but in practice it is 0: I got an exception when I did not set it explicitly, and it went away when I set it to 1. But I still don't understand what it means and what value I should set, lol.
/**
 * Control how many file shards are written when using BigQuery load jobs.
 * Applicable only when also setting {@link #withTriggeringFrequency}.
 * The default value is 1000.
 */

@Experimental
public Write<T> withNumFileShards(int numFileShards) {
  checkArgument(numFileShards > 0, "numFileShards must be > 0, but was: %s", numFileShards);
  return toBuilder().setNumFileShards(numFileShards).build();
}
  3. Is there a way to specify the batch size by record count instead of by duration?

The exception I got when numFileShards was not set:

Exception in thread "main" java.lang.IllegalArgumentException
    at com.google.common.base.Preconditions.checkArgument(Preconditions.java:108)
    at org.apache.beam.sdk.io.gcp.bigquery.BatchLoads.expandTriggered(BatchLoads.java:212)
    at org.apache.beam.sdk.io.gcp.bigquery.BatchLoads.expand(BatchLoads.java:557)
    at org.apache.beam.sdk.io.gcp.bigquery.BatchLoads.expand(BatchLoads.java:79)
    at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
    at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:471)
    at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:325)
    at org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Write.expandTyped(BigQueryIO.java:1656)
    at org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Write.expand(BigQueryIO.java:1602)
    at org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Write.expand(BigQueryIO.java:1068)
    at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
    at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:488)
    at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:338)
    at come.geotab.bigdata.streaming.mapenrichedgps.MainApplication.main(MainApplication.java:119)

1 Answer:

Answer 0 (score: 1)

Writing data into BigQuery can work in different ways. FILE_LOADS means that Beam will write your windowed PCollection into Avro files, and will then trigger BigQuery load jobs to import the contents of those files.

The number of file shards controls how many files FILE_LOADS will write your data into, and hence the parallelism of the BigQuery load job.
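As a toy illustration of that point (this is not Beam's internal sharding code, just a model of the idea): each triggering pane's records get distributed across numFileShards files, and each file then becomes one parallel input to the load job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ShardDemo {
    // Toy model: distribute one pane's records round-robin across
    // numFileShards "files". (Beam's real implementation assigns shard
    // keys internally; this only illustrates the shards -> files idea.)
    static List<List<String>> shard(List<String> records, int numFileShards) {
        List<List<String>> files = new ArrayList<>();
        for (int i = 0; i < numFileShards; i++) {
            files.add(new ArrayList<>());
        }
        for (int i = 0; i < records.size(); i++) {
            files.get(i % numFileShards).add(records.get(i));
        }
        return files;
    }

    public static void main(String[] args) {
        List<List<String>> files = shard(Arrays.asList("r1", "r2", "r3", "r4", "r5"), 2);
        System.out.println(files.size()); // 2 files handed to the load job
        System.out.println(files);        // [[r1, r3, r5], [r2, r4]]
    }
}
```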

Hope this helps!