Question

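I am experimenting with BigQueryIO writes using load jobs. My load triggering frequency is set to 18 hours. I am ingesting data from Kafka with fixed daily windows.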

Based on https://github.com/apache/beam/blob/v2.2.0/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L213-L231, it seems the intended behavior is to offload rows to the filesystem once a pane contains at least 500k records.

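I managed to generate about 600K records and waited around 2 hours to see whether the rows were uploaded to GCS, but nothing appeared. I noticed that the "GroupByDestination" step in "BatchLoads" shows 0 under "Output collections" size.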

When I use a smaller load triggering frequency, everything seems fine. Shouldn't AfterPane.elementCountAtLeast(FILE_TRIGGERING_RECORD_COUNT) have fired?

Here is the code that writes to BigQuery:

  BigQueryIO
  .writeTableRows()
  .to(new SerializableFunction[ValueInSingleWindow[TableRow], TableDestination]() {
    override def apply(input: ValueInSingleWindow[TableRow]): TableDestination = {
      val startWindow = input.getWindow.asInstanceOf[IntervalWindow].start()
      val dayPartition = DateTimeFormat.forPattern("yyyyMMdd").withZone(DateTimeZone.UTC).print(startWindow)
      new TableDestination("myproject_id:mydataset_id.table$" + dayPartition, null)
    }
  })
  .withMethod(Method.FILE_LOADS)
  .withCreateDisposition(CreateDisposition.CREATE_NEVER)
  .withWriteDisposition(WriteDisposition.WRITE_APPEND)
  .withSchema(BigQueryUtils.schemaOf[MySchema])
  .withTriggeringFrequency(Duration.standardHours(18))
  .withNumFileShards(10)

The job ID is 2018-02-16_14_34_54-7547662103968451637. Thanks in advance.

Answer 1

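Panes are per key per window, and BigQueryIO.write() with dynamic destinations uses the destination as the key under the hood, so the "500k elements in pane" condition applies per destination per window.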
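To make the per-key counting concrete, here is a minimal plain-Scala sketch (it does not use Beam). It only simulates an element-count condition that is evaluated separately for each destination. The object name, the helper functions, and the even split of 600k records over two day partitions are hypothetical choices for illustration. The 500k threshold is taken from the question, not verified against the Beam source.

```scala
object PaneCountSketch {
  // Assumed threshold, standing in for FILE_TRIGGERING_RECORD_COUNT (500k per the question).
  val FileTriggeringRecordCount: Int = 500000

  // Count buffered elements per destination, which is the granularity
  // at which an element-count condition on a pane is evaluated.
  def paneSizes(destinations: Seq[String]): Map[String, Int] =
    destinations.groupBy(identity).map { case (dest, rs) => dest -> rs.size }

  // Destinations whose own pane has reached the threshold.
  def firedDestinations(destinations: Seq[String], threshold: Int): Set[String] =
    paneSizes(destinations).collect { case (dest, n) if n >= threshold => dest }.toSet

  def main(args: Array[String]): Unit = {
    // Hypothetical input: 600k records split evenly over two day partitions.
    val records =
      Seq.fill(300000)("table$20180216") ++ Seq.fill(300000)("table$20180217")
    // The total exceeds 500k, but no single destination does, so nothing fires.
    println(firedDestinations(records, FileTriggeringRecordCount))
  }
}
```

In this simulation the total record count is irrelevant. Only the count accumulated for one destination decides whether the element-count condition is met before the time-based trigger.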

BigQueryIO loads do not offload rows to GCS when the early trigger fires
