Question

According to the Apache Beam 2.0.0 SDK documentation, GroupIntoBatches works only with KV collections.

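My dataset contains only values, and there is no need to introduce keys. However, to use GroupIntoBatches I had to implement "fake" keys, with an empty string as the key: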

static class FakeKVFn extends DoFn<String, KV<String, String>> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    c.output(KV.of("", c.element()));
  }
}

So the overall pipeline looks like this:

public static void main(String[] args) {
  PipelineOptions options = PipelineOptionsFactory.create();
  Pipeline p = Pipeline.create(options);

  long batchSize = 100L;

  p.apply("ReadLines", TextIO.read().from("./input.txt"))
      .apply("FakeKV", ParDo.of(new FakeKVFn()))
      .apply(GroupIntoBatches.<String, String>ofSize(batchSize))
      .setCoder(KvCoder.of(StringUtf8Coder.of(), IterableCoder.of(StringUtf8Coder.of())))
      .apply(ParDo.of(new DoFn<KV<String, Iterable<String>>, String>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
          c.output(callWebService(c.element().getValue()));
        }
      }))
      .apply("WriteResults", TextIO.write().to("./output/"));

  p.run().waitUntilFinish();
}

Is there any way to group into batches without introducing "fake" keys?

Answer 1

GroupIntoBatches requires KV input because the transform is implemented with state and timers, which are scoped per key and window.

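For each key+window pair, state and timers necessarily execute serially (or observably so). You have to express the available parallelism manually by providing keys (and windows, though no runner that I know of parallelizes over windows today). The two most common approaches are: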

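Use some natural key, such as a user ID.
Choose some fixed number of shards and key randomly. This can be harder to tune: you need enough shards to get sufficient parallelism, but each shard must still receive enough data for GroupIntoBatches to actually be useful.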
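
To make the second approach and its tuning trade-off more concrete, here is a minimal sketch in plain Java (not Beam code). It assigns each value a random key out of a fixed number of shards and then cuts each shard into batches, which roughly mimics what keying followed by GroupIntoBatches would do. The names ShardedBatching, randomShardKey and batchPerShard, as well as the shard count and batch size, are hypothetical and chosen only for illustration. In the pipeline from the question, the corresponding change would presumably be to emit KV.of(shardKey, c.element()) instead of KV.of("", c.element()), with the key type and coder adjusted accordingly.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical illustration class; not part of the Beam SDK.
class ShardedBatching {

    // Picks one of numShards keys uniformly at random (0 .. numShards - 1).
    static int randomShardKey(int numShards) {
        return ThreadLocalRandom.current().nextInt(numShards);
    }

    // Assigns every value a random shard key, then fills batches of at most
    // batchSize elements separately for each shard.
    static Map<Integer, List<List<String>>> batchPerShard(
            List<String> values, int numShards, int batchSize) {
        Map<Integer, List<List<String>>> batches = new HashMap<>();
        for (String value : values) {
            int key = randomShardKey(numShards);
            List<List<String>> shardBatches =
                    batches.computeIfAbsent(key, k -> new ArrayList<>());
            // Start a new batch if the shard has none yet or the last one is full.
            if (shardBatches.isEmpty()
                    || shardBatches.get(shardBatches.size() - 1).size() >= batchSize) {
                shardBatches.add(new ArrayList<>());
            }
            shardBatches.get(shardBatches.size() - 1).add(value);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> values = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            values.add("line-" + i);
        }
        // Assumed settings: 4 shards, batches of up to 100 elements.
        Map<Integer, List<List<String>>> result = batchPerShard(values, 4, 100);
        for (Map.Entry<Integer, List<List<String>>> e : result.entrySet()) {
            System.out.println("shard " + e.getKey() + ": "
                    + e.getValue().size() + " batches");
        }
    }
}
```

With 1000 values and a batch size of 100, a handful of shards still yields mostly full batches. With, say, 50 shards each shard would receive about 20 values on average, so no batch would come close to filling up; that is the tuning problem described above.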

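Adding a single dummy key to all elements, as in your snippet, will cause the transform not to execute in parallel at all. This is similar to the discussion in "Stateful indexing causes ParDo to be run single-threaded on Dataflow Runner".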

GroupIntoBatches for non-KV elements
