Question

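I've read through the Splittable DoFn blog post, and as I understand it, this feature is already available in TextIO (for the Cloud Dataflow runner). What I'm not clear on is whether, using TextIO, I will be able to read the lines of a given file in parallel.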

Answer 1

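For Java only, the TextIO source will automatically read an uncompressed file in parallel.

This isn't formally documented, but the TextIO source is a subclass of FileBasedSource, which allows seeking. That means a worker can split its work if it decides to. See the splitting code in FileBasedSource.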

Answer 2

Cubez's answer is good. I'd also like to add that TextIO, being both a PTransform and an I/O connector, implements the expand() method:

@Override
public PCollection<String> expand(PCollection<FileIO.ReadableFile> input) {
  return input.apply(
      "Read all via FileBasedSource",
      new ReadAllViaFileBasedSource<>(
          getDesiredBundleSizeBytes(),
          new CreateTextSourceFn(getDelimiter()),
          StringUtf8Coder.of()));
}

Looking further, we can see that the ReadAllViaFileBasedSource class also has an expand() method, defined as follows:

@Override
public PCollection<T> expand(PCollection<ReadableFile> input) {
  return input
      .apply("Split into ranges", ParDo.of(new SplitIntoRangesFn(desiredBundleSizeBytes)))
      .apply("Reshuffle", Reshuffle.viaRandomKey())
      .apply("Read ranges", ParDo.of(new ReadFileRangesFn<>(createSource)))
      .setCoder(coder);
}

This is how the underlying runner can distribute the PCollection across executors and read it in parallel.
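
The shape of that pipeline (split each file into byte ranges, redistribute the ranges, read each range independently) can be illustrated without Beam. What follows is a minimal sketch in plain Java, not Beam's actual implementation. The class and method names (RangeSplitSketch, splitIntoRanges, readRange, readAllInParallel) are hypothetical, a parallel stream stands in for the Reshuffle step and the runner's workers, and only '\n' delimiters are handled. It assumes the convention that a line belongs to the range in which its first byte falls, so each line is read exactly once.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

class RangeSplitSketch {

    /** A half-open byte range [start, end) of one file (hypothetical helper type). */
    static final class OffsetRange {
        final long start;
        final long end;

        OffsetRange(long start, long end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Cuts [0, fileSize) into consecutive ranges of at most desiredBundleSizeBytes. */
    static List<OffsetRange> splitIntoRanges(long fileSize, long desiredBundleSizeBytes) {
        if (desiredBundleSizeBytes <= 0) {
            throw new IllegalArgumentException("bundle size must be positive");
        }
        List<OffsetRange> ranges = new ArrayList<>();
        for (long start = 0; start < fileSize; start += desiredBundleSizeBytes) {
            ranges.add(new OffsetRange(start, Math.min(start + desiredBundleSizeBytes, fileSize)));
        }
        return ranges;
    }

    /** Returns every line whose first byte lies inside the given range. */
    static List<String> readRange(Path file, OffsetRange range) throws IOException {
        List<String> lines = new ArrayList<>();
        try (RandomAccessFile in = new RandomAccessFile(file.toFile(), "r")) {
            if (range.start > 0) {
                // Back up one byte and discard up to the next '\n'. This lands on
                // the first line start at or after range.start.
                in.seek(range.start - 1);
                readLine(in);
            }
            // A line that starts before range.end is read in full, even if it
            // runs past the end of the range.
            while (in.getFilePointer() < range.end) {
                String line = readLine(in);
                if (line == null) {
                    break;
                }
                lines.add(line);
            }
        }
        return lines;
    }

    /** Reads bytes up to and including the next '\n'; returns null at end of file. */
    private static String readLine(RandomAccessFile in) throws IOException {
        // Unbuffered single-byte reads keep the sketch short; they are slow.
        int b = in.read();
        if (b == -1) {
            return null;
        }
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        while (b != -1 && b != '\n') {
            buf.write(b);
            b = in.read();
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Splits the file into ranges and reads them concurrently. */
    static List<String> readAllInParallel(Path file, long desiredBundleSizeBytes)
            throws IOException {
        return splitIntoRanges(Files.size(file), desiredBundleSizeBytes)
                .parallelStream()
                .flatMap(range -> {
                    try {
                        return readRange(file, range).stream();
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("range-split-sketch", ".txt");
        Files.write(file, "first\nsecond\nthird\n".getBytes(StandardCharsets.UTF_8));
        // A tiny bundle size forces several ranges even for a small file.
        for (String line : readAllInParallel(file, 4)) {
            System.out.println(line);
        }
        Files.delete(file);
    }
}
```

The sketch returns lines in file order because a Java stream over a list keeps encounter order. A distributed runner makes no such promise, so a real pipeline should not depend on line order. The approach also relies on being able to seek to an arbitrary offset, which is consistent with the first answer limiting parallel reads to uncompressed files.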

Can the latest version of TextIO (2.11 and later) read lines from a file in parallel?
