Question

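Does anyone know how to get the file name when using file pattern matching in google-cloud-dataflow?

I'm a newbie to Dataflow. How can I get the file name when using file pattern matching like this?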

p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*.txt"))

I would like to know how to detect the file names, such as kinglear.txt, Hamlet.txt, and so on.

Answer 1

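If you simply want to expand the file pattern and get the list of file names that match it, you can use GcsIoChannelFactory.match("gs://dataflow-samples/shakespeare/*.txt") (see GcsIoChannelFactory).

If you want to access the "current file name" from one of the DoFns downstream in your pipeline, that is currently not supported (though there are some workarounds, see below). It is a common feature request, and we are still thinking about how best to fit it into the framework in a natural, generic, and high-performance way.

Some workarounds include:

Write the pipeline like this (the tf-idf example uses this approach):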

    DoFn readFile = ...(takes a filename, reads the file and produces records)...
    p.apply(Create.of(filenames))
     .apply(ParDo.of(readFile))
     .apply(the rest of your pipeline)

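This has the downside that dynamic work rebalancing will not work particularly well, because it currently applies only at the level of Read PTransforms, not at the level of ParDos with high fan-out (like the one here, which reads a file and produces all of its records). Parallelization will only work at the level of files, and files will not be split into sub-ranges. At the scale of reading Shakespeare this is not an issue, but if you are reading a set of files of wildly different sizes, some of them very large, it may become one.

Implement your own FileBasedSource (javadoc, general documentation) that returns records of type Pair<String, T>, where String is the file name and T is the record you are reading. In this case the framework handles the file pattern matching for you and dynamic work rebalancing works fine, but it is up to you to write the reading logic in your source's reader (FileBasedReader).

Neither workaround is ideal, but depending on your requirements, one of them may do the trick for you.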
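
To make the first workaround more concrete, here is a minimal sketch in plain Python (standard library only, no Dataflow SDK) of what the readFile step has to do: take a file name, read the file, and emit each record paired with that name. The names read_file_with_name, read_pattern, and demo, as well as the local temporary files, are assumptions made for illustration; a real pipeline would read gs:// paths through the SDK's own file system layer.

```python
import glob
import os
import tempfile


def read_file_with_name(path):
    """Yield (file name, line) pairs for every line in one file."""
    name = os.path.basename(path)
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            yield name, line.rstrip("\n")


def read_pattern(pattern):
    """Expand a glob pattern, then read each matching file in turn."""
    records = []
    for path in sorted(glob.glob(pattern)):
        records.extend(read_file_with_name(path))
    return records


def demo():
    """Write two small sample files and read them back with their names."""
    samples = {
        "hamlet.txt": "To be\nor not to be\n",
        "kinglear.txt": "Nothing\n",
    }
    with tempfile.TemporaryDirectory() as tmp:
        for name, text in samples.items():
            with open(os.path.join(tmp, name), "w", encoding="utf-8") as handle:
                handle.write(text)
        # Escape the directory so only "*.txt" is treated as a pattern.
        return read_pattern(os.path.join(glob.escape(tmp), "*.txt"))


if __name__ == "__main__":
    for name, line in demo():
        print(name, line)
```

Each file is consumed by a single call to read_file_with_name, which mirrors the drawback described above: work can be spread across files, but one very large file is never split into sub-ranges.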

Answer 2

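One approach is to build a List<PCollection> in which each entry corresponds to one input file, and then use Flatten. For example, if you want to parse each line of a collection of files into a Foo object, you could do something like this: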

public static class FooParserFn extends DoFn<String, Foo> {
  private String fileName;
  public FooParserFn(String fileName) {
    this.fileName = fileName;
  }

  @Override
  public void processElement(ProcessContext processContext) throws Exception {
    String line = processContext.element();
    // here you have access to both the line of text and the name of the file
    // from which it came.
  }
}

public static void main(String[] args) {
  ...
  List<String> inputFiles = ...;
  List<PCollection<Foo>> foosByFile =
          Lists.transform(inputFiles,
          new Function<String, PCollection<Foo>>() {
            @Override
            public PCollection<Foo> apply(String fileName) {
              return p.apply(TextIO.Read.from(fileName))
                      .apply(ParDo.of(new FooParserFn(fileName)));
            }
          });

  PCollection<Foo> foos = PCollectionList.<Foo>empty(p).and(foosByFile).apply(Flatten.<Foo>pCollections());
  ...
}

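One drawback of this approach is that if you have 100 input files, you will also have 100 nodes in the Cloud Dataflow monitoring console, which makes it hard to tell what is going on. I'd be interested to hear from the Google Cloud Dataflow folks whether this approach is effective.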

Answer 3

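When using code similar to @danvk's, I also ended up with 100 input files = 100 nodes on the dataflow graph. I switched to the approach below, which combines all the reads into a single block that you can expand to drill down into each file/directory that was read. In our use case, the job also ran faster with this approach than with the Lists.transform approach.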

GcsOptions gcsOptions = options.as(GcsOptions.class);
List<GcsPath> paths = gcsOptions.getGcsUtil().expand(GcsPath.fromUri(options.getInputFile()));
List<String> filesToProcess = paths.stream().map(item -> item.toString()).collect(Collectors.toList());

PCollectionList<SomeClass> pcl = PCollectionList.empty(p);
for(String fileName : filesToProcess) {
    pcl = pcl.and(
            p.apply("ReadAvroFile" + fileName, AvroIO.Read.named("ReadFromAvro")
                    .from(fileName)
                    .withSchema(SomeClass.class)
            )
            .apply(ParDo.of(new MyDoFn(fileName)))
    );
}

// flatten the PCollectionList, combining all the PCollections together
PCollection<SomeClass> flattenedPCollection = pcl.apply(Flatten.pCollections());

Answer 4

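Based on the latest SDK update. Java (sdk 2.9.0):

Beam's TextIO readers do not give access to the file name itself. For these use cases we need to use FileIO to match the files and gain access to the information stored in the file name. Unlike with TextIO, the user has to take care of reading the file in transforms downstream of the FileIO read. The result of a FileIO read is a PCollection of ReadableFile objects; the ReadableFile class carries the file name as metadata, which can be used together with the contents of the file.

FileIO does have a convenience method, readFullyAsUTF8String(), which reads the entire file into a String object; note that this loads the whole file into memory first. If memory is a concern, you can work with the file directly using utility classes such as FileSystems.

From: Document Link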

PCollection<KV<String, String>> filesAndContents = p
     .apply(FileIO.match().filepattern("hdfs://path/to/*.gz"))
     // withCompression can be omitted - by default compression is detected from the filename.
     .apply(FileIO.readMatches().withCompression(GZIP))
     .apply(MapElements
         // uses imports from TypeDescriptors
         .into(KVs(strings(), strings()))
         .via((ReadableFile f) -> KV.of(
             f.getMetadata().resourceId().toString(), f.readFullyAsUTF8String())));

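Python (sdk 2.9.0):

With 2.9.0 for Python, you will need to collect the list of URIs outside of the Dataflow pipeline and feed it into the pipeline as a parameter. For example, use FileSystems to read the list of files via a glob pattern, then pass that to a PCollection for processing.

Once fileio (see PR https://github.com/apache/beam/pull/7791/) is available, the following code will also be an option for Python.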

import apache_beam as beam
from apache_beam.io import fileio

with beam.Pipeline() as p:
  readable_files = (p
                    | fileio.MatchFiles('hdfs://path/to/*.txt')
                    | fileio.ReadMatches()
                    | beam.Reshuffle())
  # Pair each file's path with its full contents decoded as UTF-8.
  files_and_contents = (readable_files
                        | beam.Map(lambda x: (x.metadata.path,
                                              x.read_utf8())))
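
Because read_utf8(), like readFullyAsUTF8String() on the Java side, loads the whole file into memory, it can help to decode a file lazily instead. The following is a minimal sketch using only the Python standard library: it wraps a binary stream and yields (path, line) pairs one at a time. The helper name iter_lines_with_path is hypothetical, and the sketch assumes that whatever stream your file system layer hands back behaves like a standard readable binary file object, which is worth checking for your setup.

```python
import io


def iter_lines_with_path(path, binary_stream):
    """Lazily yield (path, line) pairs from a readable binary stream.

    Lines are decoded as UTF-8 one at a time, so the whole file is never
    held in memory at once.
    """
    text_stream = io.TextIOWrapper(binary_stream, encoding="utf-8")
    for line in text_stream:
        yield path, line.rstrip("\n")


if __name__ == "__main__":
    # An in-memory stream stands in for an opened file (illustrative data).
    fake_file = io.BytesIO("first line\nsecond line\n".encode("utf-8"))
    for path, line in iter_lines_with_path("hdfs://path/to/example.txt", fake_file):
        print(path, line)
```

In a pipeline, a generator like this could serve as the body of a FlatMap over the readable files, so that each output element is a single line tagged with its path. That wiring is left out here because it depends on the SDK version in use.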

Answer 5

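This may be a very late post for the question above, but I wanted to add an answer that uses classes bundled with Beam.

It can also be seen as code extracted from the solution provided by @Reza Rokni.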

PCollection<String> listOfFilenames =
    pipe.apply(FileIO.match().filepattern("gs://apache-beam-samples/shakespeare/*"))
        .apply(FileIO.readMatches())
        .apply(
            MapElements.into(TypeDescriptors.strings())
                .via(
                    (FileIO.ReadableFile file) -> {
                      String f = file.getMetadata().resourceId().getFilename();
                      System.out.println(f);
                      return f;
                    }));

pipe.run().waitUntilFinish();

The PCollection<String> above will contain the list of files available in whatever directory you provide.
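
Note that getFilename() returns only the last path component (for example kinglear.txt), whereas resourceId().toString(), as used in Answer 4, gives the full path. As a rough illustration of that difference, here is a small Python sketch using only the standard library. The helper name short_name and the sample URI are made up for the example.

```python
import posixpath


def short_name(uri):
    """Return the last path component of a slash-separated URI."""
    return posixpath.basename(uri.rstrip("/"))


if __name__ == "__main__":
    full = "gs://apache-beam-samples/shakespeare/kinglear.txt"
    print(full)              # the full path
    print(short_name(full))  # the last component only
```

Which form you want depends on the use case: the short name is easier to read, but the full path is the safer key when files in different directories can share a name.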
