Question

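I am trying to process PDF files from an input bucket in a Beam pipeline, and to write the results, the inputs, and the intermediate files all to a separate output bucket.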

My pipeline

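The file names for all three outputs are derived in the last step, and there is a 1:1 mapping between input files and output file names, so I don't want a shard template in the output file names (my UniquePrefixFileNaming class does the same thing as TextIO.withoutSharding()).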

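Because the file names are only known in the last step, I don't think I can set up tagged outputs and write the files in each of the earlier processing steps; I have to carry the data all the way through the pipeline.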

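What is the best way to achieve this? Below is my attempt at the problem. The text outputs work fine, but I have no solution for the PDF output (no binary output sink is available, and no binary data is passed through). Is FileIO.writeDynamic the best approach?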

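import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.tika.TikaIO;
import org.apache.beam.sdk.transforms.Contextful;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.PCollection;

Pipeline p = Pipeline.create();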

        PCollection<MyProcessorTransformResult> transformCollection = p.apply(FileIO.match().filepattern("Z:\\Inputs\\en_us\\**.pdf"))
                .apply(FileIO.readMatches())
                .apply(TikaIO.parseFiles())
                .apply(ParDo.of(new MyProcessorTransform()));

        // Write output PDF (incomplete: this is the branch the question is about)
        transformCollection.apply(FileIO.<String, MyProcessorTransformResult>writeDynamic()
                .withTempDirectory("Z:\\Temp\\vbeam")
                .by(input -> input.data.getResourceKey())
                .via(
                        // Placeholder: no binary data reaches this step yet, and via(...)
                        // still needs a FileIO.Sink<byte[]> as its second argument
                        Contextful.fn((SerializableFunction<MyProcessorTransformResult, byte[]>) input -> new byte[] {})
                )
                .withNaming(d -> new UniquePrefixFileNaming(d, ".pdf"))
                .withNumShards(1)
                // The destination type is String, so the destination coder is a String coder
                .withDestinationCoder(StringUtf8Coder.of())
                .to("Z:\\Outputs"));

        // Write output TXT
        transformCollection.apply(FileIO.<String, MyProcessorTransformResult>writeDynamic()
                .withTempDirectory("Z:\\Temp\\vbeam")
                .by(input -> input.data.getResourceKey())
                .via(
                        Contextful.fn((SerializableFunction<MyProcessorTransformResult, String>) input -> input.originalContent),
                        TextIO.sink()
                )
                .withNaming(d -> new UniquePrefixFileNaming(d, ".pdf.txt"))
                .withNumShards(1)
                .withDestinationCoder(StringUtf8Coder.of())
                .to("Z:\\Outputs"));

        // Write output JSON
        transformCollection.apply(FileIO.<String, MyProcessorTransformResult>writeDynamic()
                .withTempDirectory("Z:\\Temp\\vbeam")
                .by(input -> input.data.getResourceKey())
                .via(
                        Contextful.fn((SerializableFunction<MyProcessorTransformResult, String>) input -> SerializationHelpers.toJSON(input.data)),
                        TextIO.sink()
                )
                .withNaming(d -> new UniquePrefixFileNaming(d, ".pdf.json"))
                .withNumShards(1)
                .withDestinationCoder(StringUtf8Coder.of())
                .to("Z:\\Outputs"));

        p.run();
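
On the missing binary sink: Beam's FileIO.Sink interface has only three methods (open, write, flush), so a byte-array sink is small to write by hand. The following is a minimal sketch, not a confirmed solution from this thread. The class name ByteArraySinkSketch is hypothetical, and the `implements FileIO.Sink<byte[]>` clause is deliberately left out so the sketch compiles without Beam on the classpath.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;

/**
 * Sketch of a binary sink. The three methods mirror the shape of Beam's
 * FileIO.Sink<byte[]>; the "implements" clause is omitted (assumption:
 * no Beam dependency available here).
 */
final class ByteArraySinkSketch {
    private WritableByteChannel channel;

    /** Called once per output file with the channel to write to. */
    public void open(WritableByteChannel channel) throws IOException {
        this.channel = channel;
    }

    /** Writes one element's bytes verbatim, with no delimiter or header. */
    public void write(byte[] element) throws IOException {
        ByteBuffer buffer = ByteBuffer.wrap(element);
        // A channel may accept fewer bytes than offered, so loop until drained.
        while (buffer.hasRemaining()) {
            channel.write(buffer);
        }
    }

    /** Nothing is buffered in this sketch, so there is nothing to flush. */
    public void flush() throws IOException {
    }
}
```

With Beam available, such a class would presumably be declared as implementing FileIO.Sink<byte[]> and passed as the second argument to via(...), next to a function that extracts the PDF bytes from MyProcessorTransformResult. That presupposes the result object actually carries the bytes, which the code above does not show.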

Answer 1

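I ended up writing my own file sink that saves all 3 outputs. FileIO is very much tailored to streaming, with windows and panes to split the data up. My FileIO sink step kept running out of memory because it tried to aggregate everything before doing any actual writing, since a batch job in Beam runs in a single window. I had no such problems with my custom DoFn.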

My advice to anyone looking into this is to do the same. You can try hooking into Beam's FileSystems class, or look at jclouds for filesystem-agnostic storage.
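
The answer does not include its code, so the following is only a sketch of what a per-document writer for the three outputs could look like. All names here (TripleOutputWriter, ChannelOpener, writeAll, localFiles) are hypothetical. Storage is hidden behind a small opener interface so the same logic can run against local files or, inside a DoFn, against a channel obtained from Beam's FileSystems.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper that writes the PDF, text and JSON outputs for one document. */
final class TripleOutputWriter {
    /** Opens a writable channel for a destination path (the storage-specific part). */
    interface ChannelOpener {
        WritableByteChannel open(String path) throws IOException;
    }

    private final ChannelOpener opener;
    private final String outputDir;

    TripleOutputWriter(ChannelOpener opener, String outputDir) {
        this.opener = opener;
        this.outputDir = outputDir;
    }

    /** Writes key.pdf, key.pdf.txt and key.pdf.json, one file each, no sharding. */
    void writeAll(String key, byte[] pdf, String text, String json) throws IOException {
        writeBytes(key + ".pdf", pdf);
        writeBytes(key + ".pdf.txt", text.getBytes(StandardCharsets.UTF_8));
        writeBytes(key + ".pdf.json", json.getBytes(StandardCharsets.UTF_8));
    }

    private void writeBytes(String fileName, byte[] content) throws IOException {
        // Each file is written and closed immediately, so nothing accumulates in memory.
        try (WritableByteChannel channel = opener.open(outputDir + "/" + fileName)) {
            ByteBuffer buffer = ByteBuffer.wrap(content);
            while (buffer.hasRemaining()) {
                channel.write(buffer);
            }
        }
    }

    /** Local-filesystem opener, used here so the sketch runs without Beam. */
    static ChannelOpener localFiles() {
        return path -> Files.newByteChannel(
                Paths.get(path),
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING);
    }
}
```

Inside a Beam DoFn, the opener could instead be something like `path -> FileSystems.create(FileSystems.matchNewResource(path, false), "application/octet-stream")`, and writeAll would be called once per element from the @ProcessElement method. Writing directly from a DoFn gives up FileIO's temp-file and finalization handling, so a retried bundle may rewrite a file; with a 1:1 input-to-output mapping that rewrite should be idempotent.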

What is the ideal way to design this Apache Beam transform that outputs multiple files, including binary outputs?
