Question

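I have a simple Dataflow Java job that reads a few lines from a .csv file. Each line contains a numeric cell giving the number of steps for which a certain function has to be executed on that line.

I don't want to do this with a traditional for loop inside the function, in case those numbers become very large. What is the right way to do this in a parallel-friendly Dataflow style?

Here is the current Java code: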

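// Imports assume the Dataflow SDK for Java 1.x, which the calls below come from.
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;

public class SimpleJob {

    static class MyDoFn extends DoFn<String, Integer> {

        @Override
        public void processElement(ProcessContext c) {
            // Each input line has the form "name,count"; split it once.
            String[] fields = c.element().split(",");
            String name = fields[0];
            int val = Integer.parseInt(fields[1].trim());
            for (int i = 0; i < val; i++) { // <- what's the preferred way to do this in DF?
                System.out.println("Processing some function: " + name); // <- do something
            }
            c.output(val);
        }

    }

    public static void main(String[] args) {

        // DEF is the asker's own constants class (not shown here).
        DataflowPipelineOptions options = PipelineOptionsFactory
                .as(DataflowPipelineOptions.class);
        options.setProject(DEF.ID_PROJ);
        options.setStagingLocation(DEF.ID_STG_LOC);
        options.setRunner(DirectPipelineRunner.class);

        Pipeline pipeline = Pipeline.create(options);

        pipeline.apply(TextIO.Read.from("Source.csv"))
                .apply(ParDo.of(new MyDoFn()));

        pipeline.run();
    }
}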

This is what "Source.csv" looks like (each number is how many times I want to run the parallel function on that line):

Joe,3
Mary,4
Peter,2

Answer 1

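Funnily enough, this is one of the motivating use cases for Splittable DoFn! That API is currently under heavy development.

However, until that API is available, you can basically mimic most of what it would do for you: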

class ElementAndRepeats { String element; int numRepeats; }

PCollection<String> lines = p.apply(TextIO.Read....)
PCollection<ElementAndRepeats> elementAndNumRepeats = lines.apply(
    ParDo.of(...parse number of repetitions from the line...));
PCollection<ElementAndRepeats> elementAndNumSubRepeats = elementAndNumRepeats
    .apply(ParDo.of(
        ...split large numbers of repetitions into smaller numbers...))
    .apply(...fusion break...);
elementAndNumSubRepeats.apply(ParDo.of(...execute the repetitions...))

where:

"split large numbers of repetitions" is a DoFn that, for example, splits ElementAndRepeats{"foo", 34} into several ElementAndRepeats elements for "foo" with smaller repeat counts that add up to 34
fusion break - see here - prevents several ParDos from being fused together, which would defeat the parallelization
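
To make the parsing and splitting steps concrete, here is a minimal plain-Java sketch of the logic those two DoFns might contain. It deliberately uses no Dataflow SDK calls. The class name RepeatSplitter, the helper names parse and split, and the maxRepeatsPerPiece parameter are all hypothetical choices made for illustration, and the piece size of 10 is an arbitrary assumption rather than anything the answer prescribes.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of the "parse" and "split" steps; no Dataflow SDK calls.
final class RepeatSplitter {

    // Mirrors the ElementAndRepeats class from the answer, with a constructor added.
    static final class ElementAndRepeats {
        final String element;
        final int numRepeats;

        ElementAndRepeats(String element, int numRepeats) {
            this.element = element;
            this.numRepeats = numRepeats;
        }
    }

    // Parses a line such as "foo,34" into an element and its repeat count.
    static ElementAndRepeats parse(String line) {
        String[] fields = line.split(",");
        return new ElementAndRepeats(fields[0].trim(), Integer.parseInt(fields[1].trim()));
    }

    // Splits one element into pieces of at most maxRepeatsPerPiece repetitions.
    // maxRepeatsPerPiece is a hypothetical tuning knob, not part of the answer.
    static List<ElementAndRepeats> split(ElementAndRepeats input, int maxRepeatsPerPiece) {
        if (maxRepeatsPerPiece <= 0) {
            throw new IllegalArgumentException("maxRepeatsPerPiece must be positive");
        }
        List<ElementAndRepeats> pieces = new ArrayList<>();
        int remaining = input.numRepeats;
        while (remaining > 0) {
            int size = Math.min(remaining, maxRepeatsPerPiece);
            pieces.add(new ElementAndRepeats(input.element, size));
            remaining -= size;
        }
        return pieces;
    }

    public static void main(String[] args) {
        // Example: 34 repetitions of "foo" in pieces of at most 10.
        for (ElementAndRepeats piece : split(parse("foo,34"), 10)) {
            System.out.println(piece.element + " x " + piece.numRepeats);
        }
    }
}
```

In a pipeline, the body of the splitting DoFn would presumably call c.output(piece) for each piece instead of collecting them in a list, and the final ParDo would run its loop over the smaller numRepeats of each piece. The fusion break between the two is still needed so that the pieces can be redistributed across workers.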

The right way to create a parallel for loop in a Google Dataflow pipeline
