Dataflow: How to create a pipeline from an already existing PCollection emitted by another pipeline

Time: 2018-06-08 17:50:24

Tags: google-cloud-dataflow apache-beam-io

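I am trying to split my pipeline into many smaller pipelines so that they execute faster. I am partitioning a PCollection of Google Cloud Storage blobs (PCollection<Blob>) so that I get a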

    PCollectionList<Blob> collectionList
From there, I would like to be able to do something like this:

    Pipeline p2 = Pipeline.create(collectionList.get(0))
        .apply(stuff)
        .apply(stuff);

    Pipeline p3 = Pipeline.create(collectionList.get(1))
        .apply(stuff)
        .apply(stuff);

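But I have not found any documentation on creating an initial PCollection from an already existing PCollection. I would be very grateful if someone could point me in the right direction. Thanks!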

1 Answer:

Answer 0: (score: 0)

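You should look at the Partition transform, which splits a PCollection into N smaller PCollections. You can provide a PartitionFn to define how the split is done. Below is an example from the Beam programming guide:
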
    // Provide an int value with the desired number of result partitions, and a PartitionFn that represents the partitioning function.
    // In this example, we define the PartitionFn in-line.
    // Returns a PCollectionList containing each of the resulting partitions as individual PCollection objects.
    PCollection<Student> students = ...;
    // Split students up into 10 partitions, by percentile:
    PCollectionList<Student> studentsByPercentile =
        students.apply(Partition.of(10, new PartitionFn<Student>() {
            public int partitionFor(Student student, int numPartitions) {
                return student.getPercentile()  // 0..99
                     * numPartitions / 100;
            }
        }));

    // You can extract each partition from the PCollectionList using the get method, as follows:
    PCollection<Student> fortiethPercentile = studentsByPercentile.get(4);
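
To see why `get(4)` picks out the fortieth-percentile bucket, here is a minimal plain-Java sketch of the same index arithmetic, with no Beam dependency. The class name `PartitionIndexSketch` and the helper `partition` are made up for illustration and are not part of Beam; only the expression `percentile * numPartitions / 100` is taken from the example above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PartitionIndexSketch {
    // Same arithmetic as the PartitionFn above: integer division maps a
    // percentile in 0..99 to a partition index in 0..numPartitions-1.
    public static int partitionFor(int percentile, int numPartitions) {
        return percentile * numPartitions / 100;
    }

    // Hypothetical helper: groups percentiles into buckets, mimicking how
    // each element is routed to exactly one partition.
    public static List<List<Integer>> partition(List<Integer> percentiles, int numPartitions) {
        List<List<Integer>> buckets = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            buckets.add(new ArrayList<>());
        }
        for (int p : percentiles) {
            buckets.get(partitionFor(p, numPartitions)).add(p);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<List<Integer>> buckets = partition(Arrays.asList(5, 40, 49, 50, 99), 10);
        // With 10 partitions, index 4 holds percentiles 40 through 49.
        System.out.println(buckets.get(4));
    }
}
```

With 10 partitions, every percentile from 40 to 49 maps to index 4, which is why the guide's example reads that bucket with `studentsByPercentile.get(4)`.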