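I'm trying to split my pipeline into many smaller pipelines so that they execute faster. I'm partitioning a PCollection of Google Cloud Storage blobs (a PCollection<Blob>) so that I end up with a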
PCollectionList<Blob> collectionList
From there, I'd like to be able to do something like this:
Pipeline p2 = Pipeline.create(collectionList.get(0))
    .apply(stuff)
    .apply(stuff);
Pipeline p3 = Pipeline.create(collectionList.get(1))
    .apply(stuff)
    .apply(stuff);
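But I haven't found any documentation on creating an initial PCollection from an already existing PCollection. If anyone could point me in the right direction, I'd be very grateful. Thanks!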
Answer 0 (score: 0)
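You should look at the Partition transform, which splits a PCollection into N smaller PCollections. You supply a PartitionFn to define how the split is done. The following example is from the Beam programming guide: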
// Provide an int value with the desired number of result partitions, and a PartitionFn that represents the partitioning function.
// In this example, we define the PartitionFn in-line.
// Returns a PCollectionList containing each of the resulting partitions as individual PCollection objects.
PCollection<Student> students = ...;
// Split students up into 10 partitions, by percentile:
PCollectionList<Student> studentsByPercentile =
    students.apply(Partition.of(10, new PartitionFn<Student>() {
        public int partitionFor(Student student, int numPartitions) {
            return student.getPercentile()  // 0..99
                 * numPartitions / 100;
        }}));
// You can extract each partition from the PCollectionList using the get method, as follows:
PCollection<Student> fortiethPercentile = studentsByPercentile.get(4);
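
To make the index arithmetic in the PartitionFn above concrete, here is a minimal plain-Java sketch that does not use Beam at all. The class name PartitionSketch and the helper partition are hypothetical and exist only for illustration; plain lists stand in for the PCollectionList, and bare integers stand in for the value returned by student.getPercentile().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: mimics the partitioning arithmetic, not the Beam API.
class PartitionSketch {

    // Same arithmetic as the PartitionFn above: a percentile in 0..99
    // maps to a partition index in 0..numPartitions-1 (integer division).
    static int partitionFor(int percentile, int numPartitions) {
        return percentile * numPartitions / 100;
    }

    // Distributes the percentiles into numPartitions buckets,
    // one list per partition, in the order the values arrive.
    static List<List<Integer>> partition(List<Integer> percentiles, int numPartitions) {
        List<List<Integer>> buckets = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            buckets.add(new ArrayList<>());
        }
        for (int p : percentiles) {
            buckets.get(partitionFor(p, numPartitions)).add(p);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<List<Integer>> buckets =
            partition(Arrays.asList(0, 9, 40, 45, 49, 50, 99), 10);
        System.out.println(buckets.get(4)); // prints [40, 45, 49]
        System.out.println(buckets.get(9)); // prints [99]
    }
}
```

In Beam itself, each PCollection obtained with get(i) still belongs to the pipeline that produced it, so the per-partition .apply(...) chains from the question would be attached to those PCollections inside the same pipeline rather than passed to Pipeline.create.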