Question

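I've just started using Dataflow, and I have a few questions about how to implement branching.

Suppose I have a stream of words and I want to pick out the words starting with each letter. How would I implement that? Should I apply one filter per letter and assign each result to its own PCollection? If so, every filter would read the entire stream, which is wasteful, and I would have to create 26 PCollections to get the words starting with each letter. Is there a better way to do this without iterating over the same data repeatedly?

Also, if I want to apply windowing to the words for a few letters and pass the rest straight through, how should I do that?

Thanks, I appreciate your help.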

Answer 1

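You can use the Partition transform to split the data into multiple sub-PCollections without iterating over the data multiple times. You can then apply other transforms and windowing separately to the different outputs of the partition.

For example: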

PCollection<Student> students = ...;
// Split students up into 10 partitions, by percentile:
PCollectionList<Student> studentsByPercentile =
    students.apply(Partition.of(10, new PartitionFn<Student>() {
        public int partitionFor(Student student, int numPartitions) {
            return student.getPercentile()  // 0..99
                 * numPartitions / 100;
        }}));
for (int i = 0; i < 10; i++) {
  PCollection<Student> partition = studentsByPercentile.get(i);
  ...
}
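
For the words-by-first-letter case in the question, the same idea might look like the sketch below. It is plain Java with no Dataflow dependency, so it only shows the index computation that a PartitionFn<String> could delegate to. The class name LetterPartition, the choice of 27 partitions (26 letters plus one catch-all), and the local split helper are assumptions made for illustration; they are not part of the SDK or of the original answer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LetterPartition {
    // Assumption: 26 letter buckets plus one catch-all bucket for everything else.
    public static final int NUM_PARTITIONS = 27;

    // Returns 0..25 for words starting with an ASCII letter (a/A -> 0, z/Z -> 25)
    // and 26 for null or empty words and for any other first character.
    public static int partitionFor(String word) {
        if (word == null || word.isEmpty()) {
            return NUM_PARTITIONS - 1;
        }
        char first = word.charAt(0);
        if (first >= 'a' && first <= 'z') {
            return first - 'a';
        }
        if (first >= 'A' && first <= 'Z') {
            return first - 'A';
        }
        return NUM_PARTITIONS - 1;
    }

    // Local stand-in for what a partition step does: a single pass over the
    // input, with each word routed to exactly one bucket.
    public static List<List<String>> split(List<String> words) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < NUM_PARTITIONS; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String word : words) {
            buckets.get(partitionFor(word)).add(word);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<List<String>> buckets =
            split(Arrays.asList("apple", "Avocado", "banana", "42"));
        System.out.println("a: " + buckets.get(0));
        System.out.println("b: " + buckets.get(1));
        System.out.println("other: " + buckets.get(26));
    }
}
```

In a pipeline, the body of partitionFor(String word, int numPartitions) inside the PartitionFn passed to Partition.of(27, ...) could return this index. Each letter's PCollection is then available through get(i) on the resulting PCollectionList, so a windowing transform can be applied only to the letters that need it while the others are passed on unchanged.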

Branching data and applying transforms
