Question

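I'm using Google Cloud Dataflow and have a ParDo function that needs to access all elements in a PCollection. To achieve this, I want to convert a PCollection<T> into a PCollection<Iterable<T>> containing a single Iterable of all the elements. I would like to know whether there is a cleaner, simpler, or faster solution than the two I have come up with.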

The first approach is to add a dummy key, perform a GroupByKey, and then take the values.

PCollection<MyType> myData;
// AddDummyKey() outputs KV.of(1, context.element()) for everything
PCollection<KV<Integer, MyType>> myDataKeyed = myData.apply(ParDo.of(new AddDummyKey())); 
// Group by dummy key
PCollection<KV<Integer, Iterable<MyType>>> myDataGrouped = myDataKeyed.apply(GroupByKey.create());
// Extract values
PCollection<Iterable<MyType>> myDataIterable = myDataGrouped.apply(Values.<Iterable<MyType>>create());

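The second approach follows the advice given here: How do I make View's asList() sortable in Google Dataflow SDK? but without the sorting. I create a View.asList(), create a dummy PCollection, then apply a ParDo to the dummy PCollection with the view as a side input, and simply output the view.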

PCollection<MyType> myData;
// Create view of the PCollection as a list
PCollectionView<List<MyType>> myDataView = myData.apply(View.asList()); 
// Create dummy PCollection
PCollection<Integer> dummy = pipeline.apply(Create.<Integer>of(1));
// Apply dummy ParDo that returns the view
PCollection<List<MyType>> myDataList = dummy.apply(
        ParDo.withSideInputs(myDataView).of(new DoFn<Integer, List<MyType>>() {
            @Override
            public void processElement(ProcessContext c) {
                c.output(c.sideInput(myDataView)); 
            }
        }));

It seems like there would be a predefined combine function for this task, but I can't find one. Thanks for your help!

Answer 1

Both of your approaches are reasonable if you know you need the whole collection. Both were used in the Dataflow SDK, and later in the Apache Beam SDK that it became.

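Side input, then output the whole thing: this is actually how DataflowAssert worked. In Beam, different backend runners may implement side inputs in different ways, so you should prefer View.asIterable(): it makes fewer assumptions and may allow more streaming of very large side inputs.
Group by one key, then drop the key: this is how PAssert, its successor in Beam, works. It accomplishes the same thing and requires a bit more care around empty collections, but more Beam runners have better support for GroupByKey than for side inputs (especially runners that are new and still under development).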

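So View.asIterable() is basically what you want. There have also been some requests for a GroupGlobally transform that would do the second version; this may happen at some point.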
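As a rough illustration of the View.asIterable() recommendation, here is a minimal sketch written against the Apache Beam 2.x Java SDK rather than the original Dataflow SDK. The class name GatherAll, the method name gather, the step name "Trigger", and the choice of String as the element type are all assumptions made for the example. It also assumes a runner (for instance the direct runner) is on the classpath.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.IterableCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

// Hypothetical helper class; not part of Beam.
public class GatherAll {

    /** Returns a PCollection with exactly one element: an Iterable over all of input. */
    static PCollection<Iterable<String>> gather(PCollection<String> input) {
        // View the whole collection as an Iterable side input.
        final PCollectionView<Iterable<String>> all =
                input.apply(View.<String>asIterable());

        // One-element dummy collection, so the DoFn below fires once.
        PCollection<Integer> trigger =
                input.getPipeline().apply("Trigger", Create.of(1));

        return trigger
                .apply(ParDo.of(new DoFn<Integer, Iterable<String>>() {
                    @ProcessElement
                    public void processElement(ProcessContext c) {
                        // Emit the side input itself as the single output element.
                        c.output(c.sideInput(all));
                    }
                }).withSideInputs(all))
                // Set the coder explicitly instead of relying on inference.
                .setCoder(IterableCoder.of(StringUtf8Coder.of()));
    }

    public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create();
        PCollection<String> data = pipeline.apply("Data", Create.of("a", "b", "c"));
        gather(data);
        pipeline.run().waitUntilFinish();
    }
}
```

One difference between the two versions shows up on empty input. The side-input version above still emits a single, empty Iterable because the dummy element always triggers the DoFn, whereas a GroupByKey over an empty collection produces no groups and therefore no output element at all.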

Simple way to combine PCollection<T> into PCollection<Iterable<T>>
