Apache Beam - Joining streams by key on two unbounded PCollections

Time: 2017-10-07 08:57:57

Tags: apache google-cloud-dataflow apache-beam

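I have two unbounded (KafkaIO) PCollections to which I apply a tag-based CoGroupByKey with a fixed window of 1 minute. However, when the collections are joined, the tagged data for some test records with the same key seems to be missing most of the time. Please find the code snippet below.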

    KafkaIO.Read<Integer, String> event1 = ... ;


    KafkaIO.Read<Integer, String> event2 = ...;

    PCollection<KV<String,String>> event1Data = p.apply(event1.withoutMetadata())
            .apply(Values.<String>create())
            .apply(MapElements.via(new SimpleFunction<String, KV<String, String>>() {
                @Override public KV<String, String> apply(String input) {
                    log.info("Extracting Data");
                    . . . .//Some processing
                    return KV.of(record.get("myKey"), record.get("myValue"));
                }
            }))
            .apply(Window.<KV<String,String>>into(
                    FixedWindows.of(Duration.standardMinutes(1))));

    PCollection<KV<String,String>> event2Data = p.apply(event2.withoutMetadata())
            .apply(Values.<String>create())
            .apply(MapElements.via(new SimpleFunction<String, KV<String, String>>() {
                @Override public KV<String, String> apply(String input) {
                    log.info("Extracting Data");
                    . . . .//Some processing
                    return KV.of(record.get("myKey"), record.get("myValue"));
                }
            }))
            .apply(Window.<KV<String,String>>into(
                    FixedWindows.of(Duration.standardMinutes(1))));

   final TupleTag<String> event1Tag = new TupleTag<>();
   final TupleTag<String> event2Tag = new TupleTag<>();

   PCollection<KV<String, CoGbkResult>> kvpCollection = KeyedPCollectionTuple
            .of(event1Tag, event1Data)
            .and(event2Tag, event2Data)
            .apply(CoGroupByKey.<String>create());

   PCollection<String> finalResultCollection =
            kvpCollection.apply("Join", ParDo.of(
                    new DoFn<KV<String, CoGbkResult>, String>() {
                        @ProcessElement
                        public void processElement(ProcessContext c) throws IOException {
                            KV<String, CoGbkResult> e = c.element();
                            Iterable<String> event1Values = e.getValue().getAll(event1Tag);
                            Iterable<String> event2Values = e.getValue().getAll(event2Tag);
                            if (event1Values.iterator().hasNext() && event2Values.iterator().hasNext()) {
                               // Process event1 and event2 data and write to c.output
                            }else {
                                System.out.println("Unable to join event1 and event2");
                            }
                        }
                    }));

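With the code above, when I start pumping data with a common key into the two Kafka topics, the join never happens, i.e. the output is always "Unable to join event1 and event2". Please let me know if I am doing something wrong, or if there is a better way to join two unbounded PCollections on a common key.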

2 Answers:

Answer 0: (score: 1)

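I had a similar issue recently. According to the Beam documentation, to use the CoGroupByKey transform on unbounded PCollections (key-value PCollections specifically), all of the PCollections should have the same windowing and trigger strategy. Because you are working with streaming/unbounded collections, you have to use a trigger to fire and emit the window output after a certain interval, according to your triggering strategy. Since you are dealing with streaming data, this trigger should fire continuously, i.e. it should be wrapped in Repeatedly.forever. You also need to apply an accumulating or discarding option to the windowed PCollection to tell Beam what to do after the trigger fires, i.e. whether to accumulate or discard the results of the window pane. Once this windowing, triggering, and accumulation strategy is in place, you can use the CoGroupByKey transform to group multiple unbounded PCollections by a common key.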

Something like this:

// Both inputs must use the same window, trigger, and accumulation mode.
PCollection<KV<String, Employee>> windowedCollection1
                    = collection1.apply(Window.<KV<String, Employee>>into(FixedWindows.of(Duration.standardMinutes(5)))
                    .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
                    .withAllowedLateness(Duration.ZERO).accumulatingFiredPanes());

PCollection<KV<String, Department>> windowedCollection2
                    = collection2.apply(Window.<KV<String, Department>>into(FixedWindows.of(Duration.standardMinutes(5)))
                    .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
                    .withAllowedLateness(Duration.ZERO).accumulatingFiredPanes());
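
The accumulating/discarding choice in the snippet above can be illustrated with a plain-Java model. This is only a sketch and not Beam API: it assumes a single window and a trigger that fires after every element, roughly what AfterPane.elementCountAtLeast(1) does when elements are processed one at a time. The class name PaneModes and the method simulate are hypothetical names used for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PaneModes {
    /**
     * Models one window whose trigger fires after every element.
     * Returns the contents of each emitted pane, in firing order.
     */
    public static List<List<String>> simulate(List<String> arrivals, boolean accumulating) {
        List<List<String>> panes = new ArrayList<>();
        List<String> buffer = new ArrayList<>();
        for (String element : arrivals) {
            buffer.add(element);
            // Trigger fires: emit a copy of whatever the buffer holds right now.
            panes.add(new ArrayList<>(buffer));
            if (!accumulating) {
                // Discarding mode: forget the elements that were just emitted.
                buffer.clear();
            }
        }
        return panes;
    }

    public static void main(String[] args) {
        List<String> arrivals = Arrays.asList("a", "b", "c");
        System.out.println("accumulating: " + simulate(arrivals, true));
        System.out.println("discarding:   " + simulate(arrivals, false));
    }
}
```

In this model, accumulating panes grow until the last pane holds everything seen in the window, so values from both join inputs can end up in the same grouped result even if they arrive at different times. Discarding panes hold only what arrived since the previous firing, so a per-element trigger can emit the two sides of a key in separate panes.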

Then use CoGroupByKey:

final TupleTag<Employee> t1 = new TupleTag<>();
final TupleTag<Department> t2 = new TupleTag<>();

PCollection<KV<String, CoGbkResult>> groupByKeyResult =
                    KeyedPCollectionTuple.of(t1, windowedCollection1)
                            .and(t2, windowedCollection2)
                            .apply("Join Streams", CoGroupByKey.create());

Now you can process the grouped PCollection in a ParDo transform.
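
What that ParDo does with each grouped element is up to you. As a rough plain-Java model of an inner join (not Beam API), the sketch below pairs every value from one side with every value from the other and emits nothing when either side is empty. In a real DoFn the two iterables would come from getAll(t1) and getAll(t2) on the CoGbkResult; the class name JoinSketch, the method innerJoin, and the String value type are assumptions made to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class JoinSketch {
    /**
     * Pairs every left value with every right value for one key.
     * Returns an empty list if either side has no values (inner join).
     */
    public static List<String> innerJoin(String key, Iterable<String> left, Iterable<String> right) {
        List<String> out = new ArrayList<>();
        for (String l : left) {
            for (String r : right) {
                out.add(key + ":" + l + "+" + r);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Both sides present: one output per (left, right) pair.
        System.out.println(innerJoin("k1", Arrays.asList("a", "b"), Arrays.asList("x")));
        // One side missing: nothing is emitted for this key.
        System.out.println(innerJoin("k2", Arrays.asList("a"), Collections.<String>emptyList()));
    }
}
```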

Hope this helps!

Answer 1: (score: 0)

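I think I more or less figured out the problem. The default trigger was firing at the CoGroupByKey for the two unbounded sources, so when a new event arrived at the two sources, it tried to apply the join operation immediately, because no data-driven trigger was configured for my stream join pipeline. I configured the required triggering(), discardingFiredPanes(), and withAllowedLateness() properties on my Window function, which solved my stream join use case.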
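
The answer does not show the configuration it ended up with, so the following is only a sketch of what such a Window setup might look like. It assumes the Beam Java SDK and Joda-Time are on the classpath. The helper name joinWindow, the processing-time trigger, and the 10-second delay are assumptions chosen for illustration; the 1-minute fixed window comes from the question.

```java
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;

public class JoinWindowing {
    // Hypothetical helper: one windowing strategy to apply to both join inputs,
    // so that CoGroupByKey sees identical window and trigger settings.
    public static Window<KV<String, String>> joinWindow() {
        return Window.<KV<String, String>>into(FixedWindows.of(Duration.standardMinutes(1)))
                // Assumed trigger: fire repeatedly, 10 seconds after the first
                // element of each pane arrives (processing time).
                .triggering(Repeatedly.forever(
                        AfterProcessingTime.pastFirstElementInPane()
                                .plusDelayOf(Duration.standardSeconds(10))))
                .withAllowedLateness(Duration.ZERO)
                // Each pane carries only the elements that arrived since the last firing.
                .discardingFiredPanes();
    }
}
```

With a helper like this, both inputs from the question would be windowed the same way, for example event1Data.apply(JoinWindowing.joinWindow()) and event2Data.apply(JoinWindowing.joinWindow()), in place of the plain Window.into(...) calls.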