Question

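What I am trying to do:

1. Consume JSON messages from a PubSub subscription using an Apache Beam streaming pipeline and the Dataflow runner.
2. Unmarshal the payload strings into objects.
- Assume 'messageId' is the unique ID of an incoming message, e.g. msgid1, msgid2, etc.
3. Retrieve child records from a database for each object resulting from #2. The same child can apply to multiple messages.
- Assume 'childId' is the unique ID of a child record, e.g. cid1234, cid1235, etc.
4. Group the child records by their unique ID, as in the example below.
- KV.of(cid1234, Map.of(msgid1, msgid2)) and KV.of(cid1235, Map.of(msgid1, msgid2))
5. Write the grouped result at childId level to the database.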
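
For reference, the grouping in step 4 amounts to inverting a message-to-children mapping into a child-to-messages mapping. Below is a minimal plain-Java sketch of that shape, outside Beam. `GroupByChild` and `groupByChildId` are hypothetical names chosen for illustration; in the pipeline itself this role is played by `GroupByKey`, applied per window.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class GroupByChild {
    // Inverts (messageId -> childIds) into (childId -> messageIds).
    // Sorted collections are used only to make the output order predictable.
    public static Map<String, Set<String>> groupByChildId(
            Map<String, List<String>> childIdsByMessageId) {
        Map<String, Set<String>> messageIdsByChildId = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : childIdsByMessageId.entrySet()) {
            for (String childId : entry.getValue()) {
                messageIdsByChildId
                    .computeIfAbsent(childId, k -> new TreeSet<>())
                    .add(entry.getKey());
            }
        }
        return messageIdsByChildId;
    }

    public static void main(String[] args) {
        Map<String, List<String>> input = new LinkedHashMap<>();
        input.put("msgid1", Arrays.asList("cid1234", "cid1235"));
        input.put("msgid2", Arrays.asList("cid1234", "cid1235"));
        // prints {cid1234=[msgid1, msgid2], cid1235=[msgid1, msgid2]}
        System.out.println(groupByChildId(input));
    }
}
```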

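Questions:

1. Where should windowing be introduced? We currently apply 30-minute fixed windows after step 1.
2. How does Beam define the start and end time of a 30-minute window? Is it measured from when the pipeline starts, or from the first message of the batch?
3. What if steps 2 to 5 take more than 1 hour for one window and the next window's batch is already ready? Would the two window batches be processed in parallel?
4. How can we make the next window's messages wait until the previous window's batch has completed?
- If we don't do this, the next batch will overwrite the childId-level results.

Code snippet: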

         PCollection<PubsubMessage> messages = pipeline.apply("ReadPubSubSubscription",
             PubsubIO.readMessagesWithAttributes()
                 .fromSubscription("projects/project1/subscriptions/subscription1"));

         PCollection<PubsubMessage> windowedMessages = messages.apply(Window.into(FixedWindows
             .of(Duration.standardMinutes(30))));
             
         PCollectionTuple unmarshalResultTuple = windowedMessages.apply("UnmarshalJsonStrings",
             ParDo.of(new JsonUnmarshallFn())
                 .withOutputTags(JsonUnmarshallFn.mainOutputTag,
                     TupleTagList.of(JsonUnmarshallFn.deadLetterTag)));

         PCollectionTuple childRecordsTuple = unmarshalResultTuple
             .get(JsonUnmarshallFn.mainOutputTag)
             .apply("FetchChildsFromDBAndProcess",
                 ParDo.of(new ChildsReadFn() )
                     .withOutputTags(ChildsReadFn.mainOutputTag,
                         TupleTagList.of(ChildsReadFn.deadLetterTag)));

         // input is KV of (childId, msgids), output is mutations to write to BT
         PCollectionTuple postProcessTuple = childRecordsTuple
             .get(ChildsReadFn.mainOutputTag)
             .apply(GroupByKey.create())
             .apply("UpdateChildAssociations",
                 ParDo.of(new ChildsProcessorFn())
                     .withOutputTags(ChildsProcessorFn.mutations,
                         TupleTagList.of(ChildsProcessorFn.deadLetterTag)));

         // CloudBigtableIO.write(...) is a transform, so it has to be applied to the PCollection.
         postProcessTuple.get(ChildsProcessorFn.mutations).apply(CloudBigtableIO.write(...));

Answer 1

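Answering each question in turn.

Regarding questions 1 and 2: when you use windowing in Apache Beam, you need to understand that the windows exist before the job does. By that I mean the windows are aligned to the UNIX epoch (timestamp = 0), not to the pipeline start or the first message. In other words, your data is assigned to fixed time ranges, for example with fixed 60-second windows: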

  PCollection<String> items = ...;
  PCollection<String> fixedWindowedItems = items.apply(
      Window.<String>into(FixedWindows.of(Duration.standardSeconds(60))));

First window: [0s, 60s); second window: [60s, 120s); and so on. See documents 1, 2 and 3.
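
To make the epoch alignment concrete, here is a minimal plain-Java sketch (no Beam dependency) of how the fixed window containing an element can be derived from the element's timestamp, assuming a window offset of zero. The names `FixedWindowBounds`, `windowStart` and `windowEnd` are made up for illustration and are not Beam APIs.

```java
import java.time.Instant;

public class FixedWindowBounds {
    // Inclusive start of the fixed window containing the timestamp (millis since the epoch).
    public static long windowStart(long timestampMillis, long sizeMillis) {
        // floorMod keeps the result correct for timestamps before the epoch as well.
        return timestampMillis - Math.floorMod(timestampMillis, sizeMillis);
    }

    // Exclusive end of the same window.
    public static long windowEnd(long timestampMillis, long sizeMillis) {
        return windowStart(timestampMillis, sizeMillis) + sizeMillis;
    }

    public static void main(String[] args) {
        long thirtyMinutes = 30L * 60 * 1000;
        long ts = Instant.parse("2020-01-01T10:47:12Z").toEpochMilli();
        // prints [2020-01-01T10:30:00Z, 2020-01-01T11:00:00Z)
        System.out.println("[" + Instant.ofEpochMilli(windowStart(ts, thirtyMinutes))
            + ", " + Instant.ofEpochMilli(windowEnd(ts, thirtyMinutes)) + ")");
    }
}
```

So with 30-minute windows, a message timestamped 10:47:12 UTC falls into the [10:30, 11:00) window no matter when the pipeline was started or when the first message arrived.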

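Regarding question 3: by default, windowing and triggering in Apache Beam ignore late data. You can, however, configure how late data is handled with withAllowedLateness. To do that, you first need to understand the concept of watermarks. The watermark is the system's measure of how far behind the data may be, that is, its estimate of the point in event time before which all data should have arrived. Example: with an allowed lateness of 3 seconds, data that shows up as much as 3 seconds after the watermark has passed the end of its window is still assigned to the right window. If the data arrives later than that, it is ignored; within the allowed lateness you can use Triggers to define when it is reprocessed.

With allowed lateness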

  PCollection<String> items = ...;
  PCollection<String> fixedWindowedItems = items.apply(
      Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
          .withAllowedLateness(Duration.standardDays(2)));

Note that this sets a period of time (two days here) during which late data is still accepted.
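
The relationship between the watermark, the end of a window and the allowed lateness can be modelled in a few lines of plain Java. This is a simplified sketch under my own assumptions, not Beam code: `LatenessCheck`, `Outcome` and `classify` are hypothetical names, and Beam's exact boundary handling at the window edge is ignored.

```java
public class LatenessCheck {
    public enum Outcome { ON_TIME, LATE_BUT_ACCEPTED, DROPPED }

    // windowEndMillis: exclusive end of the window the element belongs to.
    // watermarkMillis: position of the watermark when the element arrives.
    public static Outcome classify(long windowEndMillis, long watermarkMillis,
                                   long allowedLatenessMillis) {
        if (watermarkMillis < windowEndMillis) {
            return Outcome.ON_TIME; // the window is still open
        }
        if (watermarkMillis < windowEndMillis + allowedLatenessMillis) {
            return Outcome.LATE_BUT_ACCEPTED; // late, but inside the allowed lateness
        }
        return Outcome.DROPPED; // the window has expired
    }

    public static void main(String[] args) {
        long windowEnd = 60_000L;       // one-minute window [0s, 60s)
        long allowedLateness = 3_000L;  // 3 seconds
        System.out.println(classify(windowEnd, 59_000L, allowedLateness)); // prints ON_TIME
        System.out.println(classify(windowEnd, 62_000L, allowedLateness)); // prints LATE_BUT_ACCEPTED
        System.out.println(classify(windowEnd, 63_000L, allowedLateness)); // prints DROPPED
    }
}
```

With the default allowed lateness of zero, the middle case never occurs, which is why late data is ignored unless withAllowedLateness is set.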

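With triggers

  PCollection<String> pc = ...;
  pc.apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
      .triggering(AfterProcessingTime.pastFirstElementInPane()
          .plusDelayOf(Duration.standardMinutes(1)))
      .withAllowedLateness(Duration.standardMinutes(30))
      // Beam requires an accumulation mode once a trigger is set; accumulating is assumed here.
      .accumulatingFiredPanes());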

Note that when late data arrives, the window is reprocessed and recomputed. The trigger gives you the opportunity to react to late data.
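
What a recomputed window looks like downstream depends on the accumulation mode set on the window (`accumulatingFiredPanes()` or `discardingFiredPanes()` in Beam). The plain-Java sketch below only models the difference between the two modes for a single window; `PaneModes` and `emit` are hypothetical names, not Beam APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PaneModes {
    // arrivals: for one window, the elements that arrived between consecutive trigger firings.
    // Returns what each firing (pane) would contain under the chosen mode.
    public static List<List<String>> emit(List<List<String>> arrivals, boolean accumulating) {
        List<List<String>> panes = new ArrayList<>();
        List<String> seenSoFar = new ArrayList<>();
        for (List<String> batch : arrivals) {
            seenSoFar.addAll(batch);
            // Accumulating: everything seen so far. Discarding: only the new elements.
            panes.add(new ArrayList<>(accumulating ? seenSoFar : batch));
        }
        return panes;
    }

    public static void main(String[] args) {
        List<List<String>> arrivals = Arrays.asList(
            Arrays.asList("msgid1", "msgid2"), // on-time data
            Arrays.asList("msgid3"));          // late data, second firing
        System.out.println(emit(arrivals, true));  // prints [[msgid1, msgid2], [msgid1, msgid2, msgid3]]
        System.out.println(emit(arrivals, false)); // prints [[msgid1, msgid2], [msgid3]]
    }
}
```

This only concerns repeated firings of the same window; results from different windows remain independent of each other.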

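Finally, regarding question 4, it is partly explained by the concepts above. The computation happens within each fixed window and is recomputed/reprocessed every time the trigger fires. This logic guarantees that your data ends up in the right window.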

Apache Beam Streaming pipeline with sequential batches
