是否可以使用两者中的密钥加入两个单独的PubSubIo无界PCollections?我试着通过以下方式完成任务:
Read(FistStream)& Read(SecondStream) - >展平 - >生成用于加入的密钥 - >使用会话窗口将它们收集在一起 - >按键分组,然后使用固定大小的窗口重新关闭 - >使用窗口将AvroIOWrite转换为磁盘。
修改
这是我创建的管道代码。我遇到两个问题:
我需要每秒处理约10 000次会话。每个会话包含1或2个事件,然后需要关闭。
PubsubIO.Read<String> auctionFinishedReader = PubsubIO.readStrings().withTimestampAttribute(TIMESTAMP_ATTRIBUTE)
.fromTopic("projects/authentic-genre-152513/topics/auction_finished");
PubsubIO.Read<String> auctionAcceptedReader = PubsubIO.readStrings().withTimestampAttribute(TIMESTAMP_ATTRIBUTE)
.fromTopic("projects/authentic-genre-152513/topics/auction_accepted");
PCollection<String> auctionFinishedStream = p.apply("ReadAuctionFinished", auctionFinishedReader);
PCollection<String> auctionAcceptedStream = p.apply("ReadAuctionAccepted", auctionAcceptedReader);
PCollection<String> combinedEvents = PCollectionList.of(auctionFinishedStream)
.and(auctionAcceptedStream).apply(Flatten.pCollections());
PCollection<KV<String, String>> keyedAuctionFinishedStream = combinedEvents
.apply("AddKeysToAuctionFinished", WithKeys.of(new GenerateKeyForEvent()));
PCollection<KV<String, Iterable<String>>> sessions = keyedAuctionFinishedStream
.apply(Window.<KV<String, String>>into(Sessions.withGapDuration(Duration.standardMinutes(1)))
.withTimestampCombiner(TimestampCombiner.END_OF_WINDOW))
.apply(GroupByKey.create());
PCollection<SodaSession> values = sessions
.apply(ParDo.of(new DoFn<KV<String, Iterable<String>>, SodaSession> () {
@ProcessElement
public void processElement(ProcessContext c, BoundedWindow window) {
c.output(new SodaSession("auctionid", "stattedat"));
}
}));
PCollection<SodaSession> windowedEventStream = values
.apply("ApplyWindowing", Window.<SodaSession>into(FixedWindows.of(Duration.standardMinutes(2)))
.triggering(Repeatedly.forever(
AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(Duration.standardMinutes(1))
))
.withAllowedLateness(Duration.ZERO)
.discardingFiredPanes()
);
AvroIO.Write<SodaSession> avroWriter = AvroIO
.write(SodaSession.class)
.to("gs://storage/")
.withWindowedWrites()
.withFilenamePolicy(new EventsToGCS.PerWindowFiles("sessionsoda"))
.withNumShards(3);
windowedEventStream.apply("WriteToDisk", avroWriter);
答案 0 :(得分:3)
我找到了一个有效的解决方案。由于我的一个收藏品的尺寸与另一个相比不成比例,因此我使用侧面输入来加速分组操作。以下是我的解决方案概述:
当使用会话窗口和GroupBy时,它处理在1-2个DataFlow工作者上加入4000个事件/秒流(较大的一个)+ 60个事件/秒流,而不是约15个工作者。