Question

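I would like to read messages from 2 PubSub topics and associate the second message with the first. For example, we read user scores from the PubSub topic "scores" and the winner from another PubSub topic "results". Note that when reading from PubSub I use .withTimestampAttribute("timestamp") so that processing is based on event time; if a score message is associated with a result message, the event time of both messages will be set the same at the time of publishing to PubSub. The streaming pipeline should be able to output, for a set of users, who won and who lost. The output is written to Google Cloud Storage every minute.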

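Below is the code snippet with which I tried to do this, but the problem is that a few edge cases do not work when correlating the two messages. For example, I know for a fact that the result saying who the winner is only becomes available after the scores for that event have happened, so I want to allow it an extra 5 minutes to arrive. But when I apply CoGroupByKey, neither message waits for the other.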

// Read from PubSub
PCollection<String> scoresInput = pipeline.apply(PubsubIO.readStrings()
        .withTimestampAttribute("timestamp")
        .fromTopic("projects/test/topics/scores"));
PCollection<String> winInput = pipeline.apply(PubsubIO.readStrings()
        .withTimestampAttribute("timestamp")
        .fromTopic("projects/test/topics/results"));

// Apply some transformation on the scores message
PCollection<KV<String, User>> scoreToTrx = scoresInput
        .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
            .triggering(Repeatedly.forever(AfterWatermark.pastEndOfWindow()))
            .discardingFiredPanes()
            .withAllowedLateness(Duration.ZERO))
        .apply(ParDo.of(new ExtractScoresToTransactionIdFn()));

// Group by all common trx
PCollection<KV<String, Iterable<User>>> scoresToTrxGrouped =
        scoreToTrx.apply(GroupByKey.<String, User>create());

// Create one object with array of users for transaction.
// Users(transactionid, array of all user names, array of all scores,
// array of win info (defaulted to false))
PCollection<KV<String, Users>> users = scoresToTrxGrouped
        .apply(ParDo.of(new ProcessAllUsersAndScoresToTransactionIdFn()));

// Apply some transformation on the results message
PCollection<KV<String, Winner>> winToTrx = winInput
        .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
            .triggering(Repeatedly.forever(AfterWatermark.pastEndOfWindow()))
            .discardingFiredPanes()
            .withAllowedLateness(Duration.standardMinutes(5)))
        .apply(ParDo.of(new ExtractWinToTransactionIdFn()));

// Now associate the user score to winner.
final TupleTag<Users> scoreTag = new TupleTag<Users>();
final TupleTag<Winner> winTag = new TupleTag<Winner>();

PCollection<KV<String, CoGbkResult>> trxIdToUserAndWinnerCoGbkResult =
        KeyedPCollectionTuple.of(scoreTag, users)
            .and(winTag, winToTrx)
            .apply(CoGroupByKey.<String>create());

// The DoFn here will check if the winner is present in the users list and
// then update the wininfo of that user to winner.
PCollection<String> joinUserToWinner = trxIdToUserAndWinnerCoGbkResult
        .apply(ParDo.of(new MapWinnerToUserForTransactionFn(scoreTag, winTag)));

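For example, say I publish the following score message at 4.12.04 PM on 12 Oct 2017:

{ "transactionId": "1234", "userName": "Amy", "score": "10" }

and then another one at 4.12.23 PM on 12 Oct 2017:

{ "transactionId": "1234", "userName": "Becca", "score": "7" }

Finally, the winner is published at 4.15.20 PM on 12 Oct 2017:

{ "transactionId": "1234", "winner": "Amy" }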

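Since it is a 1-minute fixed window, the window for this case is [4.12.00, 4.13.00]. Because the winner message does not fall in this window, it is not considered, and the output has no winner:

{ "transactionId": "1234", "users": ["Amy", "Becca"], "scores": ["10", "7"], "winInfo": ["loser", "loser"] }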
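
To make the window arithmetic concrete, here is a minimal plain-Java sketch (not Beam code) of how a timestamp maps to a 1-minute fixed window. The class name FixedWindowDemo and the helper windowStart are hypothetical, and the sketch assumes the times above are UTC and that windows are aligned to the epoch with no offset.

```java
import java.time.Duration;
import java.time.Instant;

class FixedWindowDemo {
    // Hypothetical helper: start of the epoch-aligned fixed window containing ts.
    static Instant windowStart(Instant ts, Duration size) {
        long sizeMs = size.toMillis();
        long tsMs = ts.toEpochMilli();
        return Instant.ofEpochMilli(tsMs - Math.floorMod(tsMs, sizeMs));
    }

    public static void main(String[] args) {
        Duration oneMinute = Duration.ofMinutes(1);
        // Assumption: the example times are interpreted as UTC.
        Instant scoreTime = Instant.parse("2017-10-12T16:12:04Z");
        Instant resultTime = Instant.parse("2017-10-12T16:15:20Z");

        Instant scoreWindow = windowStart(scoreTime, oneMinute);
        Instant resultWindow = windowStart(resultTime, oneMinute);

        System.out.println("score window starts at  " + scoreWindow);
        System.out.println("result window starts at " + resultWindow);
        System.out.println("same window: " + scoreWindow.equals(resultWindow));
    }
}
```

The score and the result land in different windows, and as far as I understand CoGroupByKey only groups elements that share both key and window, which would explain why the winner is never seen next to the scores.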

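Another case that can happen is that the result message reaches the pipeline before the scores do, because the score messages may still be waiting to be triggered after the GroupByKey. In that case the result message is dropped because it cannot be associated with any score, and the output is the same as above, with no winner, even though there is one.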

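Note that it is fine if some data arrives too late; I do not want to wait for it. But even the simple 1:1 association case, with only 1 user score and 1 winner, is not honoured.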

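If I were reading from a single input source, I know I could tune my triggers to get what I want, but in my case the data comes from two different sources that depend on each other to produce the correct result.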

My question is: how can we honour a situation like this, where triggering for one input depends on the data of another input?
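
To spell out the behaviour I am after, here is a small plain-Java simulation for a single transaction. It is not Beam code and not a proposed fix, only a sketch of the desired semantics. The class GracePeriodJoinDemo, the helper winInfo, and the "winner" label are hypothetical, and the times are again assumed to be UTC. A result counts only if its event time is no later than the end of the score window plus a grace period.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GracePeriodJoinDemo {
    // Hypothetical helper: label each user of one transaction as "winner" or "loser".
    // The result is used only if it is not later than scoreWindowEnd + grace.
    static Map<String, String> winInfo(List<String> users, Instant scoreWindowEnd,
                                       String winner, Instant resultTime, Duration grace) {
        boolean resultInTime = winner != null && resultTime != null
                && !resultTime.isAfter(scoreWindowEnd.plus(grace));
        Map<String, String> info = new LinkedHashMap<>();
        for (String user : users) {
            info.put(user, resultInTime && user.equals(winner) ? "winner" : "loser");
        }
        return info;
    }

    public static void main(String[] args) {
        List<String> users = Arrays.asList("Amy", "Becca");
        // Assumption: the score window [4.12.00, 4.13.00] interpreted as UTC.
        Instant scoreWindowEnd = Instant.parse("2017-10-12T16:13:00Z");
        Instant resultTime = Instant.parse("2017-10-12T16:15:20Z");

        // No grace period: the result is ignored, as in the output I get today.
        System.out.println(winInfo(users, scoreWindowEnd, "Amy", resultTime, Duration.ZERO));
        // 5-minute grace period: the result is still accepted for this transaction.
        System.out.println(winInfo(users, scoreWindowEnd, "Amy", resultTime, Duration.ofMinutes(5)));
    }
}
```

With no grace period both users come out as losers, which matches what the pipeline produces now. With the 5-minute grace period Amy is marked as the winner, which is the output I would like the pipeline to produce.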

Apache Beam reading from 2 input sources, unable to join the data correctly in some cases

0 Answers: