The blog post "Introducing Stream-Stream Joins in Apache Spark 2.3" discusses joining clicks with impressions based on their adId:
from pyspark.sql.functions import expr

# impressions and clicks are the streaming DataFrames from the blog post

# Define watermarks
impressionsWithWatermark = impressions \
  .selectExpr("adId AS impressionAdId", "impressionTime") \
  .withWatermark("impressionTime", "10 seconds")  # max 10 seconds late
clicksWithWatermark = clicks \
  .selectExpr("adId AS clickAdId", "clickTime") \
  .withWatermark("clickTime", "20 seconds")  # max 20 seconds late

# Inner join with time range conditions
impressionsWithWatermark.join(
  clicksWithWatermark,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 1 minutes
    """
  )
)
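I would like to know whether the resulting stream can be filtered so that each "query interval" contains only the row with the latest clickTime.
The query interval is the interval given in the join condition of the query:
  clickTime >= impressionTime AND
  clickTime <= impressionTime + interval 1 minutes
So I might receive the following sequence of events: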
{type:impression, impressionAdId:1, timestamp: 1}
{type:click, clickAdId:1, timestamp: 1}
{type:click, clickAdId:1, timestamp: 15}
Then, after roughly t = 60s, Spark emits the following row in the DataFrame:
{impressionTimestamp: 1, clickTimestamp: 15, clickAdId: 1, impressionAdId: 1}
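To make the expected result concrete, here is a minimal plain-Python sketch of the semantics I am after. It is not Structured Streaming code and ignores watermarks entirely; the function name `latest_click_per_impression`, the tuple layout, and the use of integer seconds for timestamps are made up for illustration.

```python
def latest_click_per_impression(impressions, clicks, interval_s=60):
    """Keep, per impression, only the matching click with the latest timestamp.

    impressions: list of (ad_id, impression_ts) tuples, timestamps in seconds
    clicks:      list of (ad_id, click_ts) tuples, timestamps in seconds
    interval_s:  length of the query interval (1 minute in the join above)
    """
    rows = []
    for ad_id, imp_ts in impressions:
        # Clicks that satisfy the join condition for this impression:
        # same adId and impressionTime <= clickTime <= impressionTime + interval
        matching = [
            click_ts
            for click_ad_id, click_ts in clicks
            if click_ad_id == ad_id and imp_ts <= click_ts <= imp_ts + interval_s
        ]
        if matching:
            rows.append({
                "impressionAdId": ad_id,
                "clickAdId": ad_id,
                "impressionTimestamp": imp_ts,
                "clickTimestamp": max(matching),  # latest click in the interval
            })
    return rows


# The event sequence from above: one impression at t=1, clicks at t=1 and t=15
impressions = [(1, 1)]
clicks = [(1, 1), (1, 15)]
result = latest_click_per_impression(impressions, clicks)
print(result)
```

For the sequence above this yields a single row with clickTimestamp 15; the click at t=1 is dropped because a later click falls into the same interval.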
I have only posted Python code because that is what the article uses; answers in Java or Scala are welcome as well.