(Databricks ad monetization example) How do I find the latest match in a stream?

Date: 2018-10-25 12:45:20

Tags: apache-spark spark-structured-streaming

The blog post "Introducing Stream-Stream Joins in Apache Spark 2.3" discusses joining clicks with impressions based on their adId:

from pyspark.sql.functions import expr

# `impressions` and `clicks` are streaming DataFrames defined elsewhere

# Define watermarks
impressionsWithWatermark = impressions \
  .selectExpr("adId AS impressionAdId", "impressionTime") \
  .withWatermark("impressionTime", "10 seconds")   # max 10 seconds late

clicksWithWatermark = clicks \
  .selectExpr("adId AS clickAdId", "clickTime") \
  .withWatermark("clickTime", "20 seconds")        # max 20 seconds late

# Inner join with time range conditions
impressionsWithWatermark.join(
  clicksWithWatermark,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 1 minutes
    """
  )
)

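I would like to know whether the resulting stream can be filtered so that each "query interval" contains only the row with the latest clickTime.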

The query interval is the interval given in the join condition of the query:

clickTime >= impressionTime AND 
clickTime <= impressionTime + interval 1 minutes
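
To make the interval concrete, here is a minimal plain-Python sketch of the same predicate, evaluated outside Spark. The function name `in_query_interval` and the sample timestamp are hypothetical and chosen only for illustration; both bounds are treated as inclusive, as in the condition above.

```python
from datetime import datetime, timedelta

def in_query_interval(impression_time, click_time, width=timedelta(minutes=1)):
    """Return True when click_time lies in [impression_time, impression_time + width]."""
    return impression_time <= click_time <= impression_time + width

# Hypothetical sample impression time, for illustration only.
impression = datetime(2018, 10, 25, 12, 0, 1)

inside = in_query_interval(impression, impression + timedelta(seconds=14))
outside = in_query_interval(impression, impression + timedelta(seconds=61))
print(inside, outside)
```

A click that arrives before the impression, or more than one minute after it, falls outside the interval and would not be joined.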

So I might receive the following sequence of events:

{type:impression, impressionAdId:1, timestamp: 1}
{type:click, clickAdId:1, timestamp: 1}
{type:click, clickAdId:1, timestamp: 15}

After roughly t = 60s, Spark emits the following row in the DataFrame:

{impressionTimestamp: 1, clickTimestamp: 15, clickAdId: 1, impressionAdId: 1}
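
The intended semantics can be sketched in plain Python on a finite list of events. This is only a batch illustration of the result I am after, not a Structured Streaming solution: it ignores watermarks and state cleanup, treats timestamps as plain seconds, and the function name `latest_click_per_impression` is hypothetical.

```python
def latest_click_per_impression(events, interval_seconds=60):
    """For each impression, keep only the matching click with the largest timestamp."""
    impressions = [e for e in events if e["type"] == "impression"]
    clicks = [e for e in events if e["type"] == "click"]
    rows = []
    for imp in impressions:
        start = imp["timestamp"]
        end = start + interval_seconds
        # Same predicate as the join: equal adId and click inside [start, end].
        matching = [
            c["timestamp"]
            for c in clicks
            if c["clickAdId"] == imp["impressionAdId"]
            and start <= c["timestamp"] <= end
        ]
        if matching:  # inner-join semantics: impressions without clicks are dropped
            rows.append({
                "impressionTimestamp": start,
                "clickTimestamp": max(matching),
                "clickAdId": imp["impressionAdId"],
                "impressionAdId": imp["impressionAdId"],
            })
    return rows

# The event sequence from the question.
events = [
    {"type": "impression", "impressionAdId": 1, "timestamp": 1},
    {"type": "click", "clickAdId": 1, "timestamp": 1},
    {"type": "click", "clickAdId": 1, "timestamp": 15},
]
result = latest_click_per_impression(events)
print(result)
```

For the three events above this yields a single row whose click timestamp is 15; the earlier click at timestamp 1 is discarded.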

I have only posted Python code because that is what the article uses; answers in Java or Scala are welcome too.

0 answers:

No answers yet