I'm trying to do a rank-and-join within a 10-day window, using Scala Spark or SQL.
I have one table with users' tries and another with contracts. They share an ID I can join on, but on top of that ID I need to take a specific time range into account. To simplify the example, assume all my entries have the same ID:
Tries:
try, try_day
Try 1, 2018-08-01
Try 2, 2018-09-01
Try 3, 2018-10-01
Try 4, 2018-10-02
Contracts:
contract, contract_day
Contract 1, 2018-08-01
Contract 2, 2018-09-02
Contract 3, 2018-10-01
I only want to join them if 1) consecutive tries are more than 10 days apart, and 2) the dates in the two tables differ by at most 2 days. So in the end I get:
try, try_day, contract, contract_day, (explanation)
Try 1, 2018-08-01, Contract 1, 2018-08-01 , (same date and more than 10 days between try 1 and try 2)
Try 2, 2018-09-01, Contract 2, 2018-09-02, (difference of less than 2 days, and more than 10 days between try 2 and try 3)
Try 3, 2018-10-01, null, null, (less than 10 days between try 3 and try 4, so the contract should match try 4 only)
Try 4, 2018-10-02, Contract 3, 2018-10-01, (difference of 1 day, and try 4 is the last try)
I think I want to rank the tries by date and then join only on the top-ranked one. However, the ranking needs to apply only within a 10-day timeframe:
SELECT *, dense_rank() OVER (PARTITION BY id ORDER BY try_day DESC) as rank
FROM tries
Unfortunately, this ranks them from 1 to 4, whereas the ranking I want is:
try, try_day, rank
Try 1, 2018-08-01, 1
Try 2, 2018-09-01, 1
Try 3, 2018-10-01, 2
Try 4, 2018-10-02, 1
Then I would join on the tries with rank 1 whose dates fall within 2 days of a contract.
Better ideas on how to achieve this are welcome too. Thanks!
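To make the idea concrete, here is a rough sketch of the ranking I have in mind, as a gaps-and-islands pattern with the DataFrame API (untested; tries, id, and try_day stand in for my real table and columns, and try_day is assumed to be a date):

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._

// Flag a new group whenever the gap to the previous try exceeds 10 days,
// turn the flags into a group id with a running sum, then rank each group
// by date descending so the latest try in a 10-day cluster gets rank 1.
val byDay = Window.partitionBy($"id").orderBy($"try_day")
val ranked = tries.
  withColumn("gap", datediff($"try_day", lag($"try_day", 1).over(byDay))).
  withColumn("new_grp", when($"gap".isNull || $"gap" > 10, 1).otherwise(0)).
  withColumn("grp", sum($"new_grp").over(byDay)).
  withColumn("rank", dense_rank().over(
    Window.partitionBy($"id", $"grp").orderBy($"try_day".desc)))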
Answer 0 (score: 1)
Here's one approach: use unix_timestamp and the window function lead to compute rank from the gap between consecutive try_days, then left-join the two DataFrames with a condition on try_day and contract_day:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._
val dfTries = Seq(
  ("Try 1", "2018-08-01"),
  ("Try 2", "2018-09-01"),
  ("Try 3", "2018-10-01"),
  ("Try 4", "2018-10-02")
).toDF("try", "try_day")
val dfContracts = Seq(
  ("contract 1", "2018-08-01"),
  ("contract 2", "2018-09-02"),
  ("contract 3", "2018-10-01")
).toDF("contract", "contract_day")
dfTries.
  // convert try_day to epoch seconds for interval arithmetic
  withColumn("try_ts", unix_timestamp($"try_day", "yyyy-MM-dd")).
  // lead(1) fetches the NEXT try's timestamp (null for the last try)
  withColumn("next_try_ts", lead($"try_ts", 1).over(Window.orderBy($"try"))).
  // rank 1 if there is no next try or it is more than 10 days away, else 2
  withColumn("rank", when(
    $"next_try_ts".isNull || abs($"try_ts" - $"next_try_ts") > 10 * 24 * 3600,
    1
  ).otherwise(2)
  ).
  // only rank-1 tries within 2 days of a contract get a match; the rest get nulls
  join(
    dfContracts,
    $"rank" === 1 && abs($"try_ts" - unix_timestamp($"contract_day", "yyyy-MM-dd")) <= 2 * 24 * 3600,
    "left_outer").
  show
// +-----+----------+----------+-----------+----+----------+------------+
// | try|   try_day|    try_ts|next_try_ts|rank|  contract|contract_day|
// +-----+----------+----------+-----------+----+----------+------------+
// |Try 1|2018-08-01|1533106800| 1535785200| 1|contract 1| 2018-08-01|
// |Try 2|2018-09-01|1535785200| 1538377200| 1|contract 2| 2018-09-02|
// |Try 3|2018-10-01|1538377200| 1538463600| 2| null| null|
// |Try 4|2018-10-02|1538463600| null| 1|contract 3| 2018-10-01|
// +-----+----------+----------+-----------+----+----------+------------+
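A variant of the same logic (a sketch, not part of the answer above) compares dates with datediff instead of epoch seconds, which avoids the time-zone dependence of unix_timestamp (requires Spark 2.2+ for the two-argument to_date):

// Same ranking and join, using day arithmetic rather than seconds.
dfTries.
  withColumn("try_dt", to_date($"try_day", "yyyy-MM-dd")).
  withColumn("next_try_dt", lead($"try_dt", 1).over(Window.orderBy($"try_dt"))).
  withColumn("rank", when(
    $"next_try_dt".isNull || datediff($"next_try_dt", $"try_dt") > 10, 1
  ).otherwise(2)).
  join(
    dfContracts,
    $"rank" === 1 &&
      abs(datediff($"try_dt", to_date($"contract_day", "yyyy-MM-dd"))) <= 2,
    "left_outer").
  show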
Note that using a Window function without partitionBy does not scale well: Spark moves all rows into a single partition to evaluate it.
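With the real data, where every row carries an id, the fix is to partition the window by it; a sketch under the assumption that the id column exists in both DataFrames:

// Partition the window by the shared id so each user's tries are
// ranked independently, and join on id as well.
val w = Window.partitionBy($"id").orderBy($"try_ts")
dfTries.
  withColumn("try_ts", unix_timestamp($"try_day", "yyyy-MM-dd")).
  withColumn("next_try_ts", lead($"try_ts", 1).over(w)).
  withColumn("rank", when(
    $"next_try_ts".isNull || $"next_try_ts" - $"try_ts" > 10 * 24 * 3600, 1
  ).otherwise(2)).
  join(
    dfContracts.withColumnRenamed("id", "contract_id"),
    $"id" === $"contract_id" && $"rank" === 1 &&
      abs($"try_ts" - unix_timestamp($"contract_day", "yyyy-MM-dd")) <= 2 * 24 * 3600,
    "left_outer")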