I'm trying to do a rank-and-join within a 10-day window, using Scala Spark or SQL.
I have one table with users' tries and another with contracts. They share an ID I can join on, but on top of that ID I need to take a specific time range into account. To simplify the example, assume all my entries have the same ID:
Tries:
try, try_day
Try 1, 2018-08-01
Try 2, 2018-09-01
Try 3, 2018-10-01
Try 4, 2018-10-02
Contracts:
contract, contract_day
Contract 1, 2018-08-01
Contract 2, 2018-09-02
Contract 3, 2018-10-01
I only want to join them if 1) consecutive tries are more than 10 days apart, and 2) the dates in the two tables differ by at most 2 days. So in the end I get:
try, try_day, contract, contract_day, (explanation)
Try 1, 2018-08-01, Contract 1, 2018-08-01 , (same date and more than 10 days between try 1 and try 2)
Try 2, 2018-09-01, Contract 2, 2018-09-02, (difference of less than 2 days, and more than 10 days between try 2 and try 3)
Try 3, 2018-10-01, null, null, (less than 10 days between try 3 and try 4, so the contract should match try 4 only)
Try 4, 2018-10-02, Contract 3, 2018-10-01, (difference of 1 day, and try 4 is the last try)
I think I want to rank the tries by date and then join only on the top-ranked one. However, the ranking needs to apply only within a 10-day timeframe:
SELECT *, dense_rank() OVER (PARTITION BY id ORDER BY try_day DESC) as rank
FROM tries
Unfortunately, this ranks them from 1 to 4, whereas the ranking I want is:
try, try_day, rank
Try 1, 2018-08-01, 1
Try 2, 2018-09-01, 1
Try 3, 2018-10-01, 2
Try 4, 2018-10-02, 1
Then I would join on the tries with rank 1 whose dates fall within 2 days of a contract.
Better ideas on how to achieve this are welcome too. Thanks!
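To make the idea concrete, here is a rough sketch of the ranking I have in mind, as a gaps-and-islands pattern with the DataFrame API (untested; tries, id, and try_day stand in for my real table and columns, and try_day is assumed to be a date):

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._

// Flag a new group whenever the gap to the previous try exceeds 10 days,
// turn the flags into a group id with a running sum, then rank each group
// by date descending so the latest try in a 10-day cluster gets rank 1.
val byDay = Window.partitionBy($"id").orderBy($"try_day")
val ranked = tries.
  withColumn("gap", datediff($"try_day", lag($"try_day", 1).over(byDay))).
  withColumn("new_grp", when($"gap".isNull || $"gap" > 10, 1).otherwise(0)).
  withColumn("grp", sum($"new_grp").over(byDay)).
  withColumn("rank", dense_rank().over(
    Window.partitionBy($"id", $"grp").orderBy($"try_day".desc)))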
Answer 0 (score: 1)
Here's one approach: use unix_timestamp and the window function lead to compute rank from the gap between consecutive try_days, then left-join the two DataFrames with a condition on try_day and contract_day:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._
val dfTries = Seq(
  ("Try 1", "2018-08-01"),
  ("Try 2", "2018-09-01"),
  ("Try 3", "2018-10-01"),
  ("Try 4", "2018-10-02")
).toDF("try", "try_day")
val dfContracts = Seq(
  ("contract 1", "2018-08-01"),
  ("contract 2", "2018-09-02"),
  ("contract 3", "2018-10-01")
).toDF("contract", "contract_day")
dfTries.
  // convert try_day to epoch seconds for interval arithmetic
  withColumn("try_ts", unix_timestamp($"try_day", "yyyy-MM-dd")).
  // lead(1) fetches the NEXT try's timestamp (null for the last try)
  withColumn("next_try_ts", lead($"try_ts", 1).over(Window.orderBy($"try"))).
  // rank 1 if there is no next try or it is more than 10 days away, else 2
  withColumn("rank", when(
    $"next_try_ts".isNull || abs($"try_ts" - $"next_try_ts") > 10 * 24 * 3600,
    1
  ).otherwise(2)
  ).
  // only rank-1 tries within 2 days of a contract get a match; the rest get nulls
  join(
    dfContracts,
    $"rank" === 1 && abs($"try_ts" - unix_timestamp($"contract_day", "yyyy-MM-dd")) <= 2 * 24 * 3600,
    "left_outer").
  show
// +-----+----------+----------+-----------+----+----------+------------+
// | try|   try_day|    try_ts|next_try_ts|rank|  contract|contract_day|
// +-----+----------+----------+-----------+----+----------+------------+
// |Try 1|2018-08-01|1533106800| 1535785200| 1|contract 1| 2018-08-01|
// |Try 2|2018-09-01|1535785200| 1538377200| 1|contract 2| 2018-09-02|
// |Try 3|2018-10-01|1538377200| 1538463600| 2| null| null|
// |Try 4|2018-10-02|1538463600| null| 1|contract 3| 2018-10-01|
// +-----+----------+----------+-----------+----+----------+------------+
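A variant of the same logic (a sketch, not part of the answer above) compares dates with datediff instead of epoch seconds, which avoids the time-zone dependence of unix_timestamp (requires Spark 2.2+ for the two-argument to_date):

// Same ranking and join, using day arithmetic rather than seconds.
dfTries.
  withColumn("try_dt", to_date($"try_day", "yyyy-MM-dd")).
  withColumn("next_try_dt", lead($"try_dt", 1).over(Window.orderBy($"try_dt"))).
  withColumn("rank", when(
    $"next_try_dt".isNull || datediff($"next_try_dt", $"try_dt") > 10, 1
  ).otherwise(2)).
  join(
    dfContracts,
    $"rank" === 1 &&
      abs(datediff($"try_dt", to_date($"contract_day", "yyyy-MM-dd"))) <= 2,
    "left_outer").
  show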
Note that using a Window function without partitionBy does not scale well: Spark moves all rows into a single partition to evaluate it.
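With the real data, where every row carries an id, the fix is to partition the window by it; a sketch under the assumption that the id column exists in both DataFrames:

// Partition the window by the shared id so each user's tries are
// ranked independently, and join on id as well.
val w = Window.partitionBy($"id").orderBy($"try_ts")
dfTries.
  withColumn("try_ts", unix_timestamp($"try_day", "yyyy-MM-dd")).
  withColumn("next_try_ts", lead($"try_ts", 1).over(w)).
  withColumn("rank", when(
    $"next_try_ts".isNull || $"next_try_ts" - $"try_ts" > 10 * 24 * 3600, 1
  ).otherwise(2)).
  join(
    dfContracts.withColumnRenamed("id", "contract_id"),
    $"id" === $"contract_id" && $"rank" === 1 &&
      abs($"try_ts" - unix_timestamp($"contract_day", "yyyy-MM-dd")) <= 2 * 24 * 3600,
    "left_outer")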