I have two DataFrames:
df1:
+--------------+---------------------+
|id_device |tracking_time |
+--------------+---------------------+
|20 |2020-02-19 02:37:45 |
|5 |2020-02-17 17:15:45 |
+--------------+---------------------+
df2:
+--------------+---------------------+
|id_device     |tracking_time        |
+--------------+---------------------+
|20            |2019-02-19 02:41:45  |
|20            |2020-01-17 17:15:45  |
+--------------+---------------------+
I want the following output:
+--------------+---------------------+---------------------+
|id_device     |tracking_time        |df2.tracking_time    |
+--------------+---------------------+---------------------+
|20            |2020-02-19 02:37:45  |2019-02-19 02:41:45  |
|5             |2020-02-17 17:15:45  |null                 |
+--------------+---------------------+---------------------+
I tried the following code:
df1.registerTempTable("data");
df2.createOrReplaceTempView("tdays");
Dataset<Row> d_f = sparkSession.sql(
    "select a.*, b.* from data as a "
  + "LEFT JOIN (select * from tdays) as b "
  + "on b.id_device == a.id_device and b.tracking_time < a.tracking_time");
I get the following output:
+--------------+---------------------+---------------------+---------------------+
|id_device     |tracking_time        |b.id_device          |b.tracking_time      |
+--------------+---------------------+---------------------+---------------------+
|20            |2020-02-19 02:37:45  |20                   |2019-02-19 02:41:45  |
|20            |2020-02-19 02:37:45  |20                   |2020-01-17 17:15:45  |
|5             |2020-02-17 17:15:45  |null                 |null                 |
+--------------+---------------------+---------------------+---------------------+
What I want from the left join is only the first row per device, ordered by df2.tracking_time desc limit 1.
I need your help.
Answer 0 (score: 1)
Before joining, you can reduce df2 to the minimum date per id_device:
import org.apache.spark.sql.functions.min

val df1 = ...
val df2 = ...
// one row per id_device, holding the earliest tracking_time from df2;
// note that .as() belongs on the aggregated column, not on the Dataset
val df2min = df2.groupBy("id_device")
  .agg(min("tracking_time").as("df2.tracking_time"))
val result = df1.join(df2min, Seq("id_device"), "left")
df2min contains only one row per id, holding the minimum date from df2. The left join therefore returns the expected result.
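To check the join semantics without a Spark session, the same group-by-minimum plus left-join logic can be sketched in plain Java. This is only an illustration of the technique, not Spark code; the class name MinJoinDemo is made up, and the sample rows are copied from the question's tables. Comparing "yyyy-MM-dd HH:mm:ss" strings lexicographically orders them chronologically, so String.compareTo stands in for a timestamp comparison:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MinJoinDemo {
    public static void main(String[] args) {
        // sample data from the question: {id_device, tracking_time}
        List<String[]> df1 = List.of(
            new String[]{"20", "2020-02-19 02:37:45"},
            new String[]{"5",  "2020-02-17 17:15:45"});
        List<String[]> df2 = List.of(
            new String[]{"20", "2019-02-19 02:41:45"},
            new String[]{"20", "2020-01-17 17:15:45"});

        // "groupBy + min": keep the smallest tracking_time per id_device,
        // like df2.groupBy("id_device").agg(min("tracking_time"))
        Map<String, String> df2min = new HashMap<>();
        for (String[] row : df2) {
            df2min.merge(row[0], row[1], (a, b) -> a.compareTo(b) <= 0 ? a : b);
        }

        // left join on id_device: every df1 row survives,
        // unmatched ids get null (like id_device = 5)
        for (String[] row : df1) {
            System.out.println(row[0] + " | " + row[1] + " | " + df2min.get(row[0]));
        }
    }
}
```

Running it prints one line per df1 row, with null for id 5, matching the expected output table above.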