Join two dataframes and limit the rows of one of them

Time: 2020-08-23 10:29:57

Tags: java apache-spark join

I have two dataframes:

df1:
+--------------+---------------------+
|id_device     |tracking_time        |
+--------------+---------------------+
|20            |2020-02-19 02:37:45  |
|5             |2020-02-17 17:15:45  |
+--------------+---------------------+

df2:
+--------------+---------------------+
|id_device     |tracking_time        |
+--------------+---------------------+
|20            |2019-02-19 02:41:45  |
|20            |2020-01-17 17:15:45  |
+--------------+---------------------+

I want to get the following output:

+--------------+---------------------+---------------------+
|id_device     |tracking_time        |df2.tracking_time    |
+--------------+---------------------+---------------------+
|20            |2020-02-19 02:37:45  |2019-02-19 02:41:45  |
|5             |2020-02-17 17:15:45  |null                 |
+--------------+---------------------+---------------------+

I tried the following code:

// registerTempTable is deprecated; createOrReplaceTempView does the same job
df1.createOrReplaceTempView("data");
df2.createOrReplaceTempView("tdays");
Dataset<Row> d_f = sparkSession.sql(
    "select a.*, b.* from data as a "
    + "LEFT JOIN (select * from tdays) as b "
    + "on b.id_device = a.id_device and b.tracking_time < a.tracking_time");

I get the following output:

+--------------+---------------------+--------------+---------------------+
|id_device     |tracking_time        |b.id_device   |b.tracking_time      |
+--------------+---------------------+--------------+---------------------+
|20            |2020-02-19 02:37:45  |20            |2019-02-19 02:41:45  |
|20            |2020-02-19 02:37:45  |20            |2020-01-17 17:15:45  |
|5             |2020-02-17 17:15:45  |null          |null                 |
+--------------+---------------------+--------------+---------------------+

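What I want is to left join the first dataframe with the result of df2 ordered by df2.tracking_time desc limit 1, so that each row of df1 is matched with at most one row of df2.

I need your help.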

1 answer:

Answer 0: (score: 1)

Before joining, you can reduce df2 to the minimum date per id_device:

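import org.apache.spark.sql.functions.min

val df1 = ...
val df2 = ...
// Alias the aggregated column itself; calling .as(...) after agg(...) would only alias the Dataset
val df2min = df2.groupBy("id_device").agg(min("tracking_time").as("df2.tracking_time"))
val result = df1.join(df2min, Seq("id_device"), "left")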

df2min contains only one row per id_device, holding the minimum date from df2. The left join therefore returns the expected result.
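
The pre-aggregation above takes the minimum over all df2 rows of a device, without the question's extra condition b.tracking_time < a.tracking_time. The following is a minimal plain-Java sketch (no Spark dependency) of the intended logic on the sample rows, meant mainly for checking what a Spark query should return. The class TrackingJoin and the method leftJoinEarliest are hypothetical names used for illustration, and timestamps are assumed to be strings in yyyy-MM-dd HH:mm:ss format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TrackingJoin {

    /**
     * Left-joins each {id_device, tracking_time} row of `left` to the earliest
     * row of `right` for the same device whose time is strictly earlier.
     * Returns {id_device, tracking_time, matched_time}; matched_time is null
     * when nothing qualifies, as in a SQL LEFT JOIN.
     */
    static List<String[]> leftJoinEarliest(List<String[]> left, List<String[]> right) {
        List<String[]> out = new ArrayList<>();
        for (String[] a : left) {
            String best = null;
            for (String[] b : right) {
                // Same join condition as the question's SQL. Comparing strings works
                // only because "yyyy-MM-dd HH:mm:ss" sorts chronologically (assumption).
                boolean matches = b[0].equals(a[0]) && b[1].compareTo(a[1]) < 0;
                // Keep the minimum; flip this comparison to "> 0" to keep the latest.
                if (matches && (best == null || b[1].compareTo(best) < 0)) {
                    best = b[1];
                }
            }
            out.add(new String[] {a[0], a[1], best});
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample rows copied from the question
        List<String[]> df1 = Arrays.asList(
                new String[] {"20", "2020-02-19 02:37:45"},
                new String[] {"5", "2020-02-17 17:15:45"});
        List<String[]> df2 = Arrays.asList(
                new String[] {"20", "2019-02-19 02:41:45"},
                new String[] {"20", "2020-01-17 17:15:45"});
        for (String[] row : leftJoinEarliest(df1, df2)) {
            System.out.println(String.join(" | ", row[0], row[1], String.valueOf(row[2])));
        }
    }
}
```

On the sample rows, main prints `20 | 2020-02-19 02:37:45 | 2019-02-19 02:41:45` and `5 | 2020-02-17 17:15:45 | null`, which matches the desired output. In Spark SQL, one way to get the same effect is probably to keep the question's LEFT JOIN, select MIN(b.tracking_time), and GROUP BY a.id_device, a.tracking_time. Using MAX instead would return the latest earlier row, which is what ordering by tracking_time desc with limit 1 describes.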