Spark conditional join where a column value is between two column values

Time: 2021-02-28 08:12:20

Tags: scala apache-spark apache-spark-sql

Suppose we have a messy df that looks like this:

import spark.implicits._  // needed for .toDF and the $"col" syntax (available by default in spark-shell)

val df = Seq(
    ("id1", "2020-08-02 16:42:00", "2020-08-02 16:45:00", "item1", 1),
    ("id1", "2020-08-02 16:43:00", "2020-08-02 16:44:00", "item2", 0),
    ("id1", "2020-08-02 16:44:00", "2020-08-02 16:45:00", "item1", 0),
    ("id1", "2020-08-02 16:45:00", "2020-08-02 16:47:00", "item3", 0),
    ("id1", "2020-08-02 16:47:00", "2020-08-02 16:51:00", "item4", 0),
    ("id1", "2020-08-02 16:51:00", "2020-08-02 16:52:00", "item3", 0))
.toDF("id", "start_time", "end_time", "item_id", "flag")

df.show()

+---+-------------------+-------------------+-------+----+
| id|         start_time|           end_time|item_id|flag|
+---+-------------------+-------------------+-------+----+
|id1|2020-08-02 16:42:00|2020-08-02 16:45:00|  item1|   1|
|id1|2020-08-02 16:43:00|2020-08-02 16:44:00|  item2|   0|
|id1|2020-08-02 16:44:00|2020-08-02 16:45:00|  item1|   0|
|id1|2020-08-02 16:45:00|2020-08-02 16:47:00|  item3|   0|
|id1|2020-08-02 16:47:00|2020-08-02 16:51:00|  item4|   0|
|id1|2020-08-02 16:51:00|2020-08-02 16:52:00|  item3|   0|
+---+-------------------+-------------------+-------+----+

Note that the first row has start_time = 16:42:00 and end_time = 16:45:00, and the next two rows each have a start_time that falls between the first row's start_time and end_time. I already have a column flag that marks when this situation is observed. In that case I want to keep the first row and drop the next two rows. I am only showing a small sample here, but this situation can occur many times.
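
For context, the question does not show how the flag column was built. One possible way to derive a flag like this is a lead-based window check; the snippet below is only an illustrative assumption and not part of the original question (dfRaw stands for the same data without the flag column):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lead, when}

// Flag a row when the next row's start_time falls strictly inside
// this row's (start_time, end_time) interval.
// String timestamps in 'yyyy-MM-dd HH:mm:ss' format compare correctly lexicographically.
val w = Window.partitionBy("id").orderBy("start_time")

val dfWithFlag = dfRaw
  .withColumn("next_start", lead($"start_time", 1).over(w))
  .withColumn("flag",
    when(($"next_start" > $"start_time") && ($"next_start" < $"end_time"), 1)
      .otherwise(0))
  .drop("next_start")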

So the output I want is:

+---+-------------------+-------------------+-------+
| id|         start_time|           end_time|item_id|
+---+-------------------+-------------------+-------+
|id1|2020-08-02 16:42:00|2020-08-02 16:45:00|  item1|
|id1|2020-08-02 16:45:00|2020-08-02 16:47:00|  item3|
|id1|2020-08-02 16:47:00|2020-08-02 16:51:00|  item4|
|id1|2020-08-02 16:51:00|2020-08-02 16:52:00|  item3|
+---+-------------------+-------------------+-------+

I tried creating a separate df filtered to the rows with flag = 1 and doing a conditional join:

spark.conf.set("spark.sql.crossJoin.enabled", "true")

val dfFiltered = df.filter("flag == 1")

df.join(dfFiltered, 
  (df("id") === dfFiltered("id")) && 
  (df("start_time") > dfFiltered("start_time")) && 
  (df("start_time") < dfFiltered("end_time")))
.show()

But it returns the wrong result.

2 answers:

Answer 0 (score: 2)

You want to use a left_anti join:

val result = df.as("df").drop("flag")   // drop flag up front so it doesn't appear in the result
  .join(
    dfFiltered.as("filter"),
    ($"df.id" === $"filter.id") &&
      ($"df.start_time" > $"filter.start_time") &&
      ($"df.start_time" < $"filter.end_time"),
    "left_anti"
  )

result.show
//+---+-------------------+-------------------+-------+
//| id|         start_time|           end_time|item_id|
//+---+-------------------+-------------------+-------+
//|id1|2020-08-02 16:42:00|2020-08-02 16:45:00|  item1|
//|id1|2020-08-02 16:45:00|2020-08-02 16:47:00|  item3|
//|id1|2020-08-02 16:47:00|2020-08-02 16:51:00|  item4|
//|id1|2020-08-02 16:51:00|2020-08-02 16:52:00|  item3|
//+---+-------------------+-------------------+-------+

Or use a correlated subquery with EXISTS in the WHERE clause:
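
(The SQL itself was not preserved in this dump; below is a minimal sketch of what an equivalent query might look like. The temp-view names are assumptions, and it is written as NOT EXISTS to reproduce the left_anti semantics.)

df.createOrReplaceTempView("df")
dfFiltered.createOrReplaceTempView("df_filtered")

spark.sql("""
  SELECT id, start_time, end_time, item_id
  FROM df
  WHERE NOT EXISTS (
    SELECT 1
    FROM df_filtered f
    WHERE df.id = f.id
      AND df.start_time > f.start_time
      AND df.start_time < f.end_time
  )
""").show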

Answer 1 (score: 1)

Another way to solve this without a join: for each row, take the maximum end_time of the preceding flagged rows and filter out the rows whose start_time is less than that max(end_time).

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{max, when}

val df2 = df.withColumn(
    "last_end",
    // running max of end_time over the preceding flagged rows; the -1 upper
    // bound excludes the current row, so the flagged row itself is kept
    max(
        when($"flag" === 1, $"end_time")
    ).over(Window.partitionBy("id").orderBy("start_time").rowsBetween(Window.unboundedPreceding, -1))
).filter("last_end is null or start_time >= last_end").drop("last_end")

df2.show
+---+-------------------+-------------------+-------+----+
| id|         start_time|           end_time|item_id|flag|
+---+-------------------+-------------------+-------+----+
|id1|2020-08-02 16:42:00|2020-08-02 16:45:00|  item1|   1|
|id1|2020-08-02 16:45:00|2020-08-02 16:47:00|  item3|   0|
|id1|2020-08-02 16:47:00|2020-08-02 16:51:00|  item4|   0|
|id1|2020-08-02 16:51:00|2020-08-02 16:52:00|  item3|   0|
+---+-------------------+-------------------+-------+----+