Suppose we have a messy df that looks like this:
// assumes a live SparkSession named `spark`; needed for Seq(...).toDF
import spark.implicits._

val df = Seq(
("id1", "2020-08-02 16:42:00", "2020-08-02 16:45:00", "item1", 1),
("id1", "2020-08-02 16:43:00", "2020-08-02 16:44:00", "item2", 0),
("id1", "2020-08-02 16:44:00", "2020-08-02 16:45:00", "item1", 0),
("id1", "2020-08-02 16:45:00", "2020-08-02 16:47:00", "item3", 0),
("id1", "2020-08-02 16:47:00", "2020-08-02 16:51:00", "item4", 0),
("id1", "2020-08-02 16:51:00", "2020-08-02 16:52:00", "item3", 0))
.toDF("id", "start_time", "end_time", "item_id", "flag")
df.show()
+---+-------------------+-------------------+-------+----+
| id| start_time| end_time|item_id|flag|
+---+-------------------+-------------------+-------+----+
|id1|2020-08-02 16:42:00|2020-08-02 16:45:00| item1| 1|
|id1|2020-08-02 16:43:00|2020-08-02 16:44:00| item2| 0|
|id1|2020-08-02 16:44:00|2020-08-02 16:45:00| item1| 0|
|id1|2020-08-02 16:45:00|2020-08-02 16:47:00| item3| 0|
|id1|2020-08-02 16:47:00|2020-08-02 16:51:00| item4| 0|
|id1|2020-08-02 16:51:00|2020-08-02 16:52:00| item3| 0|
+---+-------------------+-------------------+-------+----+
Note that the first row has start_time = 16:42:00 and end_time = 16:45:00, while the next two rows each have a start_time that falls between the start_time and end_time of that first row. I already have a flag column that detects when this happens. In such cases I want to keep the first row and delete the two rows that follow it. This is just a small sample; the situation can occur many times in the real data.
So the output I want is:
+---+-------------------+-------------------+-------+
| id| start_time| end_time|item_id|
+---+-------------------+-------------------+-------+
|id1|2020-08-02 16:42:00|2020-08-02 16:45:00| item1|
|id1|2020-08-02 16:45:00|2020-08-02 16:47:00| item3|
|id1|2020-08-02 16:47:00|2020-08-02 16:51:00| item4|
|id1|2020-08-02 16:51:00|2020-08-02 16:52:00| item3|
+---+-------------------+-------------------+-------+
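As an aside, for readers who don't already have such a column: one hypothetical way to derive flag on this sample is with lead, checking whether the immediately following row starts strictly before the current row ends. This is a minimal sketch that only inspects the next row, so data with longer overlap chains may need a more general test.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lead, when}

// Flag a row when the next row (ordered by start_time) begins strictly
// before this row's end_time, i.e. the next interval is nested inside it.
// String comparison is safe here because the timestamps are zero-padded
// and share the fixed format "yyyy-MM-dd HH:mm:ss".
val w = Window.partitionBy("id").orderBy("start_time")
val withFlag = df.drop("flag").withColumn(
  "flag",
  when(lead(col("start_time"), 1).over(w) < col("end_time"), 1).otherwise(0)
)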
I tried creating a separate df that keeps only the rows with flag = 1 and doing a conditional join:
spark.conf.set("spark.sql.crossJoin.enabled", "true")
val dfFiltered = df.filter("flag == 1")
df.join(dfFiltered,
  (df("id") === dfFiltered("id")) &&
  (df("start_time") > dfFiltered("start_time")) &&
  (df("start_time") < dfFiltered("end_time")))
.show()
But it returns the wrong result: the inner join keeps exactly the rows that fall inside a flagged interval, rather than removing them.
Answer 0 (score: 2)
You want to use a left_anti join:
val result = df.as("df").drop("flag")
  .join(
    dfFiltered.as("filter"),
    ($"df.id" === $"filter.id") &&
    ($"df.start_time" > $"filter.start_time") &&
    ($"df.start_time" < $"filter.end_time"),
    "left_anti" // keep only the left-side rows that have no match on the right
  )
result.show
//+---+-------------------+-------------------+-------+
//| id| start_time| end_time|item_id|
//+---+-------------------+-------------------+-------+
//|id1|2020-08-02 16:42:00|2020-08-02 16:45:00| item1|
//|id1|2020-08-02 16:45:00|2020-08-02 16:47:00| item3|
//|id1|2020-08-02 16:47:00|2020-08-02 16:51:00| item4|
//|id1|2020-08-02 16:51:00|2020-08-02 16:52:00| item3|
//+---+-------------------+-------------------+-------+
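A left_anti join keeps only the left-side rows that find no match on the right, so every row whose start_time falls inside a flagged interval disappears. And since an anti join returns only the left side's columns, flag can be dropped up front to get the desired four-column output directly.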
Or use a correlated subquery with EXISTS in the WHERE clause:
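The SQL itself isn't shown in the answer; a minimal sketch of that variant, written as NOT EXISTS because we keep the rows that are not contained in any flagged interval, and assuming the DataFrame is registered under the illustrative view name events:
// Register the DataFrame so it can be queried with Spark SQL.
df.createOrReplaceTempView("events")

val resultSql = spark.sql("""
  SELECT id, start_time, end_time, item_id
  FROM events t
  WHERE NOT EXISTS (
    SELECT 1
    FROM events f
    WHERE f.flag = 1
      AND f.id = t.id
      AND t.start_time > f.start_time
      AND t.start_time < f.end_time
  )
""")
resultSql.show()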
Answer 1 (score: 1)
Another way to solve this without a join: compute, for each row, the maximum end_time over the preceding flagged rows, and filter out the rows whose start_time is earlier than that maximum. The rowsBetween(Window.unboundedPreceding, -1) frame makes each row look only at the rows before it, so a flagged row never filters itself out.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{max, when}
import spark.implicits._

val df2 = df.withColumn(
  "last_end",
  // maximum end_time over all *previous* flagged rows with the same id
  max(
    when($"flag" === 1, $"end_time")
  ).over(Window.partitionBy("id").orderBy("start_time").rowsBetween(Window.unboundedPreceding, -1))
).filter("last_end is null or start_time >= last_end").drop("last_end")
df2.show
+---+-------------------+-------------------+-------+----+
| id| start_time| end_time|item_id|flag|
+---+-------------------+-------------------+-------+----+
|id1|2020-08-02 16:42:00|2020-08-02 16:45:00| item1| 1|
|id1|2020-08-02 16:45:00|2020-08-02 16:47:00| item3| 0|
|id1|2020-08-02 16:47:00|2020-08-02 16:51:00| item4| 0|
|id1|2020-08-02 16:51:00|2020-08-02 16:52:00| item3| 0|
+---+-------------------+-------------------+-------+----+
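Note that df2 still carries the helper flag column; if the goal is to match the desired four-column output exactly, finish with df2.drop("flag").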