I have a DataFrame (sqlDF) similar to the following (this example is simplified), where I'm trying to remove all rows whose start_date and end_date fall within the start and end dates of another row:
+-------+-------------+-------------------+-------------------+
|     id|         type|         start_date|           end_date|
+-------+-------------+-------------------+-------------------+
|      1|      unknown|2018-11-14 16:03:47|2018-12-06 21:23:22| (remove as it's within the next row's start and end dates)
|      1|          ios|2018-10-13 14:58:22|2019-08-26 15:50:45|
|      1|      android|2019-08-29 02:41:40|2019-09-05 23:03:20|
|      2|          ios|2017-12-19 02:25:34|2019-08-09 15:41:30|
|      2|      windows|2018-07-10 05:30:52|2018-07-13 10:11:34| (remove as it's within the previous row's start and end dates)
|      2|      android|2019-05-14 18:33:15|2019-06-27 06:10:53| (remove as it's within another row's start and end dates)
+-------+-------------+-------------------+-------------------+
First, the end user asked me to remove all records with fewer than 5 days between start_date and end_date, which I did with:
val dfWithoutTempHandsets = sqlDF.filter(datediff(col("end_date"), col("start_date")) > 5)
producing a DataFrame like this:
+-------+-------------+-------------------+-------------------+
|     id|         type|         start_date|           end_date|
+-------+-------------+-------------------+-------------------+
|      1|      unknown|2018-11-14 16:03:47|2018-12-06 21:23:22|
|      1|          ios|2018-10-13 14:58:22|2019-08-26 15:50:45|
|      1|      android|2019-08-29 02:41:40|2019-09-05 23:03:20|
|      2|          ios|2017-12-19 02:25:34|2019-08-09 15:41:30|
|      2|      android|2019-05-14 18:33:15|2019-06-27 06:10:53|
+-------+-------------+-------------------+-------------------+
Now I need to filter out the rows whose start_date and end_date fall "within" the start and end dates of another row with the same id, so that the resulting DataFrame looks like:
+-------+-------------+-------------------+-------------------+
|     id|         type|         start_date|           end_date|
+-------+-------------+-------------------+-------------------+
|      1|          ios|2018-10-13 14:58:22|2019-08-26 15:50:45|
|      1|      android|2019-08-29 02:41:40|2019-09-05 23:03:20|
|      2|          ios|2017-12-19 02:25:34|2019-08-09 15:41:30|
+-------+-------------+-------------------+-------------------+
After reading several blog posts and Stack Overflow posts about Spark window functions, I gather they are the answer here. But I'm struggling to find examples of a similar use case where multiple dates in one row are compared against another row's dates this way. I believe my windowSpec is off:
val windowSpec = Window.partitionBy("id", "type").orderBy("start_date")
but from there I'm not sure how to use the windowSpec to select only the rows whose start and end dates do not fall within another row's dates for that id.
Edit: I've been given a new requirement to apply the above logic only to rows with a type of "NULL" or "Unknown". But the answers here have gotten me much closer!
Answer 0 (score: 2)
Here's the logic I would consider using:

If end_date in the current row is less than or equal to end_date in ANY of the previous rows, under a window partitioned by id and ordered by start_date in ascending order, then the current row's date range must be contained within the date range of some previous row (every previous row starts no later than the current row, so it suffices to compare against the maximum previous end_date).

Translating this into sample code (which also includes the > 5 days filtering):
import java.sql.Timestamp
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (1, "unknown", Timestamp.valueOf("2018-11-14 16:03:47"), Timestamp.valueOf("2018-12-06 21:23:22")),
  (1, "ios", Timestamp.valueOf("2018-10-13 14:58:22"), Timestamp.valueOf("2019-08-26 15:50:45")),
  (1, "android", Timestamp.valueOf("2019-08-29 02:41:40"), Timestamp.valueOf("2019-09-05 23:03:20")),
  (2, "ios", Timestamp.valueOf("2017-12-19 02:25:34"), Timestamp.valueOf("2019-08-09 15:41:30")),
  (2, "unknown", Timestamp.valueOf("2018-07-10 05:30:52"), Timestamp.valueOf("2018-07-13 10:11:34")),
  (2, "android", Timestamp.valueOf("2019-05-14 18:33:15"), Timestamp.valueOf("2019-06-27 06:10:53"))
).toDF("id", "type", "start_date", "end_date")

// all previous rows (ordered by start_date) within the same id
val win = Window.partitionBy("id").orderBy($"start_date").
  rowsBetween(Window.unboundedPreceding, -1)

df.
  // keep only rows spanning more than 5 days of elapsed time
  where(unix_timestamp($"end_date") - unix_timestamp($"start_date") > 5*24*3600).
  // a row is contained if some earlier-starting row ends at or after it
  withColumn("isContained",
    when($"end_date" <= max($"end_date").over(win), true).otherwise(false)
  ).
  where(! $"isContained").
  show
// +---+-------+-------------------+-------------------+-----------+
// | id| type| start_date| end_date|isContained|
// +---+-------+-------------------+-------------------+-----------+
// | 1| ios|2018-10-13 14:58:22|2019-08-26 15:50:45| false|
// | 1|android|2019-08-29 02:41:40|2019-09-05 23:03:20| false|
// | 2| ios|2017-12-19 02:25:34|2019-08-09 15:41:30| false|
// +---+-------+-------------------+-------------------+-----------+
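As for the edit in the question (apply the containment logic only to rows whose type is NULL or "Unknown"): one way, sketched here under the assumption that the type comparison should be case-insensitive, is to fold that condition into the final filter. With this data, the contained android row for id 2 would then be kept, since its type is neither NULL nor unknown.

df.
  where(unix_timestamp($"end_date") - unix_timestamp($"start_date") > 5*24*3600).
  withColumn("isContained",
    when($"end_date" <= max($"end_date").over(win), true).otherwise(false)
  ).
  // drop a contained row only when its type is NULL or "unknown"
  where(!($"isContained" && ($"type".isNull || lower($"type") === "unknown"))).
  show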
Note that for the > 5 days filtering I'm using unix_timestamp rather than datediff, which mechanically compares only the difference between the day values (e.g. datediff("2019-01-06 12:00:00", "2019-01-01 00:00:00") > 5 is false, even though 5.5 days have elapsed).
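A quick sketch illustrating the difference (the one-row DataFrame and its column names s and e are made up for this demo):

Seq(("2019-01-01 00:00:00", "2019-01-06 12:00:00")).toDF("s", "e").
  select(
    datediff($"e", $"s").as("datediff_days"),  // 5: only the date parts are compared
    ((unix_timestamp($"e") - unix_timestamp($"s")) / 86400).as("elapsed_days")  // 5.5
  ).show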
Answer 1 (score: 1)
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
import spark.implicits._

val sqlDF = Seq(
  (1, "unknown", "2018-11-14 16:03:47", "2018-12-06 21:23:22"),
  (1, "ios", "2018-10-13 14:58:22", "2019-08-26 15:50:45"),
  (1, "android", "2019-08-29 02:41:40", "2019-09-05 23:03:20"),
  (2, "ios", "2017-12-19 02:25:34", "2019-08-09 15:41:30"),
  (2, "unknown", "2018-07-10 05:30:52", "2018-07-13 10:11:34"),
  (2, "android", "2019-05-14 18:33:15", "2019-06-27 06:10:53")
).toDF("id", "type", "start_date", "end_date")

// drop records spanning 5 days or fewer
val dfWithoutTempHandsets = sqlDF.filter(datediff(col("end_date"), col("start_date")) > 5)

// first start_date ascending = earliest start per id; first end_date descending = latest end per id
val windowSpec = Window.partitionBy(dfWithoutTempHandsets("id")).orderBy(dfWithoutTempHandsets("start_date"))
val windowSpec1 = Window.partitionBy(dfWithoutTempHandsets("id")).orderBy(dfWithoutTempHandsets("end_date").desc)
val dense = first(dfWithoutTempHandsets("start_date")).over(windowSpec)
val dense1 = first(dfWithoutTempHandsets("end_date")).over(windowSpec1)

val temp = dfWithoutTempHandsets.select(dfWithoutTempHandsets("id"), dfWithoutTempHandsets("type"), dfWithoutTempHandsets("start_date"), dfWithoutTempHandsets("end_date"), dense.alias("min_start_date"), dense1.alias("max_end_date"))

// keep a row only if it starts at the earliest start_date or ends at the latest end_date for its id
val finalDf = temp.filter(temp("start_date").leq(temp("min_start_date")).or(temp("end_date").geq(temp("max_end_date"))))

finalDf.show(false)
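In other words, a row survives only when it starts at its id's earliest start_date or ends at its id's latest end_date; every other row must lie inside another row's range. For reference, the result should look something like this (row order within each partition may vary):

// +---+-------+-------------------+-------------------+-------------------+-------------------+
// |id |type   |start_date         |end_date           |min_start_date     |max_end_date       |
// +---+-------+-------------------+-------------------+-------------------+-------------------+
// |1  |ios    |2018-10-13 14:58:22|2019-08-26 15:50:45|2018-10-13 14:58:22|2019-09-05 23:03:20|
// |1  |android|2019-08-29 02:41:40|2019-09-05 23:03:20|2018-10-13 14:58:22|2019-09-05 23:03:20|
// |2  |ios    |2017-12-19 02:25:34|2019-08-09 15:41:30|2017-12-19 02:25:34|2019-08-09 15:41:30|
// +---+-------+-------------------+-------------------+-------------------+-------------------+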