Spark Window function: filter out rows whose start_date and end_date fall within another row's start and end dates

Time: 2019-09-11 22:47:48

Tags: scala dataframe apache-spark window-functions

I have a DataFrame (sqlDF) similar to the following (this example is simplified), from which I am trying to remove all rows whose start_date and end_date fall within the start and end dates of another row:

+-------+-------------+-------------------+-------------------+
|    id |         type|         start_date|           end_date|
+-------+-------------+-------------------+-------------------+
|  1    |      unknown|2018-11-14 16:03:47|2018-12-06 21:23:22| (remove as it's within the next row's start and end dates)
|  1    |          ios|2018-10-13 14:58:22|2019-08-26 15:50:45|
|  1    |      android|2019-08-29 02:41:40|2019-09-05 23:03:20|
|  2    |          ios|2017-12-19 02:25:34|2019-08-09 15:41:30|
|  2    |      windows|2018-07-10 05:30:52|2018-07-13 10:11:34| (remove as it's within the previous row's start and end dates)
|  2    |      android|2019-05-14 18:33:15|2019-08-27 06:10:53| (remove as it's within another row's start and end dates)

First, the end user asked me to remove all records with fewer than 5 days between their start and end dates, which I did with the following:

val dfWithoutTempHandsets = sqlDF.filter(datediff(col("end_date"), col("start_date")) > 5)

which produces a DataFrame like this:

+-------+-------------+-------------------+-------------------+
|    id |         type|         start_date|           end_date|
+-------+-------------+-------------------+-------------------+
|  1    |      unknown|2018-11-14 16:03:47|2018-12-06 21:23:22| 
|  1    |          ios|2018-10-13 14:58:22|2019-08-26 15:50:45|
|  1    |      android|2019-08-29 02:41:40|2019-09-05 23:03:20|
|  2    |          ios|2017-12-19 02:25:34|2019-08-09 15:41:30|
|  2    |      android|2019-05-14 18:33:15|2019-06-27 06:10:53|

Now I need to filter out rows whose start and end dates are "within" the start and end dates of another row for the same id, so that the resulting DataFrame looks like:

+-------+-------------+-------------------+-------------------+
|    id |         type|         start_date|           end_date|
+-------+-------------+-------------------+-------------------+
|  1    |          ios|2018-10-13 14:58:22|2019-08-26 15:50:45|
|  1    |      android|2019-08-29 02:41:40|2019-09-05 23:03:20|
|  2    |          ios|2017-12-19 02:25:34|2019-08-09 15:41:30|

After reading several blog posts and Stack Overflow questions about Spark Window functions, I know they are the answer here. But I am struggling to find examples of a similar use case where multiple dates are compared against another row's dates in this way. I believe my windowSpec is off:

val windowSpec = Window.partitionBy("id", "type").orderBy("start_date")

But from there I am not sure how to leverage windowSpec to select only the rows that do not have a start_date and end_date within another row for that id.

Edit: I have been given a new requirement to apply the above logic only to rows whose type is 'NULL' or 'Unknown'. But the answers here have gotten me much closer!
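For that edited requirement, here is a minimal sketch of how the containment check used in the answers below could be restricted to those rows only. It assumes 'NULL' arrives as an actual null value, matches 'Unknown' case-insensitively, and reuses dfWithoutTempHandsets from above; treat it as an illustration rather than a definitive implementation:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// all earlier rows (ordered by start_date) within the same id
val win = Window.partitionBy("id").orderBy("start_date")
  .rowsBetween(Window.unboundedPreceding, -1)

val result = dfWithoutTempHandsets
  .withColumn("isContained",
    // only null / "unknown" rows are candidates for removal; other types are always kept
    (col("type").isNull || lower(col("type")) === "unknown") &&
      (col("end_date") <= max(col("end_date")).over(win)))
  // max(...) over an empty frame is null, so coalesce before negating
  .filter(!coalesce(col("isContained"), lit(false)))
  .drop("isContained")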

2 Answers:

Answer 0 (score: 2):

Here is the logic I would consider using:

Under a Window partitioned by id and ordered by start_date in ascending order, if the end_date of the current row is less than or equal to the end_date of ANY of the previous rows, then the date range of the current row must be contained within the date range of some previous row.

Translating this into sample code (including also the > 5 days filtering):

import java.sql.Timestamp
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (1, "unknown", Timestamp.valueOf("2018-11-14 16:03:47"), Timestamp.valueOf("2018-12-06 21:23:22")),
  (1, "ios", Timestamp.valueOf("2018-10-13 14:58:22"), Timestamp.valueOf("2019-08-26 15:50:45")),
  (1, "android", Timestamp.valueOf("2019-08-29 02:41:40"), Timestamp.valueOf("2019-09-05 23:03:20")),
  (2, "ios", Timestamp.valueOf("2017-12-19 02:25:34"), Timestamp.valueOf("2019-08-09 15:41:30")),
  (2, "unknown", Timestamp.valueOf("2018-07-10 05:30:52"), Timestamp.valueOf("2018-07-13 10:11:34")),
  (2, "android", Timestamp.valueOf("2019-05-14 18:33:15"), Timestamp.valueOf("2019-06-27 06:10:53"))
).toDF("id", "type", "start_date", "end_date")

// Window covering all previous rows (ordered by start_date) within the same id
val win = Window.partitionBy("id").orderBy($"start_date").
  rowsBetween(Window.unboundedPreceding, -1)

df.
  where(unix_timestamp($"end_date") - unix_timestamp($"start_date") > 5*24*3600).
  withColumn("isContained",
    // contained if some earlier-starting row already ends at or after this row's end_date
    when($"end_date" <= max($"end_date").over(win), true).otherwise(false)
  ).
  where(! $"isContained").
  show
// +---+-------+-------------------+-------------------+-----------+
// | id|   type|         start_date|           end_date|isContained|
// +---+-------+-------------------+-------------------+-----------+
// |  1|    ios|2018-10-13 14:58:22|2019-08-26 15:50:45|      false|
// |  1|android|2019-08-29 02:41:40|2019-09-05 23:03:20|      false|
// |  2|    ios|2017-12-19 02:25:34|2019-08-09 15:41:30|      false|
// +---+-------+-------------------+-------------------+-----------+

Note that for the > 5 days filtering I use unix_timestamp rather than datediff, which mechanically compares only the difference in day values (e.g. datediff("2019-01-06 12:00:00", "2019-01-01 00:00:00") is 5, so "> 5" is false even though the two timestamps are 5.5 days apart).
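To illustrate the difference with a tiny (hypothetical) example, reusing the imports from the snippet above:

Seq(("2019-01-01 00:00:00", "2019-01-06 12:00:00")).toDF("start_date", "end_date").
  select(
    datediff($"end_date", $"start_date").as("datediff_days"),
    ((unix_timestamp($"end_date") - unix_timestamp($"start_date")) / 86400d).as("elapsed_days")
  ).
  show
// datediff_days is 5 (so "> 5" evaluates to false) while elapsed_days is 5.5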

Answer 1 (score: 1):

import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
import spark.implicits._

val sqlDF = Seq(
  (1, "unknown", "2018-11-14 16:03:47", "2018-12-06 21:23:22"),
  (1, "ios", "2018-10-13 14:58:22", "2019-08-26 15:50:45"),
  (1, "android", "2019-08-29 02:41:40", "2019-09-05 23:03:20"),
  (2, "ios", "2017-12-19 02:25:34", "2019-08-09 15:41:30"),
  (2, "unknown", "2018-07-10 05:30:52", "2018-07-13 10:11:34"),
  (2, "android", "2019-05-14 18:33:15", "2019-06-27 06:10:53")
).toDF("id", "type", "start_date", "end_date")

val dfWithoutTempHandsets = sqlDF.filter(datediff(col("end_date"), col("start_date")) > 5)

// first row per id when ordered by start_date ascending => earliest start_date
val windowSpec = Window.partitionBy(dfWithoutTempHandsets("id")).orderBy(dfWithoutTempHandsets("start_date"))

// first row per id when ordered by end_date descending => latest end_date
val windowSpec1 = Window.partitionBy(dfWithoutTempHandsets("id")).orderBy(dfWithoutTempHandsets("end_date").desc)

val dense = first(dfWithoutTempHandsets("start_date")).over(windowSpec)

val dense1 = first(dfWithoutTempHandsets("end_date")).over(windowSpec1)

val temp = dfWithoutTempHandsets.select(
  dfWithoutTempHandsets("id"),
  dfWithoutTempHandsets("type"),
  dfWithoutTempHandsets("start_date"),
  dfWithoutTempHandsets("end_date"),
  dense.alias("min_start_date"),
  dense1.alias("max_end_date")
)

// keep only rows that hold either the earliest start_date or the latest end_date for their id
val finalDf = temp.filter(temp("start_date").leq(temp("min_start_date")).or(temp("end_date").geq(temp("max_end_date"))))

finalDf.show(false)