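How do I filter a Spark DataFrame based on occurrences of a column value, with a condition on a date column?

Time: 2018-11-22 11:13:13

Tags: scala apache-spark apache-spark-sql

Team, I am working with a DataFrame like the following: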

    df
    client   | date   
      C1     |08-NOV-18 11.29.43
      C2     |09-NOV-18 13.29.43
      C2     |09-NOV-18 18.29.43
      C3     |11-NOV-18 19.29.43
      C1     |12-NOV-18 10.29.43
      C2     |13-NOV-18 09.29.43
      C4     |14-NOV-18 20.29.43
      C1     |15-NOV-18 11.29.43
      C5     |16-NOV-18 15.29.43
      C10    |17-NOV-18 19.29.43
      C1     |18-NOV-18 12.29.43
      C2     |18-NOV-18 10.29.43
      C2     |19-NOV-18 09.29.43
      C6     |20-NOV-18 13.29.43
      C6     |21-NOV-18 14.29.43
      C1     |21-NOV-18 18.29.43
      C1     |22-NOV-18 11.29.43

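My goal is to filter this DataFrame and get a new DataFrame containing the last two occurrences of each client, but only if those two occurrences are less than 24 hours apart. For this example, the result must be: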

     client  |date
      C2     |18-NOV-18 10.29.43
      C2     |19-NOV-18 09.29.43
      C1     |21-NOV-18 18.29.43
      C1     |22-NOV-18 11.29.43

Any help, please!

3 answers:

Answer 0: (score: 1)

Use window functions. Check this out:

val df = Seq(("C1","08-NOV-18 11.29.43"),
  ("C2","09-NOV-18 13.29.43"),
  ("C2","09-NOV-18 18.29.43"),
  ("C3","11-NOV-18 19.29.43"),
  ("C1","12-NOV-18 10.29.43"),
  ("C2","13-NOV-18 09.29.43"),
  ("C4","14-NOV-18 20.29.43"),
  ("C1","15-NOV-18 11.29.43"),
  ("C5","16-NOV-18 15.29.43"),
  ("C10","17-NOV-18 19.29.43"),
  ("C1","18-NOV-18 12.29.43"),
  ("C2","18-NOV-18 10.29.43"),
  ("C2","19-NOV-18 09.29.43"),
  ("C6","20-NOV-18 13.29.43"),
  ("C6","21-NOV-18 14.29.43"),
  ("C1","21-NOV-18 18.29.43"),
  ("C1","22-NOV-18 11.29.43"))
  .toDF("client","dt")
  // parse the original string, then re-render it as "yyyy-MM-dd HH:mm:ss"
  .withColumn("dt", from_unixtime(unix_timestamp($"dt", "dd-MMM-yy HH.mm.ss"), "yyyy-MM-dd HH:mm:ss"))

df.createOrReplaceTempView("tbl")

val df2 = spark.sql("""
  select * from (
    select client, dt,
           count(*) over (partition by client) cnt,
           rank() over (partition by client order by dt desc) rk1
    from tbl
  ) t
  where cnt > 1 and rk1 in (1, 2)
""")

df2.alias("t1")
  .join(df2.alias("t2"), $"t1.client" === $"t2.client" and $"t1.rk1" =!= $"t2.rk1", "inner")
  .withColumn("dt24", (unix_timestamp($"t1.dt") - unix_timestamp($"t2.dt")) / 3600)
  .where("dt24 > -24 and dt24 < 24")
  .select($"t1.client", $"t1.dt")
  .show(false)

Result:

+------+-------------------+
|client|dt                 |
+------+-------------------+
|C1    |2018-11-22 11:29:43|
|C1    |2018-11-21 18:29:43|
|C2    |2018-11-19 09:29:43|
|C2    |2018-11-18 10:29:43|
+------+-------------------+
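
To see why only C1 and C2 survive the dt24 filter, it can help to work out the gap between the two most recent timestamps of each client that has more than one row. The following is a small standalone check in plain Scala (no Spark involved). The timestamps are copied by hand from the sample data, and the names GapCheck, lastTwo and gapHours are made up for this illustration.

```scala
import java.time.{Duration, LocalDateTime}

object GapCheck {
  // (client, second most recent, most recent), copied by hand from the sample data.
  // Only clients with more than one row are listed.
  val lastTwo: Seq[(String, LocalDateTime, LocalDateTime)] = Seq(
    ("C1", LocalDateTime.of(2018, 11, 21, 18, 29, 43), LocalDateTime.of(2018, 11, 22, 11, 29, 43)),
    ("C2", LocalDateTime.of(2018, 11, 18, 10, 29, 43), LocalDateTime.of(2018, 11, 19, 9, 29, 43)),
    ("C6", LocalDateTime.of(2018, 11, 20, 13, 29, 43), LocalDateTime.of(2018, 11, 21, 14, 29, 43))
  )

  // Elapsed whole hours between two timestamps.
  def gapHours(older: LocalDateTime, newer: LocalDateTime): Long =
    Duration.between(older, newer).toHours

  def main(args: Array[String]): Unit =
    lastTwo.foreach { case (client, older, newer) =>
      val hours = gapHours(older, newer)
      println(s"$client: $hours hours apart, kept = ${hours < 24}")
    }
}
```

The gaps come out as 17 hours for C1, 23 hours for C2 and 25 hours for C6, so C6 is dropped even though it has two rows on consecutive days.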

Answer 1: (score: 0)

I have a solution for this scenario:

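  import java.sql.Timestamp
  import org.apache.spark.sql.functions.{col, collect_list, explode, udf}

  val milliSecForADay = 24 * 60 * 60 * 1000

  // Sort newest first; keep the two most recent timestamps only if they are
  // less than 24 hours apart, otherwise return nothing for this client.
  val filterDatesUDF = udf { arr: Seq[Timestamp] =>
    arr.sortWith(_ after _).toList match {
      case last :: secondLast :: _ if (last.getTime - secondLast.getTime) < milliSecForADay => Array(secondLast, last)
      case _ => Array.empty[Timestamp]
    }
  }

  val finalDF = df.groupBy("client")
    .agg(collect_list("date").as("dates"))
    .select(col("client"), explode(filterDatesUDF(col("dates"))).as("date"))

  // show() returns Unit, so call it separately to keep finalDF a DataFrame
  finalDF.show()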

In this solution, I first group the data by client and then use a user-defined function (udf) to process the timestamps collected for each client.
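
The per-client rule inside the UDF can be tried out on its own, without Spark, which makes it easier to reason about. Below is a minimal sketch in plain Scala that applies the same sort-and-match logic to ordinary sequences. The object and method names are hypothetical, and the inputs are a few timestamps taken from the sample data.

```scala
import java.sql.Timestamp

object LastTwoWithinDay {
  val milliSecForADay: Long = 24L * 60 * 60 * 1000

  // Same rule as the UDF body: sort newest first, then keep the two newest
  // timestamps only if they are less than 24 hours apart.
  def lastTwoWithinDay(dates: Seq[Timestamp]): Seq[Timestamp] =
    dates.sortWith(_ after _).toList match {
      case last :: secondLast :: _ if last.getTime - secondLast.getTime < milliSecForADay =>
        Seq(secondLast, last)
      case _ => Seq.empty
    }

  def main(args: Array[String]): Unit = {
    // A few C1 rows (deliberately unsorted) and both C6 rows from the sample.
    val c1 = Seq("2018-11-08 11:29:43", "2018-11-22 11:29:43", "2018-11-21 18:29:43")
      .map(s => Timestamp.valueOf(s))
    val c6 = Seq("2018-11-20 13:29:43", "2018-11-21 14:29:43")
      .map(s => Timestamp.valueOf(s))

    println(lastTwoWithinDay(c1).size) // 2: the 21-NOV and 22-NOV rows, 17 hours apart
    println(lastTwoWithinDay(c6).size) // 0: the two rows are 25 hours apart
  }
}
```

A client with a single row falls through to the second case and yields an empty sequence, which is why explode produces no output rows for it.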

This assumes the date column is already of type Timestamp (which I suspect may not be the case). If your date column is of type String, convert it from String to Timestamp before running the solution above.
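
A minimal sketch of one way to do such a String-to-Timestamp conversion, using Spark's built-in to_timestamp function with the pattern dd-MMM-yy HH.mm.ss (the same pattern another answer on this page uses). This is an illustration under assumptions, not necessarily the snippet the answer originally had in mind: the object name, the helper withTimestampDate and the local SparkSession setup are all made up for the example, and how the upper-case month abbreviation (NOV) is parsed may depend on your Spark version and locale.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, to_timestamp}

object ConvertDateColumn {
  // Replace the String column "date" with a Timestamp column of the same name.
  // Pattern assumed from the sample data, e.g. "21-NOV-18 18.29.43".
  def withTimestampDate(df: DataFrame): DataFrame =
    df.withColumn("date", to_timestamp(col("date"), "dd-MMM-yy HH.mm.ss"))

  def main(args: Array[String]): Unit = {
    // Local session for demonstration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("convert-date-column")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("C1", "21-NOV-18 18.29.43"), ("C1", "22-NOV-18 11.29.43"))
      .toDF("client", "date")

    // The schema should now report the "date" column as a timestamp.
    withTimestampDate(df).printSchema()

    spark.stop()
  }
}
```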

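Answer 2: (score: 0)

With window functions you can find the next/previous date for each row, and then filter out the rows whose gap to the neighbouring row is greater than 24 hours.

Data preparation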

val df = Seq(("C1", "08-NOV-18 11.29.43"),
  ("C2", "09-NOV-18 13.29.43"),
  ("C2", "09-NOV-18 18.29.43"),
  ("C3", "11-NOV-18 19.29.43"),
  ("C1", "12-NOV-18 10.29.43"),
  ("C2", "13-NOV-18 09.29.43"),
  ("C4", "14-NOV-18 20.29.43"),
  ("C1", "15-NOV-18 11.29.43"),
  ("C5", "16-NOV-18 15.29.43"),
  ("C10", "17-NOV-18 19.29.43"),
  ("C1", "18-NOV-18 12.29.43"),
  ("C2", "18-NOV-18 10.29.43"),
  ("C2", "19-NOV-18 09.29.43"),
  ("C6", "20-NOV-18 13.29.43"),
  ("C6", "21-NOV-18 14.29.43"),
  ("C1", "21-NOV-18 18.29.43"),
  ("C1", "22-NOV-18 11.29.43"))
  .toDF("client", "dt")
  .withColumn("dt", to_timestamp($"dt", "dd-MMM-yy HH.mm.ss"))

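Processing code

import java.util.concurrent.TimeUnit
import org.apache.spark.sql.Column
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.LongType

// get next/prev dates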
val dateWindow = Window.partitionBy("client").orderBy("dt")
val withNextPrevDates = df
  .withColumn("previousDate", lag($"dt", 1).over(dateWindow))
  .withColumn("nextDate", lead($"dt", 1).over(dateWindow))

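// filter condition: less than 24 hours elapsed AND the two timestamps
// fall on consecutive calendar days (datediff === 1)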
val secondsInDay = TimeUnit.DAYS.toSeconds(1)
val dateDiffLessThanDay = (startTimeStamp: Column, endTimeStamp: Column) =>
  endTimeStamp.cast(LongType) - startTimeStamp.cast(LongType) < secondsInDay && datediff(endTimeStamp, startTimeStamp) === 1

// filter
val result = withNextPrevDates
  .where(dateDiffLessThanDay($"previousDate", $"dt") || dateDiffLessThanDay($"dt", $"nextDate"))
  .drop("previousDate", "nextDate")

Result

+------+-------------------+
|client|dt                 |
+------+-------------------+
|C1    |2018-11-21 18:29:43|
|C1    |2018-11-22 11:29:43|
|C2    |2018-11-18 10:29:43|
|C2    |2018-11-19 09:29:43|
+------+-------------------+
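
Note that the filter in this answer combines two different notions of "one day": elapsed time under 24 hours, and a calendar-day difference of exactly 1 (which is what datediff compares). The sketch below illustrates the difference in plain Scala, using the two C2 rows from 09-NOV in the sample data. The object and method names are hypothetical, and ChronoUnit.DAYS on the date parts is used here only as a stand-in for datediff.

```scala
import java.time.{Duration, LocalDateTime}
import java.time.temporal.ChronoUnit

object DayDiffVsElapsed {
  private val secondsInDay: Long = 24L * 60 * 60

  // Elapsed time strictly under 24 hours.
  def withinDay(start: LocalDateTime, end: LocalDateTime): Boolean =
    Duration.between(start, end).getSeconds < secondsInDay

  // Difference in calendar days, ignoring the time of day.
  def calendarDays(start: LocalDateTime, end: LocalDateTime): Long =
    ChronoUnit.DAYS.between(start.toLocalDate, end.toLocalDate)

  def main(args: Array[String]): Unit = {
    // C2 on 09-NOV-18: 13.29.43 and 18.29.43
    val a = LocalDateTime.of(2018, 11, 9, 13, 29, 43)
    val b = LocalDateTime.of(2018, 11, 9, 18, 29, 43)

    println(withinDay(a, b))    // true: only 5 hours apart
    println(calendarDays(a, b)) // 0: same calendar day, so "=== 1" rejects the pair
  }
}
```

So the datediff condition is what keeps the two C2 rows from 09-NOV out of the result. If you drop it, any pair of neighbouring rows less than 24 hours apart is kept, not only pairs on consecutive days, and this approach does not by itself restrict the output to each client's two most recent rows.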