Team, I am working with the following dataframe:
df
client | date
C1 |08-NOV-18 11.29.43
C2 |09-NOV-18 13.29.43
C2 |09-NOV-18 18.29.43
C3 |11-NOV-18 19.29.43
C1 |12-NOV-18 10.29.43
C2 |13-NOV-18 09.29.43
C4 |14-NOV-18 20.29.43
C1 |15-NOV-18 11.29.43
C5 |16-NOV-18 15.29.43
C10 |17-NOV-18 19.29.43
C1 |18-NOV-18 12.29.43
C2 |18-NOV-18 10.29.43
C2 |19-NOV-18 09.29.43
C6 |20-NOV-18 13.29.43
C6 |21-NOV-18 14.29.43
C1 |21-NOV-18 18.29.43
C1 |22-NOV-18 11.29.43
My goal is to filter this dataframe and get a new dataframe containing the last two occurrences of each client, provided those two occurrences are less than 24 hours apart. For this example, the result should be:
client |date
C2 |18-NOV-18 10.29.43
C2 |19-NOV-18 09.29.43
C1 |21-NOV-18 18.29.43
C1 |22-NOV-18 11.29.43
Any help, please!
Answer 0 (score: 1)
Use window functions. Check this out:
import org.apache.spark.sql.functions._
import spark.implicits._   // for the 'dt symbol syntax and toDF

val df = Seq(("C1","08-NOV-18 11.29.43"),
("C2","09-NOV-18 13.29.43"),
("C2","09-NOV-18 18.29.43"),
("C3","11-NOV-18 19.29.43"),
("C1","12-NOV-18 10.29.43"),
("C2","13-NOV-18 09.29.43"),
("C4","14-NOV-18 20.29.43"),
("C1","15-NOV-18 11.29.43"),
("C5","16-NOV-18 15.29.43"),
("C10","17-NOV-18 19.29.43"),
("C1","18-NOV-18 12.29.43"),
("C2","18-NOV-18 10.29.43"),
("C2","19-NOV-18 09.29.43"),
("C6","20-NOV-18 13.29.43"),
("C6","21-NOV-18 14.29.43"),
("C1","21-NOV-18 18.29.43"),
("C1","22-NOV-18 11.29.43")).toDF("client","dt").withColumn("dt",from_unixtime(unix_timestamp('dt,"dd-MMM-yy HH.mm.ss"),"yyyy-MM-dd HH:mm:ss"))
df.createOrReplaceTempView("tbl")
// rank each client's rows by dt (descending) and keep the two most recent rows
// of every client that appears more than once
val df2 = spark.sql("""
  select * from (
    select client, dt, count(*) over(partition by client) cnt,
           rank() over(partition by client order by dt desc) rk1
    from tbl ) t
  where cnt > 1 and rk1 in (1,2)
""")
df2.alias("t1").join(df2.alias("t2"), $"t1.client" === $"t2.client" and $"t1.rk1" =!= $"t2.rk1" , "inner" ).withColumn("dt24",(unix_timestamp($"t1.dt") - unix_timestamp($"t2.dt") )/ 3600 ).where("dt24 > -24 and dt24 < 24").select($"t1.client", $"t1.dt").show(false)
Result:
+------+-------------------+
|client|dt |
+------+-------------------+
|C1 |2018-11-22 11:29:43|
|C1 |2018-11-21 18:29:43|
|C2 |2018-11-19 09:29:43|
|C2 |2018-11-18 10:29:43|
+------+-------------------+
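For reference, here is a rough DataFrame-API sketch of the same window logic (not part of the original answer; it assumes the df and imports defined above, and the names byClient, byClientDesc, lastTwo, cnt, rk1 and dt24 are only illustrative):

import org.apache.spark.sql.expressions.Window

val byClient = Window.partitionBy("client")
val byClientDesc = Window.partitionBy("client").orderBy($"dt".desc)

val lastTwo = df
  .withColumn("cnt", count(lit(1)).over(byClient))   // occurrences per client
  .withColumn("rk1", rank().over(byClientDesc))      // 1 = latest, 2 = second latest
  .where($"cnt" > 1 && $"rk1".isin(1, 2))

lastTwo.alias("t1")
  .join(lastTwo.alias("t2"), $"t1.client" === $"t2.client" && $"t1.rk1" =!= $"t2.rk1")
  .withColumn("dt24", (unix_timestamp($"t1.dt") - unix_timestamp($"t2.dt")) / 3600)
  .where("dt24 > -24 and dt24 < 24")
  .select($"t1.client", $"t1.dt")
  .show(false)

It should produce the same four rows as the SQL version above.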
Answer 1 (score: 0)
I have a solution for this case:
import java.sql.Timestamp
import org.apache.spark.sql.functions._

val milliSecForADay = 24 * 60 * 60 * 1000

// keep the two most recent timestamps of a client, but only if they are less than 24 hours apart
val filterDatesUDF = udf { arr: scala.collection.mutable.WrappedArray[Timestamp] =>
  arr.sortWith(_ after _).toList match {
    case last :: secondLast :: _ if (last.getTime - secondLast.getTime) < milliSecForADay => Array(secondLast, last)
    case _ => Array.empty[Timestamp]
  }
}
val finalDF = df.groupBy("client")
  .agg(collect_list("date").as("dates"))
  .select(col("client"), explode(filterDatesUDF(col("dates"))).as("date"))

finalDF.show()
In this solution, I first group the data by client (groupBy with collect_list) and then use a user-defined function, or udf, to process the timestamps collected for each client. This is done under the assumption that the date column is already of Timestamp type (which I suspect may not be the case). If your date column is of String type, add a conversion step before the above solution to change the date column from String to Timestamp.
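A minimal sketch of that conversion, assuming the date strings follow the dd-MMM-yy HH.mm.ss format shown in the question:

// convert the String column "date" to a proper Timestamp column (format assumed from the sample data)
val dfWithTimestamp = df.withColumn("date", to_timestamp(col("date"), "dd-MMM-yy HH.mm.ss"))

Then use dfWithTimestamp in place of df in the solution above.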
Answer 2 (score: 0)
With window functions you can find the next/previous date for each row, and then filter out the rows where the gap between neighbouring occurrences is greater than 24 hours.
Data preparation
val df = Seq(("C1", "08-NOV-18 11.29.43"),
("C2", "09-NOV-18 13.29.43"),
("C2", "09-NOV-18 18.29.43"),
("C3", "11-NOV-18 19.29.43"),
("C1", "12-NOV-18 10.29.43"),
("C2", "13-NOV-18 09.29.43"),
("C4", "14-NOV-18 20.29.43"),
("C1", "15-NOV-18 11.29.43"),
("C5", "16-NOV-18 15.29.43"),
("C10", "17-NOV-18 19.29.43"),
("C1", "18-NOV-18 12.29.43"),
("C2", "18-NOV-18 10.29.43"),
("C2", "19-NOV-18 09.29.43"),
("C6", "20-NOV-18 13.29.43"),
("C6", "21-NOV-18 14.29.43"),
("C1", "21-NOV-18 18.29.43"),
("C1", "22-NOV-18 11.29.43"))
.toDF("client", "dt")
.withColumn("dt", to_timestamp($"dt", "dd-MMM-yy HH.mm.ss"))
Processing code
import java.util.concurrent.TimeUnit
import org.apache.spark.sql.Column
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.types.LongType

// get next/prev dates
val dateWindow = Window.partitionBy("client").orderBy("dt")
val withNextPrevDates = df
  .withColumn("previousDate", lag($"dt", 1).over(dateWindow))
  .withColumn("nextDate", lead($"dt", 1).over(dateWindow))

// predicate for the filter: the two timestamps are less than 24 hours apart and fall on consecutive days
val secondsInDay = TimeUnit.DAYS.toSeconds(1)
val dateDiffLessThanDay = (startTimeStamp: Column, endTimeStamp: Column) =>
  endTimeStamp.cast(LongType) - startTimeStamp.cast(LongType) < secondsInDay && datediff(endTimeStamp, startTimeStamp) === 1

// keep rows whose previous or next occurrence is within 24 hours
val result = withNextPrevDates
  .where(dateDiffLessThanDay($"previousDate", $"dt") || dateDiffLessThanDay($"dt", $"nextDate"))
  .drop("previousDate", "nextDate")

result.show(false)
Result
+------+-------------------+
|client|dt |
+------+-------------------+
|C1 |2018-11-21 18:29:43|
|C1 |2018-11-22 11:29:43|
|C2 |2018-11-18 10:29:43|
|C2 |2018-11-19 09:29:43|
+------+-------------------+