I have a DataFrame like the following:
+---+---------------------+---------------------+
|id |initDate |endDate |
+---+---------------------+---------------------+
|138|2016-04-15 00:00:00.0|2016-04-28 00:00:00.0|
|138|2016-05-09 00:00:00.0|2016-05-23 00:00:00.0|
|138|2016-06-04 00:00:00.0|2016-06-18 00:00:00.0|
|138|2016-06-18 00:00:00.0|2016-07-02 00:00:00.0|
|138|2016-07-09 00:00:00.0|2016-07-23 00:00:00.0|
|138|2016-07-27 00:00:00.0|2016-08-10 00:00:00.0|
|138|2016-08-18 00:00:00.0|2016-09-01 00:00:00.0|
|138|2016-09-13 00:00:00.0|2016-09-27 00:00:00.0|
|138|2016-10-04 00:00:00.0|null |
+---+---------------------+---------------------+
The rows are ordered by id and then by the initDate column in ascending order. Both the initDate and endDate columns are of timestamp type. For illustration purposes, I am only showing the records belonging to a single id value.
My goal is to add a new column that shows, for each row, the difference in days between that row's initDate and the endDate of the previous row within the same id. If there is no previous row, the value should be -1. The output should look like this:
+---+---------------------+---------------------+----------+
|id |initDate |endDate |difference|
+---+---------------------+---------------------+----------+
|138|2016-04-15 00:00:00.0|2016-04-28 00:00:00.0|-1 |
|138|2016-05-09 00:00:00.0|2016-05-23 00:00:00.0|11 |
|138|2016-06-04 00:00:00.0|2016-06-18 00:00:00.0|12 |
|138|2016-06-18 00:00:00.0|2016-07-02 00:00:00.0|0 |
|138|2016-07-09 00:00:00.0|2016-07-23 00:00:00.0|7 |
|138|2016-07-27 00:00:00.0|2016-08-10 00:00:00.0|4 |
|138|2016-08-18 00:00:00.0|2016-09-01 00:00:00.0|8 |
|138|2016-09-13 00:00:00.0|2016-09-27 00:00:00.0|12 |
|138|2016-10-04 00:00:00.0|null |7 |
+---+---------------------+---------------------+----------+
I am thinking of using a window function to partition the records by id, but I don't know how to proceed with the next steps.
Answer 0 (score: 6)
Try:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._

// lag() picks up the previous row's endDate within each id partition;
// datediff() (rather than date_sub) returns the gap in whole days
val w = Window.partitionBy("id").orderBy("initDate")
df.withColumn("difference", datediff($"initDate", lag($"endDate", 1).over(w)))
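For a quick sanity check, here is a minimal, self-contained sketch (assuming a SparkSession named spark; the coalesce with lit(-1) is an addition to cover the question's requirement of -1 for rows with no predecessor):

// Build toy data; the null endDate mirrors the last row above
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._

val df = Seq(
  (138, "2016-04-15 00:00:00", "2016-04-28 00:00:00"),
  (138, "2016-05-09 00:00:00", "2016-05-23 00:00:00"),
  (138, "2016-10-04 00:00:00", null)
).toDF("id", "initDate", "endDate")
 .select($"id", $"initDate".cast("timestamp"), $"endDate".cast("timestamp"))

val w = Window.partitionBy("id").orderBy("initDate")
df.withColumn("difference",
    coalesce(datediff($"initDate", lag($"endDate", 1).over(w)), lit(-1)))
  .show(false)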
Answer 1 (score: 4)
Thanks to the hint from @lostInOverflow, I came up with the following solution:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._

// Within each id partition, order by initDate and pull the previous
// row's endDate forward with lag()
val w = Window.partitionBy("id").orderBy("initDate")
val previousEnd = lag($"endDate", 1).over(w)

filteredDF.withColumn("prev", previousEnd)
  .withColumn("difference", datediff($"initDate", $"prev"))
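This still leaves null in the difference column for the first row of each partition and keeps the helper prev column around. A small follow-up (the coalesce with lit(-1) is my addition to match the -1 the question asks for, not part of the answer above) could be:

filteredDF.withColumn("prev", previousEnd)
  // Substitute -1 when there is no previous row, then drop the helper column
  .withColumn("difference", coalesce(datediff($"initDate", $"prev"), lit(-1)))
  .drop("prev")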
Answer 2 (score: 0)
Just a supplement to the good answers above, in case anyone wants to try this with Spark SQL or on Hive.
select tab.tran_id,
       tab.init_date,
       tab.end_date,
       coalesce(tab.day_diff, -1) as day_difference
from (select *,
             -- in Hive/Spark SQL, datediff(enddate, startdate) returns the gap in days
             datediff(init_date,
                      lag(end_date, 1) over (partition by tran_id order by init_date)) as day_diff
      from your_table) tab;
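To run the same statement from Spark instead of the Hive CLI, one option is to register the data as a temporary view first. A sketch, assuming the snake_case table and column names from the answer above (which differ from the DataFrame's id/initDate/endDate):

// Expose the data under the answer's table name and run the query via Spark SQL
df.createOrReplaceTempView("your_table")
spark.sql("""
  select tab.tran_id, tab.init_date, tab.end_date,
         coalesce(tab.day_diff, -1) as day_difference
  from (select *,
               datediff(init_date,
                        lag(end_date, 1) over (partition by tran_id order by init_date)) as day_diff
        from your_table) tab
""").show(false)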