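Window analytics in a Spark DataFrame, excluding the current row's value

Time: 2018-03-16 13:36:32

Tags: apache-spark dataframe windowing

Using window analytic functions, I need to get the max date, excluding the current row's column value.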

Account,Instrument,TrDate
1,A,3/1/2018
1,A,3/2/2018
1,B,3/3/2018
1,B,3/6/2018
1,B,3/6/2018
1,B,3/7/2018
2,A,2/7/2018
2,A,2/5/2018
2,B,2/15/2018
2,B,3/6/2018

Expected transformed DF

Account,Instrument,TrDate,MaxInDate,ExcInstrMaxDate
1,A,3/1/2018,3/2/2018,3/7/2018
1,A,3/2/2018,3/2/2018,3/7/2018
1,B,3/3/2018,3/7/2018,3/2/2018
1,B,3/6/2018,3/7/2018,3/2/2018
1,B,3/6/2018,3/7/2018,3/2/2018
1,B,3/7/2018,3/7/2018,3/2/2018
2,A,2/7/2018,2/7/2018,3/6/2018
2,A,2/5/2018,2/7/2018,3/6/2018
2,B,2/15/2018,3/6/2018,2/7/2018
2,B,3/6/2018,3/6/2018,2/7/2018

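Calculating ExcInstrMaxDate

Get the max TrDate within the Account window, excluding the current row's Instrument. That is, for Account 1, Instrument A, ExcInstrMaxDate is the max date of Account 1 with Instrument A filtered out.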
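To make the rule concrete, here is a minimal sketch in plain Scala collections (no Spark), using the sample rows above. The `Trade` case class, the `ExcInstrMaxDateSketch` object and its helpers are hypothetical names introduced only for illustration, and the dates are assumed to be in `M/d/yyyy` format.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Hypothetical row type, used only for this illustration.
final case class Trade(account: Int, instrument: String, trDate: LocalDate)

object ExcInstrMaxDateSketch {
  // Assumption: the sample dates are month/day/year without zero padding.
  private val fmt = DateTimeFormatter.ofPattern("M/d/yyyy")

  def d(s: String): LocalDate = LocalDate.parse(s, fmt)

  // The sample rows from the question.
  val trades: Seq[Trade] = Seq(
    Trade(1, "A", d("3/1/2018")), Trade(1, "A", d("3/2/2018")),
    Trade(1, "B", d("3/3/2018")), Trade(1, "B", d("3/6/2018")),
    Trade(1, "B", d("3/6/2018")), Trade(1, "B", d("3/7/2018")),
    Trade(2, "A", d("2/7/2018")), Trade(2, "A", d("2/5/2018")),
    Trade(2, "B", d("2/15/2018")), Trade(2, "B", d("3/6/2018"))
  )

  // Latest TrDate in the same account among rows of every OTHER instrument.
  // Returns None when the account holds no other instrument.
  def excInstrMaxDate(rows: Seq[Trade], account: Int, instrument: String): Option[LocalDate] = {
    val others = rows
      .filter(r => r.account == account && r.instrument != instrument)
      .map(_.trDate)
    if (others.isEmpty) None else Some(others.maxBy(_.toEpochDay))
  }
}
```

For example, `ExcInstrMaxDateSketch.excInstrMaxDate(ExcInstrMaxDateSketch.trades, 1, "A")` looks only at the Instrument B rows of Account 1.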

1 Answer:

Answer 0: (score: 0)

All you need is two window functions: one for MaxInDate and another for ExcInstrMaxDate.

import org.apache.spark.sql.expressions._
def windowSpec1 = Window.partitionBy("Account", "Instrument")
def windowSpec2 = Window.partitionBy("Account")

You also need a udf function that removes the current MaxInDate from the list of MaxInDate values collected per Account:

import org.apache.spark.sql.functions._
def removeCurrentMax = udf((currentMax: String, listMax: Seq[String])=> listMax.filterNot(_ == currentMax))

Then use the window functions and the udf function together:

df.withColumn("MaxInDate", max("TrDate").over(windowSpec1))
  .withColumn("ExcInstrMaxDate", removeCurrentMax(col("MaxInDate"), collect_set("MaxInDate").over(windowSpec2)))
  .show(false)

You should get the following (note that ExcInstrMaxDate comes back as an array column):

+-------+----------+---------+---------+---------------+
|Account|Instrument|TrDate   |MaxInDate|ExcInstrMaxDate|
+-------+----------+---------+---------+---------------+
|1      |A         |3/1/2018 |3/2/2018 |[3/7/2018]     |
|1      |A         |3/2/2018 |3/2/2018 |[3/7/2018]     |
|1      |B         |3/3/2018 |3/7/2018 |[3/2/2018]     |
|1      |B         |3/6/2018 |3/7/2018 |[3/2/2018]     |
|1      |B         |3/6/2018 |3/7/2018 |[3/2/2018]     |
|1      |B         |3/7/2018 |3/7/2018 |[3/2/2018]     |
|2      |A         |2/7/2018 |2/7/2018 |[3/6/2018]     |
|2      |A         |2/5/2018 |2/7/2018 |[3/6/2018]     |
|2      |B         |2/15/2018|3/6/2018 |[2/7/2018]     |
|2      |B         |3/6/2018 |3/6/2018 |[2/7/2018]     |
+-------+----------+---------+---------+---------------+
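The output above holds ExcInstrMaxDate as an array, which would contain several dates if an account had more than two instruments. As an alternative to the udf (this is a different technique, not the one used in this answer), the sketch below aggregates the per-instrument maxima and joins them back to get a single date. It assumes Spark 2.2 or later for `to_date(column, format)` and an `M/d/yyyy` date format; the `ExcInstrMaxDateJoin` object, the `addMaxDates` method and the intermediate column names are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max, to_date}

// Hypothetical helper object; the names are illustrative only.
object ExcInstrMaxDateJoin {

  // Expects columns Account, Instrument and a DateType TrDate.
  def addMaxDates(df: DataFrame): DataFrame = {
    // One row per (Account, Instrument) with that instrument's latest date.
    val perInstr = df.groupBy("Account", "Instrument")
      .agg(max("TrDate").as("InstrMax"))

    // The same rows renamed, so the self-join has no ambiguous column names.
    val other = perInstr.select(
      col("Account"),
      col("Instrument").as("OtherInstrument"),
      col("InstrMax").as("OtherMax"))

    // For each instrument, the latest date among the account's other instruments.
    val excMax = perInstr.join(other, Seq("Account"))
      .where(col("Instrument") =!= col("OtherInstrument"))
      .groupBy("Account", "Instrument")
      .agg(max("OtherMax").as("ExcInstrMaxDate"))

    // The left join keeps accounts with a single instrument (ExcInstrMaxDate is null).
    df.withColumn("MaxInDate",
        max("TrDate").over(Window.partitionBy("Account", "Instrument")))
      .join(excMax, Seq("Account", "Instrument"), "left")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("exc-instr-max").getOrCreate()
    import spark.implicits._

    // The sample rows from the question, parsed into real dates.
    val df = Seq(
      (1, "A", "3/1/2018"), (1, "A", "3/2/2018"),
      (1, "B", "3/3/2018"), (1, "B", "3/6/2018"),
      (1, "B", "3/6/2018"), (1, "B", "3/7/2018"),
      (2, "A", "2/7/2018"), (2, "A", "2/5/2018"),
      (2, "B", "2/15/2018"), (2, "B", "3/6/2018")
    ).toDF("Account", "Instrument", "TrDate")
      .withColumn("TrDate", to_date(col("TrDate"), "M/d/yyyy"))

    addMaxDates(df).show(false)
    spark.stop()
  }
}
```

The trade-off is an extra aggregation and join in exchange for a scalar DateType column, and the result no longer depends on two instruments having different maxima.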

I hope the answer is helpful.

Note that I used TrDate as StringType, so max compares the values as strings rather than as dates.
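Because of that, string ordering can disagree with calendar ordering for `M/d/yyyy` values. The plain-Scala sketch below illustrates the pitfall with two dates taken from the sample data; the `StringDatePitfall` object is a hypothetical name. In Spark, converting the column first, for instance with `to_date(col("TrDate"), "M/d/yyyy")` (available from Spark 2.2, and assuming that format matches the data), should avoid the problem.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Hypothetical object, used only to illustrate string vs. date ordering.
object StringDatePitfall {
  // Assumption: dates are month/day/year without zero padding.
  private val fmt = DateTimeFormatter.ofPattern("M/d/yyyy")

  val raw: Seq[String] = Seq("2/7/2018", "2/15/2018")

  // Lexicographic comparison: "2/7..." sorts after "2/1...", so this picks 2/7/2018.
  val maxAsString: String = raw.max

  // Parsing first compares real calendar dates, so this picks 15 February 2018.
  val maxAsDate: LocalDate = raw.map(LocalDate.parse(_, fmt)).maxBy(_.toEpochDay)
}
```

The sample data happens to give the expected maxima even as strings, but that is not guaranteed for other inputs.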