Using window analytic functions, I need to get the max date while excluding the current row's value for a column. Given this input:
Account,Instrument,TrDate
1,A,3/1/2018
1,A,3/2/2018
1,B,3/3/2018
1,B,3/6/2018
1,B,3/6/2018
1,B,3/7/2018
2,A,2/7/2018
2,A,2/5/2018
2,B,2/15/2018
2,B,3/6/2018
the expected output is:

Account,Instrument,TrDate,MaxInDate,ExcInstrMaxDate
1,A,3/1/2018,3/2/2018,3/7/2018
1,A,3/2/2018,3/2/2018,3/7/2018
1,B,3/3/2018,3/7/2018,3/2/2018
1,B,3/6/2018,3/7/2018,3/2/2018
1,B,3/6/2018,3/7/2018,3/2/2018
1,B,3/7/2018,3/7/2018,3/2/2018
2,A,2/7/2018,2/7/2018,3/6/2018
2,A,2/5/2018,2/7/2018,3/6/2018
2,B,2/15/2018,3/6/2018,2/7/2018
2,B,3/6/2018,3/6/2018,2/7/2018
Calculating ExcInstrMaxDate: get the max TrDate over the Account window, excluding that particular Instrument. That is, for Account 1, Instrument A, ExcInstrMaxDate is the max date for Account 1 after filtering out Instrument A.
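For reproducibility, the sample input above could be loaded into a DataFrame along these lines (a minimal sketch; the local SparkSession setup and the in-memory construction are my assumptions, the answer below just assumes a DataFrame named df):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("exc-instr-max").master("local[*]").getOrCreate()
import spark.implicits._

// the sample rows above, with every column kept as a string
val df = Seq(
  ("1", "A", "3/1/2018"), ("1", "A", "3/2/2018"),
  ("1", "B", "3/3/2018"), ("1", "B", "3/6/2018"),
  ("1", "B", "3/6/2018"), ("1", "B", "3/7/2018"),
  ("2", "A", "2/7/2018"), ("2", "A", "2/5/2018"),
  ("2", "B", "2/15/2018"), ("2", "B", "3/6/2018")
).toDF("Account", "Instrument", "TrDate")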
Answer (score: 0):
All you need is two Window functions: one for MaxInDate and the other for ExcInstrMaxDate.
import org.apache.spark.sql.expressions._

// window over each (Account, Instrument) pair, used to compute MaxInDate
def windowSpec1 = Window.partitionBy("Account", "Instrument")
// window over each Account, used to collect every MaxInDate in the account
def windowSpec2 = Window.partitionBy("Account")
You would also need a udf function to remove the current MaxInDate from the list of MaxInDate values collected over the Account group:
import org.apache.spark.sql.functions._

// drop this row's own MaxInDate from the set of all MaxInDate values in the account
def removeCurrentMax = udf((currentMax: String, listMax: Seq[String]) => listMax.filterNot(_ == currentMax))
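As a quick sanity check, the udf body is ordinary Scala collection filtering; with hypothetical values from Account 1 it behaves like this:

val listMax = Seq("3/2/2018", "3/7/2018")  // collect_set of MaxInDate for Account 1
val currentMax = "3/2/2018"                // this row's MaxInDate (Instrument A)
listMax.filterNot(_ == currentMax)         // Seq("3/7/2018"), i.e. Instrument B's max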
Then use both the Window functions and the udf function together:
// MaxInDate: max TrDate within each (Account, Instrument) group
df.withColumn("MaxInDate", max("TrDate").over(windowSpec1))
  // ExcInstrMaxDate: every MaxInDate in the account minus this row's own
  .withColumn("ExcInstrMaxDate", removeCurrentMax(col("MaxInDate"), collect_set("MaxInDate").over(windowSpec2)))
  .show(false)
which should give you:
+-------+----------+---------+---------+---------------+
|Account|Instrument|TrDate |MaxInDate|ExcInstrMaxDate|
+-------+----------+---------+---------+---------------+
|1 |A |3/1/2018 |3/2/2018 |[3/7/2018] |
|1 |A |3/2/2018 |3/2/2018 |[3/7/2018] |
|1 |B |3/3/2018 |3/7/2018 |[3/2/2018] |
|1 |B |3/6/2018 |3/7/2018 |[3/2/2018] |
|1 |B |3/6/2018 |3/7/2018 |[3/2/2018] |
|1 |B |3/7/2018 |3/7/2018 |[3/2/2018] |
|2 |A |2/7/2018 |2/7/2018 |[3/6/2018] |
|2 |A |2/5/2018 |2/7/2018 |[3/6/2018] |
|2 |B |2/15/2018|3/6/2018 |[2/7/2018] |
|2 |B |3/6/2018 |3/6/2018 |[2/7/2018] |
+-------+----------+---------+---------+---------------+
I hope the answer is helpful. Note that I have used TrDate as a StringType.
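One caveat on that StringType choice: max compares strings lexicographically, which happens to produce the right result for this sample but would misorder values such as "10/1/2018" vs "2/1/2018". A sketch of a safer variant, assuming the format is M/d/yyyy as the sample data suggests:

import org.apache.spark.sql.functions.{col, to_date}

// parse the strings into a DateType column so max orders chronologically
val dfDates = df.withColumn("TrDate", to_date(col("TrDate"), "M/d/yyyy"))
// then apply the same window/udf logic as above to dfDates

If a single date is preferred over the array in ExcInstrMaxDate, the filtered list could also be reduced (for example, by taking its maximum) inside the udf.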