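I have a DataFrame with the columns CLIENT_ID, timestamp and all, and I need to derive a new column named "Required".
The Required column is the result of combining the "all" value of the current row with the value of the previous row, based on the difference between the list elements of the two rows.
However, the result computed for the current row must then be used as the "previous row" value when computing the next row. How can I get the previous row's computed value in the next row using Spark Scala? I tried to implement this with the UDF below.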
+---------+-------------------+--------------------+-------------------------------------+
|CLIENT_ID|timestamp          |all                 |Required                             |
+---------+-------------------+--------------------+-------------------------------------+
|69415092 |2002-03-15 00:00:00|[[06,718], [07,718]]|[[06,718], [07,718]]                 |
|69415092 |2002-03-19 00:00:00|[[10,718]]          |[[06,718], [07,718],[10,718]]        |
|69415092 |2002-03-22 00:00:00|[[06,223],[12,718]] |[[07,718],[10,718],[12,718],[06,223]]|
|69415092 |2002-11-16 00:00:00|[[12,386]]          |[[07,718],[10,718],[06,223],[12,386]]|
+---------+-------------------+--------------------+-------------------------------------+
However, the previously computed value is not carried forward in the resulting column.
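import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag, udf}

val window = Window.partitionBy("CLIENT_ID").orderBy("timestamp")

// s1 = current row's "all", s2 = previous row's "all" (null on the first row of a partition).
def fun1(s1: Seq[String], s2: Seq[String]): Seq[String] = {
  // Elements of the previous row that do not appear in the current row.
  val un = Option(s2).getOrElse(Seq.empty[String]).diff(s1)
  // Previous-only elements first, then the current row's elements.
  un ++ s1
}

val funUdf = udf(fun1 _)

// Note: lag reads the previous row's original "all" value, not its computed "Required" value.
val uniondf = df3
  .withColumn("Required", funUdf(col("all"), lag(col("all"), 1).over(window)))
  .select("CLIENT_ID", "timestamp", "all", "Required")
uniondf.show(false)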
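
The lag-based attempt cannot carry the result forward, because a window function only sees the input columns of other rows, never the value being computed for them. One possible way around this is to treat the rows of each client as an ordered sequence and fold over it. The following is a minimal sketch in plain Scala, not a definitive implementation. It assumes that each list element is a (key, value) pair and that the intended rule, inferred from the sample table, is "drop earlier entries whose key reappears, then append the new entries". The names RunningRequired, Entry, merge and running are hypothetical. The order of newly appended entries may differ from the sample (row 3 of the table lists 12 before 06).

```scala
// Hypothetical sketch: each entry is a (key, value) pair such as ("06", "718").
object RunningRequired {
  type Entry = (String, String)

  // Merge one row's "all" entries into the previously computed Required value:
  // earlier entries whose key reappears are dropped, then the new entries are appended.
  def merge(prev: Seq[Entry], current: Seq[Entry]): Seq[Entry] = {
    val updatedKeys = current.map(_._1).toSet
    prev.filterNot(e => updatedKeys.contains(e._1)) ++ current
  }

  // Running result for one client's rows, which must already be sorted by timestamp.
  // scanLeft feeds each computed result into the next step; tail drops the empty seed.
  def running(rows: Seq[Seq[Entry]]): Seq[Seq[Entry]] =
    rows.scanLeft(Seq.empty[Entry])(merge).tail
}
```

In Spark, a function like running could presumably be applied once per client, for example by grouping a typed Dataset with groupByKey, sorting each group by timestamp inside mapGroups or flatMapGroups, and emitting one output row per input row. This assumes that all rows of a single client fit in memory on one executor.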