Spark Dataframe - Write a new record for a change in VALUE within a particular KEY group

Date: 2017-11-19 04:44:24

Tags: scala apache-spark apache-spark-sql user-defined-functions

当" AMT"有变化时需要写一行特定" KEY"的列组。

例如:

Scenario 1: For KEY=2, the first change is from 90 to 20, so a record with value (20 - 90) needs to be written.
Similarly, the next change for the same key group is from 20 to 30.5, so another record with value (30.5 - 20) is needed.

Scenario 2: For KEY=1, there is only one record for this KEY group, so write it as-is.

Scenario 3: For KEY=3, the same AMT value appears twice, so write it only once.

How can this be implemented? With a window function, or with groupBy and agg?

Sample input data:

val DF1 = List((1,34.6),(2,90.0),(2,90.0),(2,20.0),(2,30.5),(3,89.0),(3,89.0)).toDF("KEY", "AMT")

DF1.show(false)
+-----+-------------------+
|KEY  |AMT                |
+-----+-------------------+
|1    |34.6               |
|2    |90.0               |
|2    |90.0               |
|2    |20.0               |----->[ 20.0 - 90.0 = -70.0 ]
|2    |30.5               |----->[ 30.5 - 20.0 =  10.5 ]
|3    |89.0               |
|3    |89.0               |
+-----+-------------------+

Expected output:

scala> df2.show()
+----+--------------------+
|KEY | AMT                |
+----+--------------------+
|  1 |       34.6         |-----> As Is 
|  2 |       -70.0        |----->[ 20.0 - 90.0 = -70.0 ]
|  2 |       10.5         |----->[ 30.5 - 20.0 =  10.5 ]
|  3 |       89.0         |-----> As Is, with one record only
+----+--------------------+
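
For reference, one way to combine both ideas (a lag over a per-KEY window plus a groupBy/agg for the single-value groups), sketched minimally here. It assumes rows of a KEY group keep their input order, captured via monotonically_increasing_id(); the names byKey, withDiff, changed and asIs are purely illustrative:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// Order rows within each KEY group by their original position
val byKey = Window.partitionBy("KEY").orderBy("rowNo")

val withDiff = DF1
  .withColumn("rowNo", monotonically_increasing_id())
  .withColumn("diff", $"AMT" - lag("AMT", 1).over(byKey))

// Groups with more than one distinct AMT: keep only the rows where AMT changed, writing the difference
val changed = withDiff.filter($"diff".isNotNull && $"diff" =!= 0.0).select($"KEY", $"diff".as("AMT"))

// Groups with a single distinct AMT: write one row with the AMT as-is
val asIs = withDiff.groupBy("KEY")
  .agg(countDistinct("AMT").as("n"), first("AMT").as("AMT"))
  .filter($"n" === 1).select("KEY", "AMT")

asIs.union(changed).orderBy("KEY").show(false)

The groupBy/countDistinct part covers Scenarios 2 and 3 (single-value groups written as-is, duplicates collapsed to one row), while the lag window covers Scenario 1.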

2 Answers:

Answer 0 (score: 0):

I tried to solve it in pyspark rather than scala.

from pyspark.sql.functions import lag
from pyspark.sql.window import Window

# Window over each KEY group (rows within a group are assumed to keep their input order)
w1 = Window.partitionBy("KEY").orderBy("KEY")

DF4 = spark.createDataFrame([(1,34.6),(2,90.0),(2,90.0),(2,20.0),(2,30.5),(3,89.0),(3,89.0)], ["KEY", "AMT"])
DF4.createOrReplaceTempView('keyamt')

# Keys with only one distinct AMT are written out as-is
DF7 = spark.sql('select distinct KEY, AMT from keyamt where KEY in (select KEY from (select KEY, count(distinct AMT) dist from keyamt group by KEY) t where t.dist = 1)')

# For the remaining keys, compare each AMT with the previous AMT in the group
DF8 = DF4.join(DF7, DF4['KEY'] == DF7['KEY'], 'leftanti').withColumn('new_col', lag('AMT', 1).over(w1).cast('double'))
DF9 = DF8.withColumn('new_col1', DF8['AMT'] - DF8['new_col'].cast('double')).na.fill(0)

# Keep only the rows where AMT actually changed, then add back the single-value keys
DF9.filter(DF9['new_col1'] != 0).select(DF9['KEY'], DF9['new_col1']).union(DF7).orderBy('KEY').show()

Output:

+---+--------+
|KEY|new_col1|
+---+--------+
|  1|    34.6|
|  2|   -70.0|
|  2|    10.5|
|  3|    89.0|
+---+--------+

Answer 1 (score: 0):

You can implement your logic with a window function, combining the when and lead functions and using monotonically_increasing_id() for ordering, with the withColumn and select APIs as below:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._

// monotonically_increasing_id() supplies an ordering column so the window preserves the original row order
val windowSpec = Window.partitionBy("KEY").orderBy("rowNo")
val tempdf = DF1.withColumn("rowNo", monotonically_increasing_id())

// If there is no next row, or the next AMT is unchanged, keep AMT; otherwise write the difference to the next AMT
tempdf.select($"KEY", when(lead("AMT", 1).over(windowSpec).isNull || (lead("AMT", 1).over(windowSpec) - $"AMT") === lit(0.0), $"AMT")
  .otherwise(lead("AMT", 1).over(windowSpec) - $"AMT").as("AMT")).show(false)