For example, I would like to add a column holding the total quantity sold on each date.
Date Quantity
11/4/2017 20
11/4/2017 23
11/4/2017 12
11/5/2017 18
11/5/2017 12
Output with the new Column:
Date Quantity New_Column
11/4/2017 20 55
11/4/2017 23 55
11/4/2017 12 55
11/5/2017 18 30
11/5/2017 12 30
Answer 0 (score: 3)
Just use sum as a window function by specifying a WindowSpec:
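import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum
// Partition by Date so that every row receives the total for its own date
df.withColumn("New_Column", sum("Quantity").over(Window.partitionBy("Date"))).show()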
+---------+--------+----------+
|     Date|Quantity|New_Column|
+---------+--------+----------+
|11/5/2017|      18|        30|
|11/5/2017|      12|        30|
|11/4/2017|      20|        55|
|11/4/2017|      23|        55|
|11/4/2017|      12|        55|
+---------+--------+----------+
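For anyone who wants to try this end to end, here is a minimal self-contained sketch. It assumes a local SparkSession and rebuilds the sample data from the question by hand; the object name WindowSumExample and the helper withDailyTotal are made up for illustration and are not part of Spark.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

object WindowSumExample {
  // Hypothetical helper: adds New_Column with the per-date total of Quantity.
  // Unlike groupBy + agg, a window function keeps every input row.
  def withDailyTotal(df: DataFrame): DataFrame = {
    val byDate = Window.partitionBy("Date")
    df.withColumn("New_Column", sum("Quantity").over(byDate))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough for a quick experiment.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("window-sum-sketch")
      .getOrCreate()
    import spark.implicits._

    // Sample rows copied from the question; Date is kept as a plain string.
    val df = Seq(
      ("11/4/2017", 20),
      ("11/4/2017", 23),
      ("11/4/2017", 12),
      ("11/5/2017", 18),
      ("11/5/2017", 12)
    ).toDF("Date", "Quantity")

    withDailyTotal(df).show()
    spark.stop()
  }
}
```

Row order in the output is not guaranteed. If you only need one row per date, df.groupBy("Date").agg(sum("Quantity")) would be the simpler choice; the window version is useful when the original rows have to stay.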