输入:
val customers = sc.parallelize(List(("Alice", "2016-05-01", 50.00,4),
("Alice", "2016-05-03", 45.00,2),
("Alice", "2016-05-04", 55.00,4),
("Bob", "2016-05-01", 25.00,6),
("Bob", "2016-05-04", 29.00,7),
("Bob", "2016-05-06", 27.00,10))).
toDF("name", "date", "amountSpent","NumItems")
程序:
// Import the window functions.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// Create a window spec.
val wSpec1 = Window.partitionBy("name").orderBy("date").rowsBetween(-1, 1)
在此窗口规范中,数据由客户分区。每个客户的数据按日期排序。并且,窗口框架被定义为从-1(当前行之前的一行)开始并且结束于1(当前行之后的一行),在滑动窗口中总共3行。 问题是对列的列表进行基于窗口的求和。在这种情况下,他们是" amountSpent"," NumItems"。但问题可能有多达数百列。
以下是为每列进行基于窗口的求和的解决方案。但是,如何更有效地执行求和?因为我们不需要每次为每列找到滑动窗口行。
// Calculate the sum of spent
customers.withColumn("sumSpent",sum(customers("amountSpent")).over(wSpec1)).show()
+-----+----------+-----------+--------+--------+
| name| date|amountSpent|NumItems|sumSpent|
+-----+----------+-----------+--------+--------+
|Alice|2016-05-01| 50.0| 4| 95.0|
|Alice|2016-05-03| 45.0| 2| 150.0|
|Alice|2016-05-04| 55.0| 4| 100.0|
| Bob|2016-05-01| 25.0| 6| 54.0|
| Bob|2016-05-04| 29.0| 7| 81.0|
| Bob|2016-05-06| 27.0| 10| 56.0|
+-----+----------+-----------+--------+--------+
// Calculate the sum of items
customers.withColumn( "sumItems",
sum(customers("NumItems")).over(wSpec1) ).show()
+-----+----------+-----------+--------+--------+
| name| date|amountSpent|NumItems|sumItems|
+-----+----------+-----------+--------+--------+
|Alice|2016-05-01| 50.0| 4| 6|
|Alice|2016-05-03| 45.0| 2| 10|
|Alice|2016-05-04| 55.0| 4| 6|
| Bob|2016-05-01| 25.0| 6| 13|
| Bob|2016-05-04| 29.0| 7| 23|
| Bob|2016-05-06| 27.0| 10| 17|
+-----+----------+-----------+--------+--------+
答案 0 :(得分:3)
目前,我猜,它无法使用Window函数更新多个列。你可以表现得好像它在下面同时发生
val customers = sc.parallelize(List(("Alice", "2016-05-01", 50.00,4),
("Alice", "2016-05-03", 45.00,2),
("Alice", "2016-05-04", 55.00,4),
("Bob", "2016-05-01", 25.00,6),
("Bob", "2016-05-04", 29.00,7),
("Bob", "2016-05-06", 27.00,10))).
toDF("name", "date", "amountSpent","NumItems")
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// Create a window spec.
val wSpec1 = Window.partitionBy("name").orderBy("date").rowsBetween(-1, 1)
var tempdf = customers
val colNames = List("amountSpent", "NumItems")
for(column <- colNames){
tempdf = tempdf.withColumn(column+"Sum", sum(tempdf(column)).over(wSpec1))
}
tempdf.show(false)
您应该输出
+-----+----------+-----------+--------+--------------+-----------+
|name |date |amountSpent|NumItems|amountSpentSum|NumItemsSum|
+-----+----------+-----------+--------+--------------+-----------+
|Bob |2016-05-01|25.0 |6 |54.0 |13 |
|Bob |2016-05-04|29.0 |7 |81.0 |23 |
|Bob |2016-05-06|27.0 |10 |56.0 |17 |
|Alice|2016-05-01|50.0 |4 |95.0 |6 |
|Alice|2016-05-03|45.0 |2 |150.0 |10 |
|Alice|2016-05-04|55.0 |4 |100.0 |6 |
+-----+----------+-----------+--------+--------------+-----------+
答案 1 :(得分:3)
是的,可以只计算一次窗口(如果你有Spark 2允许你使用带结构类型的val colNames = List("amountSpent","NumItems")
val cols= colNames.map(col(_))
// put window-content of all columns in one struct
val df_wc_arr = customers
.withColumn("window_content_arr",collect_list(struct(cols:_*)).over(wSpec1))
// calculate sum of window-content for each column
// aggregation exression used later
val aggExpr = colNames.map(n => sum(col("window_content."+n)).as(n+"Sum"))
df_wc_arr
.withColumn("window_content",explode($"window_content_arr"))
.drop($"window_content_arr")
.groupBy(($"name" :: $"date" :: cols):_*)
.agg(aggExpr.head,aggExpr.tail:_*)
.orderBy($"name",$"date")
.show
),假设你的代码中有数据框和windowSpec,那么:
+-----+----------+-----------+--------+--------------+-----------+
| name| date|amountSpent|NumItems|amountSpentSum|NumItemsSum|
+-----+----------+-----------+--------+--------------+-----------+
|Alice|2016-05-01| 50.0| 4| 95.0| 6|
|Alice|2016-05-03| 45.0| 2| 150.0| 10|
|Alice|2016-05-04| 55.0| 4| 100.0| 6|
| Bob|2016-05-01| 25.0| 6| 54.0| 13|
| Bob|2016-05-04| 29.0| 7| 81.0| 23|
| Bob|2016-05-06| 27.0| 10| 56.0| 17|
+-----+----------+-----------+--------+--------------+-----------+
给出
{{1}}