使用DataFrame按组分组的Python Spark累积和

时间:2017-08-29 18:50:04

标签: apache-spark pyspark spark-dataframe

如何使用DataFrame abstraction专门计算每组的累计总和;并在PySpark

使用如下示例数据集:

df = sqlContext.createDataFrame( [(1,2,"a"),(3,2,"a"),(1,3,"b"),(2,2,"a"),(2,3,"b")], 
                                 ["time", "value", "class"] )

+----+-----+-----+
|time|value|class|
+----+-----+-----+
|   1|    2|    a|
|   3|    2|    a|
|   1|    3|    b|
|   2|    2|    a|
|   2|    3|    b|
+----+-----+-----+

我想在(有序)value变量上为每个class分组添加time的累积和列。

2 个答案:

答案 0 :(得分:37)

这可以使用窗口函数和窗口范围内的Window.unboundedPreceding值的组合来完成,如下所示:

from pyspark.sql import Window
from pyspark.sql import functions as F

windowval = (Window.partitionBy('class').orderBy('time')
             .rangeBetween(Window.unboundedPreceding, 0))
df_w_cumsum = df.withColumn('cum_sum', F.sum('value').over(windowval))
df_w_cumsum.show()
+----+-----+-----+-------+
|time|value|class|cum_sum|
+----+-----+-----+-------+
|   1|    3|    b|      3|
|   2|    3|    b|      6|
|   1|    2|    a|      2|
|   2|    2|    a|      4|
|   3|    2|    a|      6|
+----+-----+-----+-------+

答案 1 :(得分:1)

我已经尝试过这种方法,并且对我有用。

from pyspark.sql import Window

from pyspark.sql import functions as f

import sys

cum_sum = DF.withColumn('cumsum', f.sum('value').over(Window.partitionBy('class').orderBy('time').rowsBetween(-sys.maxsize, 0)))
cum_sum.show()