How to compute a cumulative sum over a PySpark table

Date: 2017-10-27 19:24:32

Tags: python, pyspark

I have a table in PySpark built with the crosstab function, like this:

df = sqlContext.createDataFrame( [(1,2,"a"),(3,2,"a"),(1,3,"b"),(2,2,"a"),(2,3,"b")],
                             ["time", "value", "class"] )

tabla = df.crosstab("value","class")
tabla.withColumn("Total",tabla.a + tabla.b).show()


+-----------+---+---+-----+
|value_class|  a|  b|Total|
+-----------+---+---+-----+
|          2|  3|  0|    3|
|          3|  0|  2|    2|
+-----------+---+---+-----+

I need to add a new column containing the cumulative sum of "Total".

1 Answer:

Answer 0 (score: 0)

Hope this helps:

I only gave a minimal example here, but you can use partitionBy, orderBy, etc. to build the window.
