我有一个火花数据框,我想根据当前行数量值和基于groupid和id的Amount值的上一行总和计算一个运行总计。让我拿出df
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
import pandas as pd
sc = spark.sparkContext
data1 = {'date': {0: '2018-04-03', 1: '2018-04-04', 2: '2018-04-05', 3: '2018-04-06', 4: '2018-04-07'},
'id': {0: 'id1', 1: 'id2', 2: 'id1', 3: 'id3', 4: 'id2'},
'group': {0: '1', 1: '1', 2: '1', 3: '2', 4: '1'},
'amount': {0: 50, 1: 40, 2: 50, 3: 55, 4: 20}}
df1_pd = pd.DataFrame(data1, columns=data1.keys())
df1 = spark.createDataFrame(df1_pd)
df1.show()
+----------+---+-----+------+
| date| id|group|amount|
+----------+---+-----+------+
|2018-04-03|id1| 1| 50|
|2018-04-04|id2| 1| 40|
|2018-04-05|id1| 1| 50|
|2018-04-06|id3| 2| 55|
|2018-04-07|id2| 1| 20|
+----------+---+-----+------+
我正在寻找的输出
+----------+---+-----+------+---+
| date| id|group|amount|sum|
+----------+---+-----+------+---+
|2018-04-03|id1| 1| 50|50 |
|2018-04-04|id2| 1| 40|90 |
|2018-04-05|id1| 1| 50|140|
|2018-04-06|id3| 2| 55|55 |
|2018-04-07|id2| 1| 20|160|
+----------+---+-----+------+---+
答案 0 :(得分:3)
窗口定义:
from pyspark.sql.window import Window
from pyspark.sql.functions import sum
w = Window.partitionBy("group").orderBy("date").rowsBetween(
Window.unboundedPreceding, # Take all rows from the beginning of frame
Window.currentRow # To current row
)
和:
(df1.withColumn("sum", sum("amount").over(w))
.orderBy("date") # Sort for easy inspection. Not necessary
.show())
结果:
+----------+---+-----+------+---+
| date| id|group|amount|sum|
+----------+---+-----+------+---+
|2018-04-03|id1| 1| 50| 50|
|2018-04-04|id2| 1| 40| 90|
|2018-04-05|id1| 1| 50|140|
|2018-04-06|id3| 2| 55| 55|
|2018-04-07|id2| 1| 20|160|
+----------+---+-----+------+---+