I am using a Window function in pyspark to compute a cumulative sum over future rows, but the range comes out lower than expected: when I specify all future rows, I get the cumulative sum over the past instead. Do I have a bug? Here is my example:
import sys

from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql.window import Window

def undiscountedCummulativeFutureReward(df):
    windowSpec = Window \
        .partitionBy('user') \
        .orderBy('time') \
        .rangeBetween(0, sys.maxsize)
    tot_reward = F.sum('reward').over(windowSpec)
    df_tot_reward = df.withColumn('undiscounted', tot_reward)
    return df_tot_reward
def makeData(spark):
    data = [{'user': 'bob', 'time': 3, 'reward': 10},
            {'user': 'bob', 'time': 4, 'reward': 9},
            {'user': 'bob', 'time': 5, 'reward': 11},
            {'user': 'jo', 'time': 4, 'reward': 6},
            {'user': 'jo', 'time': 5, 'reward': 7},
            ]
    schema = T.StructType([T.StructField('user', T.StringType(), False),
                           T.StructField('time', T.IntegerType(), False),
                           T.StructField('reward', T.IntegerType(), False)])
    return spark.createDataFrame(data=data, schema=schema)
def main(spark):
    df = makeData(spark)
    df = undiscountedCummulativeFutureReward(df)
    df.orderBy('user', 'time').show()
    return df
Running this, I get:
+----+----+------+------------+
|user|time|reward|undiscounted|
+----+----+------+------------+
| bob| 3| 10| 30|
| bob| 4| 9| 20|
| bob| 5| 11| 11|
| jo| 4| 6| 13|
| jo| 5| 7| 7|
+----+----+------+------------+
Answer (score: 0):
If you look at the documentation for rangeBetween, it says:
We recommend users use ``Window.unboundedPreceding``, ``Window.unboundedFollowing``,
and ``Window.currentRow`` to specify special boundary values, rather than using integral
values directly.
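
These special markers are plain class attributes on Window, so you can inspect them directly. A minimal check (the concrete printed values are an implementation detail and may differ across PySpark versions):

from pyspark.sql.window import Window

# Frame-boundary sentinels defined by pyspark; their concrete values are
# an implementation detail, which is why the docs recommend these names
# over raw integers like sys.maxsize.
print(Window.unboundedPreceding)  # a very large negative sentinel
print(Window.unboundedFollowing)  # a very large positive sentinel
print(Window.currentRow)          # 0
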
And that fixes it: if I change the window spec to
windowSpec = Window \
    .partitionBy('user') \
    .orderBy('time') \
    .rangeBetween(0, Window.unboundedFollowing)
then it runs as expected.
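
For reference, a self-contained sketch of the corrected function (same names as in the question; Window.currentRow is equivalent to the literal 0 for the lower bound):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

def undiscountedCummulativeFutureReward(df):
    # Frame covers the current row's time value through the end of the
    # partition, so each row sums its own reward plus all later rewards
    # for that user.
    windowSpec = Window \
        .partitionBy('user') \
        .orderBy('time') \
        .rangeBetween(Window.currentRow, Window.unboundedFollowing)
    return df.withColumn('undiscounted', F.sum('reward').over(windowSpec))

With this frame, each user's last row keeps just its own reward and every earlier row accumulates everything that follows it within the partition.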