Does pyspark's window function range go backwards?

Asked: 2018-09-27 23:23:26

Tags: pyspark

I am using a Window function in pyspark to compute a cumulative sum over future rows, but the range behaves differently than I expected. When I specify all future rows, I seem to get the cumulative sum over past rows instead. Is there a bug in my code? Here is my example:

import sys

from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql.window import Window


def undiscountedCummulativeFutureReward(df):
    windowSpec = Window \
        .partitionBy('user') \
        .orderBy('time') \
        .rangeBetween(0, sys.maxsize)

    tot_reward = F.sum('reward').over(windowSpec)

    df_tot_reward = df.withColumn('undiscounted', tot_reward)
    return df_tot_reward


def makeData(spark):
    data = [{'user': 'bob', 'time': 3, 'reward': 10},
            {'user': 'bob', 'time': 4, 'reward': 9},
            {'user': 'bob', 'time': 5, 'reward': 11},
            {'user': 'jo', 'time': 4, 'reward': 6},
            {'user': 'jo', 'time': 5, 'reward': 7},
            ]
    schema = T.StructType([T.StructField('user', T.StringType(), False),
                           T.StructField('time', T.IntegerType(), False),
                           T.StructField('reward', T.IntegerType(), False)])

    return spark.createDataFrame(data=data, schema=schema)


def main(spark):
    df = makeData(spark)
    df = undiscountedCummulativeFutureReward(df)
    df.orderBy('user', 'time').show()
    return df

Running this, I get:

+----+----+------+------------+                                                 
|user|time|reward|undiscounted|
+----+----+------+------------+
| bob|   3|    10|          30|
| bob|   4|     9|          20|
| bob|   5|    11|          11|
|  jo|   4|     6|          13|
|  jo|   5|     7|           7|
+----+----+------+------------+

1 Answer:

Answer 0 (score: 0)

If you look at the documentation for rangeBetween, it says:

We recommend users use ``Window.unboundedPreceding``, ``Window.unboundedFollowing``,
and ``Window.currentRow`` to specify special boundary values, rather than using integral
values directly.

and that fixes it: if I change the window spec to

windowSpec = Window \
    .partitionBy('user') \
    .orderBy('time') \
    .rangeBetween(0, Window.unboundedFollowing)

then it works as expected.
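For intuition, the semantics of a frame running from the current row to `Window.unboundedFollowing` can be sketched in plain Python without Spark: within each partition, each row receives a suffix sum of all rewards from its own position onward. This is a simplified row-based sketch (a true `rangeBetween` frame is value-based, but the two coincide here because the ordering values are distinct within each partition); the helper name `future_cumulative_reward` is hypothetical, not part of pyspark:

```python
from collections import defaultdict


def future_cumulative_reward(rows):
    """Mimic sum('reward') over a window partitioned by 'user',
    ordered by 'time', framed (currentRow, unboundedFollowing)."""
    # Group rows by the partition key.
    by_user = defaultdict(list)
    for row in rows:
        by_user[row['user']].append(row)

    out = []
    for user_rows in by_user.values():
        # Order each partition by time.
        user_rows.sort(key=lambda r: r['time'])
        # Walk the partition backwards, accumulating a suffix sum:
        # each row gets its own reward plus all later rewards.
        suffix = 0
        for row in reversed(user_rows):
            suffix += row['reward']
            out.append({**row, 'undiscounted': suffix})
    return sorted(out, key=lambda r: (r['user'], r['time']))


data = [{'user': 'bob', 'time': 3, 'reward': 10},
        {'user': 'bob', 'time': 4, 'reward': 9},
        {'user': 'bob', 'time': 5, 'reward': 11},
        {'user': 'jo', 'time': 4, 'reward': 6},
        {'user': 'jo', 'time': 5, 'reward': 7}]

for r in future_cumulative_reward(data):
    print(r['user'], r['time'], r['undiscounted'])
```

Run on the question's sample data, this reproduces the expected `undiscounted` column (30, 20, 11 for bob; 13, 7 for jo).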