Computing the cumulative sum of an array column in PySpark

Asked: 2021-06-02 21:28:08

Tags: apache-spark pyspark apache-spark-sql

I have a Spark DataFrame with an array column, like this:

+--------------+
|            x |
+--------------+
| [1, 1, 0, 1] |
| [0, 0, 0, 0] |
| [0, 0, 1, 1] |
| [0, 0, 0, 1] |
|    [1, 0, 1] |
+--------------+

I want to add a new column containing another array that holds the running sum of x at each index. The result should look like this:

+--------------+---------------+
|            x | x_running_sum |
+--------------+---------------+
| [1, 1, 0, 1] |  [1, 2, 2, 3] |
| [0, 0, 0, 0] |  [0, 0, 0, 0] |
| [0, 0, 1, 1] |  [0, 0, 1, 2] |
| [0, 0, 0, 1] |  [0, 0, 0, 1] |
|    [1, 0, 1] |     [1, 1, 2] |
+--------------+---------------+

How can I create the x_running_sum column? I have tried some of the higher-order functions such as transform, aggregate, and zip_with, but I haven't found a solution yet.

1 answer:

Answer 0: (score: 3)

To compute the cumulative sum, I slice the array up to each index position and reduce the values of that slice:

from pyspark.sql import Row


# `spark` is an existing SparkSession (e.g. the one provided by the pyspark shell)
df = spark.createDataFrame([
  Row(x=[1, 1, 0, 1]),
  Row(x=[0, 0, 0, 0]),
  Row(x=[0, 0, 1, 1]),
  Row(x=[0, 0, 0, 1]),
  Row(x=[1, 0, 1])
])

(df
 .selectExpr(
     'x',
     # for each prefix length 1..size(x), slice x to that length and sum the slice
     "TRANSFORM(sequence(1, size(x)), index -> REDUCE(slice(x, 1, index), CAST(0 AS BIGINT), (acc, el) -> acc + el)) AS x_running_sum")
 .show(truncate=False))

Output:

+------------+-------------+
|x           |x_running_sum|
+------------+-------------+
|[1, 1, 0, 1]|[1, 2, 2, 3] |
|[0, 0, 0, 0]|[0, 0, 0, 0] |
|[0, 0, 1, 1]|[0, 0, 1, 2] |
|[0, 0, 0, 1]|[0, 0, 0, 1] |
|[1, 0, 1]   |[1, 1, 2]    |
+------------+-------------+
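
A possible follow-up, since the question also mentions aggregate: the slice-and-reduce expression above re-sums every prefix, so each row performs roughly size(x)²/2 additions. The same running sum can be built in a single pass with the AGGREGATE higher-order function by carrying the running total in a struct accumulator. The snippet below is only a sketch, assuming Spark >= 2.4 and the same df as above; the field names total and sums are arbitrary:

# accumulator: running total ('total') plus the running sums collected so far ('sums')
running_sum = """
  AGGREGATE(
    x,
    named_struct('total', CAST(0 AS BIGINT), 'sums', CAST(array() AS ARRAY<BIGINT>)),
    (acc, el) -> named_struct(
      'total', acc.total + el,
      'sums',  concat(acc.sums, array(acc.total + el))),
    acc -> acc.sums
  ) AS x_running_sum
"""

(df
 .selectExpr('x', running_sum)
 .show(truncate=False))

With this variant each element of x is read only once; the output array is still assembled with concat, but the repeated slicing and reducing is avoided.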