I have a Spark dataframe with an array column, like this:
+--------------+
| x |
+--------------+
| [1, 1, 0, 1] |
| [0, 0, 0, 0] |
| [0, 0, 1, 1] |
| [0, 0, 0, 1] |
| [1, 0, 1] |
+--------------+
I want to add a new column containing another array that holds the cumulative sum of x
at each index. The result should look like this:
+--------------+---------------+
| x | x_running_sum |
+--------------+---------------+
| [1, 1, 0, 1] | [1, 2, 2, 3] |
| [0, 0, 0, 0] | [0, 0, 0, 0] |
| [0, 0, 1, 1] | [0, 0, 1, 2] |
| [0, 0, 0, 1] | [0, 0, 0, 1] |
| [1, 0, 1] | [1, 1, 2] |
+--------------+---------------+
How can I create the x_running_sum
column? I have tried some of the higher-order functions, such as transform, aggregate, and zip_with, but I haven't found a solution yet.
Answer 0 (score: 3)
To compute the cumulative sum, I slice the array up to each index position and reduce the values in that slice:
from pyspark.sql import Row

df = spark.createDataFrame([
    Row(x=[1, 1, 0, 1]),
    Row(x=[0, 0, 0, 0]),
    Row(x=[0, 0, 1, 1]),
    Row(x=[0, 0, 0, 1]),
    Row(x=[1, 0, 1]),
])
(df
 .selectExpr(
     'x',
     "TRANSFORM(sequence(1, size(x)), "
     "  index -> REDUCE(slice(x, 1, index), CAST(0 AS BIGINT), "
     "    (acc, el) -> acc + el)) AS x_running_sum")
 .show(truncate=False))
Output:
+------------+-------------+
|x |x_running_sum|
+------------+-------------+
|[1, 1, 0, 1]|[1, 2, 2, 3] |
|[0, 0, 0, 0]|[0, 0, 0, 0] |
|[0, 0, 1, 1]|[0, 0, 1, 2] |
|[0, 0, 0, 1]|[0, 0, 0, 1] |
|[1, 0, 1] |[1, 1, 2] |
+------------+-------------+
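For reference, the per-row logic that the TRANSFORM/REDUCE expression computes is just a running (prefix) sum over each array. A minimal plain-Python sketch of the same operation, using the standard library's itertools.accumulate (this is only to illustrate the semantics; it is not how Spark evaluates the expression):

```python
from itertools import accumulate

def running_sum(xs):
    # Cumulative sum at each index: output[i] = xs[0] + ... + xs[i].
    return list(accumulate(xs))

print(running_sum([1, 1, 0, 1]))  # [1, 2, 2, 3]
print(running_sum([1, 0, 1]))     # [1, 1, 2]
```

Note that the slice-and-reduce approach above re-sums a prefix for every index, so it does O(n²) additions per row; for short arrays like these that is fine, but it is worth keeping in mind for very long arrays.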