Question

In a PySpark DataFrame, I have a column with the values 1, -1 and 0, representing the "start", "stop" and "other" events of an engine. I want to build a column holding the engine state: 1 when the engine is on and 0 when it is off.

If the 1s and -1s always alternated, this could be done easily with a window function (for example, a running sum of event ordered by seq).
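To make the alternating case concrete, here is a minimal plain-Python sketch of that running sum. The event values are illustrative. In Spark the equivalent would presumably be something like F.sum('event').over(Window.orderBy('seq')); that expression is an assumption, since the snippet from the original post did not survive.

```python
from itertools import accumulate

# Illustrative events: 1 = start, -1 = stop, 0 = other.
# Starts and stops strictly alternate here.
events = [1, 0, -1, 0, 0, 1, -1]

# Running (cumulative) sum of the events, in sequence order.
state = list(accumulate(events))

print(state)  # [1, 1, 0, 0, 0, 1, 0]
```

As long as the events alternate, the running sum never leaves the range 0 to 1, so it can be read directly as the engine state.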

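However, I may get some spurious 1s or -1s, which I want to ignore if I am already in state 1 or 0 respectively. So I would like to be able to do something like this: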

+---+-----+-----+
|seq|event|state|
+---+-----+-----+
| 1 |   1 |   1 |
| 2 |   0 |   1 |
| 3 |  -1 |   0 |
| 4 |   0 |   0 |
| 5 |   0 |   0 |
| 6 |   1 |   1 |
| 7 |  -1 |   0 |
+---+-----+-----+

This would require a "saturated" sum function that never goes above 1 or below 0, or some other approach that I can't think of at the moment.
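To pin down what "saturated" means here, the following is a small plain-Python sketch, not a Spark window function. The name saturated_sum and the sample events (which include a repeated 1 and a repeated -1) are hypothetical and only for illustration.

```python
from itertools import accumulate

def saturated_sum(events, low=0, high=1):
    """Running sum that is clamped to [low, high] after every step."""
    def step(acc, event):
        return min(high, max(low, acc + event))
    # Start from `low` (engine off), then drop that initial value.
    return list(accumulate(events, step, initial=low))[1:]

# Hypothetical events with a spurious second start and second stop.
events = [1, 1, 0, -1, -1, 0, 1, -1]

print(list(accumulate(events)))  # plain sum:     [1, 2, 2, 1, 0, 0, 1, 0]
print(saturated_sum(events))     # saturated sum: [1, 1, 1, 0, 0, 0, 1, 0]
```

The plain running sum drifts up to 2 after the repeated start, while the clamped version stays at 1 and then drops to 0 on the first stop.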

Does anyone have any ideas?

Answer 1

You can achieve the desired result with the last function, using it to forward-fill the most recent state change.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Reuse the active session if there is one (e.g. in a shell or notebook).
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
        (1, 1),
        (2, 0),
        (3, -1),
        (4, 0)
    ], ["seq", "event"])

# No partitionBy: all rows are ordered by seq in a single window partition.
w = Window.orderBy('seq')

# Replace zeros with nulls so they can be ignored easily.
df = df.withColumn('helperCol', F.when(df.event != 0, df.event))

# Fill state changes forward in a new column.
df = df.withColumn('state', F.last(df.helperCol, ignorenulls=True).over(w))

# Replace -1 values with 0.
df = df.replace(-1, 0, ['state'])

df.show()

This produces:

+---+-----+---------+-----+
|seq|event|helperCol|state|
+---+-----+---------+-----+
|  1|    1|        1|    1|
|  2|    0|     null|    1|
|  3|   -1|       -1|    0|
|  4|    0|     null|    0|
+---+-----+---------+-----+

helperCol does not need to be added to the DataFrame; I only included it to make the process easier to follow.
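The reason this handles spurious events is that the state at any row depends only on the most recent non-zero event, so a repeated 1 or -1 changes nothing. Below is a plain-Python sketch of that logic for checking it outside Spark; the function name and the sample events are hypothetical.

```python
def last_nonzero_state(events):
    """Map each position to the state implied by the latest non-zero event."""
    states = []
    last = None  # no start/stop seen yet (comparable to a null in Spark)
    for event in events:
        if event != 0:
            last = event
        if last is None:
            states.append(None)
        else:
            states.append(1 if last == 1 else 0)
    return states

# Hypothetical events with a repeated start and a repeated stop.
print(last_nonzero_state([1, 1, 0, -1, -1, 0, 1, -1]))
# [1, 1, 1, 0, 0, 0, 1, 0]

# Rows before the first start/stop have no state to carry forward.
print(last_nonzero_state([0, 0, 1, 0]))
# [None, None, 1, 1]
```

One caveat, as far as I can tell: if the data begins with 0 events, the Spark version should likewise leave state as null for those rows. If the engine can be assumed off at the start, wrapping the expression in F.coalesce(..., F.lit(0)) would be one way to fill them.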

Devising a saturated sum with window functions
