Question

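I have a Spark DataFrame in which one column does not hold a value for every row, but only for some rows (on a fairly regular basis, e.g. only every 5 to 10 rows based on the id).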

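Now I would like to apply a window function to the rows that contain a value, covering the two previous and the two following rows that also contain a value (so basically pretending that the rows containing nulls do not exist, i.e. they do not count towards the rowsBetween range of the window). In practice, my effective window size can be arbitrary, depending on how many rows containing nulls there are. However, I always need exactly two values before and two after. Also, the end result should contain all rows, because other columns hold important information.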
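To pin down the intended semantics independently of Spark, here is a minimal plain-Python sketch. The function name `around_sum` and the parameter `k` are hypothetical and only serve as illustration; the sketch assumes the rows are already sorted by id.

```python
from typing import List, Optional


def around_sum(values: List[Optional[int]], k: int = 2) -> List[Optional[int]]:
    """For every non-null entry, sum it with the k previous and k next non-null entries."""
    # Positions of the rows that actually carry a value.
    present = [i for i, v in enumerate(values) if v is not None]
    out: List[Optional[int]] = [None] * len(values)
    for rank, pos in enumerate(present):
        # Neighbours are chosen by rank among the non-null rows, not by row offset.
        lo = max(0, rank - k)
        neighbours = present[lo:rank + k + 1]
        out[pos] = sum(values[i] for i in neighbours)
    return out


vals = [i * 2 if i % 5 == 0 else None for i in range(100)]
res = around_sum(vals)
print(res[0], res[5], res[10], res[15])  # 30 60 100 150
```

Rows without a value stay None in the output, so the result keeps the same length as the input.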

Example

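For example, for the rows of the following DataFrame that are not null, I want to calculate the sum of the two previous, the current and the two following (non-null) values: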

from pyspark.sql.window import Window
import pyspark.sql.functions as F
from pyspark.sql import Row

df = spark.createDataFrame([Row(id=i, val=i * 2 if i % 5 == 0 else None, foo='other') for i in range(100)])
df.show()

Output:

+-----+---+----+
|  foo| id| val|
+-----+---+----+
|other|  0|   0|
|other|  1|null|
|other|  2|null|
|other|  3|null|
|other|  4|null|
|other|  5|  10|
|other|  6|null|
|other|  7|null|
|other|  8|null|
|other|  9|null|
|other| 10|  20|
|other| 11|null|
|other| 12|null|
|other| 13|null|
|other| 14|null|
|other| 15|  30|
|other| 16|null|
|other| 17|null|
|other| 18|null|
|other| 19|null|
+-----+---+----+

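If I simply use a window function over the DataFrame as it is, I cannot specify the condition that the values must not be null, so apart from the current row the window contains only null values, which makes the sum equal to the row's own value: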

df2 = df.withColumn('around_sum', F.when(F.col('val').isNotNull(), F.sum(F.col('val')).over(Window.rowsBetween(-2, 2).orderBy(F.col('id')))).otherwise(None))
df2.show()

Result:

+-----+---+----+----------+
|  foo| id| val|around_sum|
+-----+---+----+----------+
|other|  0|   0|         0|
|other|  1|null|      null|
|other|  2|null|      null|
|other|  3|null|      null|
|other|  4|null|      null|
|other|  5|  10|        10|
|other|  6|null|      null|
|other|  7|null|      null|
|other|  8|null|      null|
|other|  9|null|      null|
|other| 10|  20|        20|
|other| 11|null|      null|
|other| 12|null|      null|
|other| 13|null|      null|
|other| 14|null|      null|
|other| 15|  30|        30|
|other| 16|null|      null|
|other| 17|null|      null|
|other| 18|null|      null|
|other| 19|null|      null|
+-----+---+----+----------+

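I was able to get the desired result by creating a second DataFrame that contains only the rows where the value is not null, doing the window operation there, and then joining the result back: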

df3 = df.where(F.col('val').isNotNull())\
    .withColumn('around_sum', F.sum(F.col('val')).over(Window.rowsBetween(-2, 2).orderBy(F.col('id'))))\
    .select(F.col('around_sum'), F.col('id').alias('id2'))
df3 = df.join(df3, F.col('id') == F.col('id2'), 'outer').orderBy(F.col('id')).drop('id2')
df3.show()

Result:

+-----+---+----+----------+
|  foo| id| val|around_sum|
+-----+---+----+----------+
|other|  0|   0|        30|
|other|  1|null|      null|
|other|  2|null|      null|
|other|  3|null|      null|
|other|  4|null|      null|
|other|  5|  10|        60|
|other|  6|null|      null|
|other|  7|null|      null|
|other|  8|null|      null|
|other|  9|null|      null|
|other| 10|  20|       100|
|other| 11|null|      null|
|other| 12|null|      null|
|other| 13|null|      null|
|other| 14|null|      null|
|other| 15|  30|       150|
|other| 16|null|      null|
|other| 17|null|      null|
|other| 18|null|      null|
|other| 19|null|      null|
+-----+---+----+----------+

Question

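Now I am wondering whether I can somehow get rid of the join (and the second DataFrame) and instead specify the condition directly in the window function.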

Is this possible?

Answer 1

A good solution is to start by filling the nulls with 0 and then perform the operation. Apply fillna only to the column involved, like this:

df = df.fillna(0,subset=['val'])

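If you are not sure whether you want to get rid of the nulls, copy the column and compute the window over the copy, so that you can drop it after the operation.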

Like this:

df = df.withColumn('val2',F.col('val'))
df = df.fillna(0,subset=['val2'])
# Then perform the operations over val2.
df = df.withColumn('around_sum', F.sum(F.col('val2')).over(Window.rowsBetween(-2, 2).orderBy(F.col('id'))))
# After the operations, get rid of the copy column
df = df.drop('val2')

Window.rowsBetween - only consider rows fulfilling a specific condition (e.g. not being null)
