Fill missing values with the previous and next non-missing values

Time: 2020-08-20 13:08:02

Tags: pyspark apache-spark-sql pyspark-dataframes

I know you can forward/backward fill missing values with the previous or next non-missing value by combining the first/last functions with a window function.

But I have data like the following:

Area,Date,Population
A, 1/1/2000, 10000
A, 2/1/2000, 
A, 3/1/2000, 
A, 4/1/2000, 10030
A, 5/1/2000, 
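For reproducibility, a minimal PySpark sketch that builds this sample data (column types are inferred here, and None marks the missing values):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data from above; Population is left as None where it is missing
df = spark.createDataFrame(
    [('A', '1/1/2000', 10000),
     ('A', '2/1/2000', None),
     ('A', '3/1/2000', None),
     ('A', '4/1/2000', 10030),
     ('A', '5/1/2000', None)],
    ['Area', 'Date', 'Population'])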

In this example, for the May population I want to fill in 10030, which is easy. But for February and March, the value I want to fill in is the average of 10000 and 10030, not just 10000 or just 10030.

Do you know how to achieve this?

Thanks

2 Answers:

Answer 0 (score: 1)

Get the previous and next values and compute their average, as shown below:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._  // for the $"col" syntax (assumes a SparkSession named spark)

df2.show(false)
df2.printSchema()
/**
  * +----+--------+----------+
  * |Area|Date    |Population|
  * +----+--------+----------+
  * |A   |1/1/2000|10000     |
  * |A   |2/1/2000|null      |
  * |A   |3/1/2000|null      |
  * |A   |4/1/2000|10030     |
  * |A   |5/1/2000|null      |
  * +----+--------+----------+
  *
  * root
  * |-- Area: string (nullable = true)
  * |-- Date: string (nullable = true)
  * |-- Population: integer (nullable = true)
  */

// w1 looks backwards from the current row, w2 looks forwards
val w1 = Window.partitionBy("Area").orderBy("Date").rowsBetween(Window.unboundedPreceding, Window.currentRow)
val w2 = Window.partitionBy("Area").orderBy("Date").rowsBetween(Window.currentRow, Window.unboundedFollowing)

df2.withColumn("previous", last("Population", ignoreNulls = true).over(w1))  // last non-null value up to the current row
  .withColumn("next", first("Population", ignoreNulls = true).over(w2))      // first non-null value from the current row onwards
  // average of the two; coalesce falls back to the single available neighbour at the edges
  .withColumn("new_Population", (coalesce($"previous", $"next") + coalesce($"next", $"previous")) / 2)
  .drop("next", "previous")
  .show(false)

/**
  * +----+--------+----------+--------------+
  * |Area|Date    |Population|new_Population|
  * +----+--------+----------+--------------+
  * |A   |1/1/2000|10000     |10000.0       |
  * |A   |2/1/2000|null      |10015.0       |
  * |A   |3/1/2000|null      |10015.0       |
  * |A   |4/1/2000|10030     |10030.0       |
  * |A   |5/1/2000|null      |10030.0       |
  * +----+--------+----------+--------------+
  */
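Since the question is tagged pyspark, the same previous/next-and-average idea could be sketched in PySpark roughly as follows (df is assumed to be the example DataFrame, i.e. df2 above):

import pyspark.sql.functions as f
from pyspark.sql import Window

# Backward-looking and forward-looking frames within each Area
w_prev = Window.partitionBy('Area').orderBy('Date').rowsBetween(Window.unboundedPreceding, Window.currentRow)
w_next = Window.partitionBy('Area').orderBy('Date').rowsBetween(Window.currentRow, Window.unboundedFollowing)

result = (df
          .withColumn('previous', f.last('Population', ignorenulls=True).over(w_prev))
          .withColumn('next', f.first('Population', ignorenulls=True).over(w_next))
          # average of both neighbours; at the edges only one exists, so coalesce falls back to it
          .withColumn('Population',
                      (f.coalesce(f.col('previous'), f.col('next')) +
                       f.coalesce(f.col('next'), f.col('previous'))) / 2)
          .drop('previous', 'next'))
result.show()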

Answer 1 (score: 0)

Here is my attempt.

w1 and w2 are used to build the partition markers, while w3 and w4 are used to pick up the preceding and following non-null values. After that, you can apply a condition to decide how Population should be filled.

import pyspark.sql.functions as f
from pyspark.sql import Window

# w1 / w2: frames for the running counts of non-null values (partition1 / partition2 below)
w1 = Window.partitionBy('Area').orderBy('Date').rowsBetween(Window.unboundedPreceding, Window.currentRow)
w2 = Window.partitionBy('Area').orderBy('Date').rowsBetween(Window.currentRow, Window.unboundedFollowing)
# w3 / w4: within each such group, first() picks the previous / next non-null Population
w3 = Window.partitionBy('Area', 'partition1').orderBy('Date')
w4 = Window.partitionBy('Area', 'partition2').orderBy(f.desc('Date'))

df.withColumn('check', f.col('Population').isNotNull().cast('int')) \
  .withColumn('partition1', f.sum('check').over(w1)) \
  .withColumn('partition2', f.sum('check').over(w2)) \
  .withColumn('first', f.first('Population').over(w3)) \
  .withColumn('last',  f.first('Population').over(w4)) \
  .withColumn('fill', f.when(f.col('first').isNotNull() & f.col('last').isNotNull(), (f.col('first') + f.col('last')) / 2).otherwise(f.coalesce('first', 'last'))) \
  .withColumn('Population', f.coalesce('Population', 'fill')) \
  .orderBy('Date') \
  .select(*df.columns).show(10, False)

+----+--------+----------+
|Area|Date    |Population|
+----+--------+----------+
|A   |1/1/2000|10000.0   |
|A   |2/1/2000|10015.0   |
|A   |3/1/2000|10015.0   |
|A   |4/1/2000|10030.0   |
|A   |5/1/2000|10030.0   |
+----+--------+----------+
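As a quick sanity check (a sketch; result is assumed to hold the filled DataFrame produced by either approach), the filled values can be compared against the expected averages:

filled = {row['Date']: row['Population'] for row in result.collect()}
assert filled['2/1/2000'] == (10000 + 10030) / 2   # 10015.0, mean of the bounding values
assert filled['5/1/2000'] == 10030.0               # only a previous value exists, so it is carried forward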