Copy missing column values from the nearest row above/below

Time: 2019-04-15 07:51:03

Tags: scala apache-spark apache-spark-sql

I have a dataframe with an index, a category, and several other columns. The index and category are never null, but the other columns can be. Whenever all of the other columns in a row are null, their values must be copied from the nearest non-null row above/below within the same category.

val df = Seq(
  (1, 1, null, null, null),
  (2, 1, null, null, null),
  (3, 1, null, null, null),
  (4, 1, "123.12", "124.52", "95.98"),
  (5, 1, "452.12", "478.65", "1865.12"),
  (1, 2, "2014.21", "147", "265"),
  (2, 2, "1457", "12483.00", "215.21"),
  (3, 2, null, null, null),
  (4, 2, null, null, null)
).toDF("index", "category", "col1", "col2", "col3")


scala> df.show
+-----+--------+-------+--------+-------+
|index|category|   col1|    col2|   col3|
+-----+--------+-------+--------+-------+
|    1|       1|   null|    null|   null|
|    2|       1|   null|    null|   null|
|    3|       1|   null|    null|   null|
|    4|       1| 123.12|  124.52|  95.98|
|    5|       1| 452.12|  478.65|1865.12|
|    1|       2|2014.21|     147|    265|
|    2|       2|   1457|12483.00| 215.21|
|    3|       2|   null|    null|   null|
|    4|       2|   null|    null|   null|
+-----+--------+-------+--------+-------+

The expected dataframe is:

+-----+--------+-------+--------+-------+
|index|category|   col1|    col2|   col3|
+-----+--------+-------+--------+-------+
|    1|       1| 123.12|  124.52|  95.98|       // Copied from below for same category
|    2|       1| 123.12|  124.52|  95.98|       // Copied from below for same category
|    3|       1| 123.12|  124.52|  95.98|
|    4|       1| 123.12|  124.52|  95.98|
|    5|       1| 452.12|  478.65|1865.12|
|    1|       2|2014.21|     147|    265|
|    2|       2|   1457|12483.00| 215.21|
|    3|       2|   1457|12483.00| 215.21|       // Copied from above for same category
|    4|       2|   1457|12483.00| 215.21|       // Copied from above for same category
+-----+--------+-------+--------+-------+
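The fill rule above (take the last non-null value above in the same category, otherwise the first non-null value below) can be sketched for a single column in plain Python. This is only an illustration of the semantics; the actual solution below uses Spark window functions.

```python
def fill(values):
    """Fill None entries with the last non-None value before them,
    falling back to the first non-None value after them."""
    n = len(values)
    # Forward pass: last non-None value seen so far (like last(..., ignoreNulls)
    # over an unbounded-preceding window).
    fwd, seen = [None] * n, None
    for i, v in enumerate(values):
        if v is not None:
            seen = v
        fwd[i] = seen
    # Backward pass: first non-None value at or after each position (like
    # first(..., ignoreNulls) over an unbounded-following window).
    bwd, seen = [None] * n, None
    for i in range(n - 1, -1, -1):
        if values[i] is not None:
            seen = values[i]
        bwd[i] = seen
    # coalesce: keep the original value, else the fill from above, else below.
    return [f if f is not None else b for f, b in zip(fwd, bwd)]

# col1 of category 1 from the sample data:
print(fill([None, None, None, "123.12", "452.12"]))
# → ['123.12', '123.12', '123.12', '123.12', '452.12']
```

This matches the expected output for category 1: the three leading nulls are filled from the first non-null row below, and row 5 keeps its own value.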

1 answer:

Answer 0 (score: 1)

Update: when several consecutive rows can be null, windows that look beyond the adjacent row are required:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, col, first, last}

val cols = Seq("col1", "col2", "col3")

// All rows of the same category from the start of the partition up to the current row
val beforeWindow = Window
  .partitionBy("category")
  .orderBy("index")
  .rangeBetween(Window.unboundedPreceding, Window.currentRow)

// All rows of the same category from the current row to the end of the partition
val afterWindow = Window
  .partitionBy("category")
  .orderBy("index")
  .rangeBetween(Window.currentRow, Window.unboundedFollowing)

val result = cols.foldLeft(df)((updated, columnName) =>
  updated.withColumn(columnName,
    coalesce(col(columnName),
      last(columnName, ignoreNulls = true).over(beforeWindow),
      first(columnName, ignoreNulls = true).over(afterWindow)
    ))
)

When at most one consecutive row is null, the window functions `lag`, `lead`, and `coalesce` are enough:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, col, lag, lead}

val cols = Seq("col1", "col2", "col3")
val categoryWindow = Window.partitionBy("category").orderBy("index")

val result = cols.foldLeft(df)((updated, columnName) =>
  updated.withColumn(columnName,
    coalesce(col(columnName),
      lag(col(columnName), 1).over(categoryWindow),   // previous row, same category
      lead(col(columnName), 1).over(categoryWindow)   // next row, same category
    ))
)
result.show(false)

Output (from the answer author's own sample data, not the dataframe in the question):

+-----+--------+-------+--------+-------+
|index|category|col1   |col2    |col3   |
+-----+--------+-------+--------+-------+
|1    |1       |123.12 |124.52  |95.98  |
|2    |1       |123.12 |124.52  |95.98  |
|3    |1       |452.12 |478.65  |1865.12|
|1    |2       |2014.21|147     |265    |
|2    |2       |1457   |12483.00|215.21 |
|3    |2       |1.25   |3.45    |26.3   |
|4    |2       |1.25   |3.45    |26.3   |
+-----+--------+-------+--------+-------+