Fill null values in a dataframe column with the next value

Posted: 2019-03-27 10:04:41

Tags: scala apache-spark

I have to fill the first null values in a column with the immediately following non-null value in the same column of the dataframe. This logic applies only to the first consecutive run of null values in that column.

I have a dataframe similar to the one below.

 // I replaced null with 0 in the value column
 val df = Seq( (0,"exA",30), (0,"exB",22), (0,"exC",19), (16,"exD",13),
               (5,"exE",28), (6,"exF",26), (0,"exG",12), (13,"exH",53))
               .toDF("value", "col2", "col3")

scala> df.show(false)
+-----+----+----+
|value|col2|col3|
+-----+----+----+
|0    |exA |30  |
|0    |exB |22  |
|0    |exC |19  |
|16   |exD |13  |
|5    |exE |28  |
|6    |exF |26  |
|0    |exG |12  |
|13   |exH |53  |
+-----+----+----+

I am expecting the dataframe below:

scala> df.show(false)
+-----+----+----+
|value|col2|col3|
+-----+----+----+
|16   |exA |30  |    // Change the value 0 to 16 in the value column
|16   |exB |22  |    // Change the value 0 to 16 in the value column
|16   |exC |19  |    // Change the value 0 to 16 in the value column
|16   |exD |13  |
|5    |exE |28  |
|6    |exF |26  |
|0    |exG |12  |    // the value should not be changed here
|13   |exH |53  |
+-----+----+----+

Please help me solve this problem.

3 Answers:

Answer 0 (score: 1)

You can use a Window function for this:

 import org.apache.spark.sql.expressions.Window
 import org.apache.spark.sql.functions._
 // spark.implicits._ (for the $ syntax) is already in scope in spark-shell

 val df = Seq( (0,"exA",30), (0,"exB",22), (0,"exC",19), (16,"exD",13),
           (5,"exE",28), (6,"exF",26), (0,"exG",12), (13,"exH",53))
           .toDF("value", "col2", "col3")
 val w = Window.orderBy($"col2".desc)
 df.withColumn("Result", last(when($"value" === 0, null).otherwise($"value"), ignoreNulls = true).over(w))
  .orderBy($"col2")
  .show(10)

which gives:

+-----+----+----+------+
|value|col2|col3|Result|
+-----+----+----+------+
|    0| exA|  30|    16|
|    0| exB|  22|    16|
|    0| exC|  19|    16|
|   16| exD|  13|    16|
|    5| exE|  28|     5|
|    6| exF|  26|     6|
|    0| exG|  12|    13|
|   13| exH|  53|    13|
+-----+----+----+------+

The final .orderBy($"col2") is only needed to display the result in the correct order. You can skip it if you don't care about the final ordering.

UPDATE: To get exactly what you need, you have to use somewhat more complex code:

val w = Window.orderBy($"col2")
val w2 = Window.orderBy($"col2".desc)
df.withColumn("IntermediateResult", first(when($"value" === 0, null).otherwise($"value"), ignoreNulls = true).over(w))
  .withColumn("Result", when($"IntermediateResult".isNull, last($"IntermediateResult", ignoreNulls = true).over(w2)).otherwise($"value"))
  .orderBy($"col2")
  .show(10)

+-----+----+----+------------------+------+
|value|col2|col3|IntermediateResult|Result|
+-----+----+----+------------------+------+
|    0| exA|  30|              null|    16|
|    0| exB|  22|              null|    16|
|    0| exC|  19|              null|    16|
|   16| exD|  13|                16|    16|
|    5| exE|  28|                16|     5|
|    6| exF|  26|                16|     6|
|    0| exG|  12|                16|     0|
|   13| exH|  53|                16|    13|
+-----+----+----+------------------+------+
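
To end up with exactly the columns the question asks for, the helper column can be dropped afterwards. A minimal follow-up sketch, assuming the same spark-shell session, imports, df, w and w2 defined above:

 // same logic as above, then mapped back to the asker's expected columns
 val filled = df
   .withColumn("IntermediateResult", first(when($"value" === 0, null).otherwise($"value"), ignoreNulls = true).over(w))
   .withColumn("Result", when($"IntermediateResult".isNull, last($"IntermediateResult", ignoreNulls = true).over(w2)).otherwise($"value"))
   .drop("value", "IntermediateResult")
   .withColumnRenamed("Result", "value")
   .select("value", "col2", "col3")
   .orderBy($"col2")

 filled.show(false)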

Answer 1 (score: 0)

I think you need to take the first non-null (non-zero) value based on the order of col2. Please find the script below. I registered the dataframe as an in-memory temp table in Spark so I could write SQL against it.

val df = Seq( (0,"exA",30), (0,"exB",22), (0,"exC",19), (16,"exD",13),
              (5,"exE",28), (6,"exF",26), (0,"exG",12), (13,"exH",53))
              .toDF("value", "col2", "col3")
df.registerTempTable("table_df")
spark.sql("""with cte as (select *, row_number() over(order by col2) rno from table_df)
            |select case when value = 0 and rno < (select min(rno) from cte where value != 0)
            |            then (select value from cte where rno = (select min(rno) from cte where value != 0))
            |            else value end as value, col2, col3
            |from cte""".stripMargin).show(df.count.toInt, false)
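
As a side note, registerTempTable is deprecated since Spark 2.0; the non-deprecated equivalent, with the same behaviour for this script, is:

 df.createOrReplaceTempView("table_df")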


Let me know if you have any questions.

Answer 2 (score: -1)

I added a new column with an incremental ID to the DF:

import org.apache.spark.sql.functions._    
val df_1 = Seq((0,"exA",30),
    (0,"exB",22), 
    (0,"exC",19), 
    (16,"exD",13),  
    (5,"exE",28), 
    (6,"exF",26), 
    (0,"exG",12), 
    (13,"exH",53))
    .toDF("value", "col2", "col3")
    .withColumn("UniqueID", monotonically_increasing_id)

Filter the DF to keep only the rows with non-zero values:

val df_2 = df_1.filter("value != 0")

Create a variable "limit" to bound the first N rows that need to be changed, and a variable nVal holding the first non-zero value:

val limit = df_2.agg(min("UniqueID")).collect().map(_(0)).mkString("").toInt + 1
val nVal = df_1.limit(limit).agg(max("value")).collect().map(_(0)).mkString("").toInt
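
As a side note, the string round-trip via mkString can be avoided by reading the aggregated Row directly. A small sketch under the same assumptions (UniqueID is a LongType column from monotonically_increasing_id, value is an IntegerType column; limitAlt and nValAlt are just illustrative names):

 // same values as limit / nVal above, read back without the string conversion
 val limitAlt = df_2.agg(min("UniqueID")).first().getLong(0).toInt + 1
 val nValAlt  = df_1.limit(limitAlt).agg(max("value")).first().getInt(0)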

Create a DF with a column of the same name ("value"), set conditionally:

val df_4 = df_1.withColumn("value", when(($"UniqueID" < limit), nVal).otherwise($"value"))
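
Finally, to inspect the result (a sketch, assuming the same spark-shell session and definitions as above; the helper UniqueID column is dropped just for display):

 df_4.drop("UniqueID").show(false)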