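I am trying to replace the null or invalid values in a column with the nearest non-null value above or below them in the same column. For example: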
Name|Place|row_count
a |a1 |1
a |a2 |2
a |a2 |3
|d1 |4
b |a2 |5
c |a2 |6
| |7
| |8
d |c1 |9
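In this case, I want to replace all the NULL values in the "Name" column. The first NULL should be replaced with "a" and the second NULL with "c"; in the "Place" column, NULL should be replaced with "a2". When we replace the NULL in the 8th cell of the "Place" column, it should likewise be replaced with its nearest non-null value, "a2".
Required result:
If we choose to replace the NULL in the 8th cell of the "Place" column, the result will be: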
Name|Place|row_count
a |a1 |1
a |a2 |2
a |a2 |3
|d1 |4
b |a2 |5
c |a2 |6
| |7
|a2 |8
d |c1 |9
If we choose to replace the NULL in the 4th cell of the "Name" column, the result will be:
Name|Place|row_count
a |a1 |1
a |a2 |2
a |a2 |3
a |d1 |4
b |a2 |5
c |a2 |6
| |7
| |8
d |c1 |9
Answer 0 (score: 0)
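Window functions solve this problem easily. For simplicity, I focus only on the name column. A null name takes the value from the previous row; if the previous row is also null, the value from the next row is used. You can change this order as needed. The same approach has to be applied to the other columns.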
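// Assumes a SparkSession named `spark` is in scope (as in spark-shell).
import spark.implicits._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Sample data. Note that row 4 here has name "d1" and a null place,
// whereas the question's table has a null Name and Place "d1" in that row.
val df = Seq(("a", "a1", "1"),
  ("a", "a2", "2"),
  ("a", "a2", "3"),
  ("d1", null, "4"),
  ("b", "a2", "5"),
  ("c", "a2", "6"),
  (null, null, "7"),
  (null, null, "8"),
  ("d", "c1", "9")).toDF("name", "place", "row_count")

// No partitionBy: Spark evaluates this window over all rows in a single partition.
val window = Window.orderBy("row_count")
val lagNameWindowExpression = lag($"name", 1).over(window)
val leadNameWindowExpression = lead($"name", 1).over(window)

// Null name: use the previous row's name, or the next row's name if the previous one is null too.
val nameConditionExpression =
  when($"name".isNull.and($"previous_name_col".isNull), $"next_name_col")
    .when($"name".isNull.and($"previous_name_col".isNotNull), $"previous_name_col")
    .otherwise($"name")

df.select($"*", lagNameWindowExpression as "previous_name_col", leadNameWindowExpression as "next_name_col")
  .withColumn("name", nameConditionExpression)
  .drop("previous_name_col", "next_name_col")
  .show(false)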
Output:
+----+-----+---------+
|name|place|row_count|
+----+-----+---------+
|a |a1 |1 |
|a |a2 |2 |
|a |a2 |3 |
|d1 |null |4 |
|b |a2 |5 |
|c |a2 |6 |
|c |null |7 |
|d |null |8 |
|d |c1 |9 |
+----+-----+---------+
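The lag/lead version above only looks one row up or down, so a run of three or more consecutive nulls would leave some of them unfilled, and it has to be repeated by hand for each column. Below is a minimal sketch of one possible variation, not part of the original answer: it takes the last non-null value at or above the current row and, if there is none, the first non-null value below it. The names FillNearestSketch, fillNearest and demo are hypothetical; the data follows the question's table (Name is null and Place is "d1" in row 4), and row_count is assumed to be an integer so that the ordering is numeric rather than lexicographic.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, col, first, last}

object FillNearestSketch {

  // Hypothetical helper: last non-null value at or above the current row,
  // falling back to the first non-null value below it.
  def fillNearest(colName: String, orderCol: Column): Column = {
    val above = Window.orderBy(orderCol)
      .rowsBetween(Window.unboundedPreceding, Window.currentRow)
    val below = Window.orderBy(orderCol)
      .rowsBetween(Window.currentRow, Window.unboundedFollowing)
    coalesce(
      last(col(colName), ignoreNulls = true).over(above),
      first(col(colName), ignoreNulls = true).over(below)
    )
  }

  // Builds the question's sample data and fills both columns.
  def demo(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val df = Seq[(String, String, Int)](
      ("a", "a1", 1),
      ("a", "a2", 2),
      ("a", "a2", 3),
      (null, "d1", 4),
      ("b", "a2", 5),
      ("c", "a2", 6),
      (null, null, 7),
      (null, null, 8),
      ("d", "c1", 9)
    ).toDF("name", "place", "row_count")

    // Apply the same expression to every column that needs filling.
    Seq("name", "place")
      .foldLeft(df) { (acc, c) => acc.withColumn(c, fillNearest(c, col("row_count"))) }
      .orderBy("row_count")
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough for trying this out.
    val spark = SparkSession.builder().master("local[1]").appName("fill-nearest").getOrCreate()
    demo(spark).show(false)
    spark.stop()
  }
}
```

With this data, row 4 gets name "a", and rows 7 and 8 both get name "c" and place "a2", which is what the question asks for. Like the original, the window has no partitionBy, so Spark pulls all rows into a single partition; on a large table it is worth partitioning by a suitable key if one exists.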