Dynamically find null values in a DataFrame and replace them with the average of the next and previous values

Time: 2019-05-13 10:37:45

Tags: scala apache-spark dataframe apache-spark-sql

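I have a DataFrame with 5 columns. I need to dynamically check these columns for null values and replace each null with the average of the previous and next values, using Scala code.

I am able to check for null values dynamically and filter them out, but I don't see how to update the nulls with the average instead.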

scala> val df = spark.createDataFrame(Seq(
     |   (1, Some(5), 2, "F"),
     |   (2, Some(2), 4, "F"),
     |   (3, None, 6, "N"),
     |   (4, Some(3), 8, "F")
     | )).toDF("ACCT_ID", "M_CD", "C_CD", "IND")
df: org.apache.spark.sql.DataFrame = [ACCT_ID: int, M_CD: int ... 2 more fields]

I created a filter condition for the dynamic check:

scala> import org.apache.spark.sql.functions.col

scala> val filterCond = df.columns.map(x => col(x).isNotNull).reduce(_ && _)
filterCond: org.apache.spark.sql.Column = ((((ACCT_ID IS NOT NULL) AND (M_CD IS NOT NULL)) AND (C_CD IS NOT NULL)) AND (IND IS NOT NULL))

Applied it to the DataFrame:

scala> val df1 = df.filter(filterCond)
df1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [ACCT_ID: int, M_CD: int ... 2 more fields]

scala> df1.show
+-------+----+----+---+
|ACCT_ID|M_CD|C_CD|IND|
+-------+----+----+---+
|      1|   5|   2|  F|
|      2|   2|   4|  F|
|      4|   3|   8|  F|
+-------+----+----+---+

This way I cannot get the rows that contain nulls, and I cannot work out the logic to replace the null values.
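For reference, the rows dropped by the filter can be selected by inverting the condition: a row qualifies if any column is null. The following is a minimal sketch, assuming a local SparkSession; the object name `NullRows` and the helper `rowsWithNulls` are hypothetical names introduced for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object NullRows {
  // Hypothetical helper: keep only rows where at least one column is null.
  def rowsWithNulls(df: DataFrame): DataFrame = {
    val anyNull = df.columns.map(c => col(c).isNull).reduce(_ || _)
    df.filter(anyNull)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough for this illustration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("null-rows")
      .getOrCreate()
    import spark.implicits._

    // Same sample data as in the question.
    val df = Seq[(Int, Option[Int], Int, String)](
      (1, Some(5), 2, "F"),
      (2, Some(2), 4, "F"),
      (3, None, 6, "N"),
      (4, Some(3), 8, "F")
    ).toDF("ACCT_ID", "M_CD", "C_CD", "IND")

    // Shows the row(s) that df.filter(filterCond) removed.
    rowsWithNulls(df).show()

    spark.stop()
  }
}
```

This only locates the rows with nulls; it does not replace anything.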

I have updated the question above.

Input:

a1  a2  a3  a4  a5
1   5   8   9   10
?   6   8   2   3
5   4   6   ?   1
?   5   ?   6   4
4   2   3   4   4

Output:

a1  a2  a3  a4  a5
1   5   8   9   10
3   6   8   2   3
5   4   6   4   1
4   5   4   6   4
4   2   3   4   4
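One possible way to get from the input table to the output table is a window with `lag` and `lead`, combined with `coalesce` so that only nulls are overwritten. The sketch below rests on several assumptions that are not stated in the question: an ordering column (here the hypothetical `row_id`) defines what "previous" and "next" mean, the average is truncated to an integer (the expected output shows 4 where the exact average is 4.5), and every null has non-null neighbours on both sides. The names `FillNullsWithNeighbourAverage` and `fillNulls` are also hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, col, lag, lead}

object FillNullsWithNeighbourAverage {
  // Hypothetical helper: for every column except orderCol, replace a null
  // with the truncated average of the previous and next row's values.
  def fillNulls(df: DataFrame, orderCol: String): DataFrame = {
    // No partitionBy: Spark moves all rows to one partition for this window,
    // which is fine for small data but does not scale.
    val w = Window.orderBy(orderCol)
    val valueCols = df.columns.filterNot(_ == orderCol)

    valueCols.foldLeft(df) { (acc, c) =>
      val prev = lag(col(c), 1).over(w)
      val next = lead(col(c), 1).over(w)
      // Division yields a double; the cast truncates it (assumed rounding rule).
      val neighbourAvg = ((prev + next) / 2).cast("int")
      // Keep the existing value when present, otherwise use the average.
      // If a neighbour is itself null or missing, the result stays null.
      acc.withColumn(c, coalesce(col(c), neighbourAvg))
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("fill-nulls")
      .getOrCreate()
    import spark.implicits._

    type Row6 = (Int, Option[Int], Option[Int], Option[Int], Option[Int], Option[Int])

    // Input table from the question; row_id is an added, assumed ordering column.
    val df = Seq[Row6](
      (1, Some(1), Some(5), Some(8), Some(9), Some(10)),
      (2, None,    Some(6), Some(8), Some(2), Some(3)),
      (3, Some(5), Some(4), Some(6), None,    Some(1)),
      (4, None,    Some(5), None,    Some(6), Some(4)),
      (5, Some(4), Some(2), Some(3), Some(4), Some(4))
    ).toDF("row_id", "a1", "a2", "a3", "a4", "a5")

    fillNulls(df, "row_id").orderBy("row_id").show()

    spark.stop()
  }
}
```

Consecutive nulls and nulls in the first or last row are left untouched by this sketch, because `lag` or `lead` returns null there; handling those cases would need an extra rule that the question does not specify.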

0 answers:

No answers