I have a dataframe with 5 columns. I need to dynamically check whether any of these columns contain null values and replace each null with the average of the previous and the next value in that column, using Scala code (a rough window-based sketch of what I have in mind is at the end of this post).
I am able to check for nulls dynamically and filter those rows out, but I have not figured out how to replace the nulls with the average instead of just dropping them.
scala> val df = spark.createDataFrame(Seq(
| (1, Some(5), 2, "F"),
| (2, Some(2), 4, "F"),
| (3, None, 6, "N"),
| (4, Some(3), 8, "F")
| )).toDF("ACCT_ID", "M_CD", "C_CD","IND")
df: org.apache.spark.sql.DataFrame = [ACCT_ID: int, M_CD: int ... 2 more fields]
I built the filter condition for the dynamic null check:
val filterCond = df.columns.map(x=>col(x).isNotNull).reduce(_ && _)
filterCond: org.apache.spark.sql.Column = ((((ACCT_ID IS NOT NULL) AND (M_CD IS NOT NULL)) AND (C_CD IS NOT NULL)) AND (IND IS NOT NULL))
Applying it to the dataframe:
scala> val df1 = df.filter(filterCond)
df1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [ACCT_ID: int, M_CD: int ... 2 more fields]
scala> df1.show
+-------+----+----+---+
|ACCT_ID|M_CD|C_CD|IND|
+-------+----+----+---+
|      1|   5|   2|  F|
|      2|   2|   4|  F|
|      4|   3|   8|  F|
+-------+----+----+---+
Even so, I cannot get the rows that do contain nulls on their own, and I have not been able to work out the logic that replaces the nulls with the average. I have updated the question with sample input and expected output, shown further down.
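What I think should isolate the null rows is simply negating the same dynamically built condition; this is an untested sketch on my side, reusing filterCond from above:

import org.apache.spark.sql.functions.col

// Rows where at least one column IS null: either negate the all-not-null
// condition, or OR together per-column isNull checks.
val rowsWithNulls  = df.filter(!filterCond)
val rowsWithNulls2 = df.filter(df.columns.map(c => col(c).isNull).reduce(_ || _))
rowsWithNulls.show()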
Sample input (? marks a null value):
a1 a2 a3 a4 a5
1 5 8 9 10
? 6 8 2 3
5 4 6 ? 1
? 5 ? 6 4
4 2 3 4 4
Expected output:
a1 a2 a3 a4 a5
1 5 8 9 10
3 6 8 2 3
5 4 6 4 1
4 5 4 6 4
4 2 3 4 4
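For reference, this is the rough window-based approach I have in mind; it is only my own untested sketch, not a working solution. It assumes there is an ordering column to define "previous" and "next" (ACCT_ID in my real data, or a row index for the a1..a5 example) and that the columns to repair are the integer ones.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType

// Sketch only, under these assumptions: rows are ordered by ACCT_ID and every
// IntegerType column except the ordering key should be repaired.
val w = Window.orderBy("ACCT_ID")  // no partitionBy: single-partition warning on large data

// Build the list of columns to fix dynamically from the schema.
val fixableCols = df.schema.fields.collect {
  case f if f.dataType == IntegerType && f.name != "ACCT_ID" => f.name
}

// For each such column, replace a null with the average of the value in the
// previous and the next row of that column (cast to int, as in the expected output).
val repaired = fixableCols.foldLeft(df) { (acc, c) =>
  acc.withColumn(
    c,
    when(col(c).isNull,
      ((lag(col(c), 1).over(w) + lead(col(c), 1).over(w)) / 2).cast("int"))
      .otherwise(col(c))
  )
}

repaired.show()

I am not sure this is the right way to keep it fully dynamic for any number of columns, or how to handle a null in the first or last row (where lag or lead is itself null), so any guidance would be appreciated.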