I have a DataFrame with n columns, and I want to count the number of missing values in each column.
I use the following snippet to do this, but the output is not what I expect:
for (e <- df.columns) {
  var c: Int = df.filter(df(e).isNull || df(e) === "" || df(e).isNaN ||
    df(e) === "-" || df(e) === "NA").count()
  println(e + ":" + c)
}
Output:
column1:
column2:
column3:
How do I correctly get the count of missing values according to the logic described above?
Answer 0 (score: 1)
You can do it in a slightly different way.
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import spark.implicits._  // needed for toDF in the spark-shell

// A cell is "missing" if it is null, empty, NaN, "-", or "NA".
// count() only counts non-null results, so rows where the predicate
// evaluates to null (non-missing values) are skipped.
def predicate(c: Column) =
  c.isNull || c === "" || c.isNaN || c === "-" || c === "NA"

val df = List[(Integer, Integer, Integer)]((1, null, null), (null, 2, 3), (null, 3, null))
  .toDF("a", "b", "c")

df.select(df.columns.map(c => count(predicate(col(c))).as(s"nulls in column $c")): _*).show()
This code produces:
+-----------------+-----------------+-----------------+
|nulls in column a|nulls in column b|nulls in column c|
+-----------------+-----------------+-----------------+
| 2| 1| 2|
+-----------------+-----------------+-----------------+
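To see the counting logic independent of Spark, here is a minimal pure-Scala sketch that applies the same missing-value rules to an in-memory, row-oriented table. The names `MissingCounts`, `isMissing`, and `countMissing` are hypothetical helpers for illustration, not part of the answer or of any Spark API:

```scala
// Spark-free sketch of the same missing-value rules.
// `isMissing` and `countMissing` are hypothetical helper names.
object MissingCounts {
  // A cell counts as missing if it is null, empty, "NaN", "-", or "NA".
  def isMissing(v: String): Boolean =
    v == null || v == "" || v == "NaN" || v == "-" || v == "NA"

  // Count missing cells per named column in a row-oriented table.
  def countMissing(rows: Seq[Seq[String]], columns: Seq[String]): Map[String, Int] =
    columns.zipWithIndex.map { case (name, i) =>
      name -> rows.count(row => isMissing(row(i)))
    }.toMap
}
```

On the three-row example from the answer (`("1", null, null)`, `(null, "2", "3")`, `(null, "3", null)`), this yields a → 2, b → 1, c → 2, matching the Spark output above.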