I have a DataFrame with 100 columns in Spark/Scala. Many of the columns contain a lot of null values. I want to find the columns that are more than 90% null and then drop them from my DataFrame. How can I do that in Spark/Scala?
Answer 0 (score: 2)
Suppose you have a DataFrame like this:
val df = Seq(
  (Some(1.0), Some(2), Some("a")),
  (None,      Some(3), None),
  (Some(2.0), Some(4), Some("b")),
  (None,      None,    Some("c"))
).toDF("A", "B", "C")
df.show
+----+----+----+
|   A|   B|   C|
+----+----+----+
| 1.0|   2|   a|
|null|   3|null|
| 2.0|   4|   b|
|null|null|   c|
+----+----+----+
Use the agg function to count the nulls per column, then keep only the columns whose null count is at or below a threshold, set here to 1:
val null_thresh = 1
// val null_thresh = df.count() * 0.9 // if you want to use a percentage instead
val to_keep = df.columns.filter(
  c => df.agg(
    sum(when(df(c).isNull, 1).otherwise(0)).alias(c)
  ).first().getLong(0) <= null_thresh
)
df.select(to_keep.head, to_keep.tail: _*).show
You get:
+----+----+
|   B|   C|
+----+----+
|   2|   a|
|   3|null|
|   4|   b|
|null|   c|
+----+----+
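One caveat with the filter above: it runs one Spark aggregation job per column, which is slow with 100 columns. As a sketch (assuming the same df and a running Spark session), all null counts can be gathered in a single pass with one select, using the idiom that count skips nulls and when returns null when its condition is false:

```scala
import org.apache.spark.sql.functions.{col, count, when}

// One pass over the data: count(when(isNull, c)) counts only the rows
// where column c is null, because `when` yields null for all other rows
val nullCounts = df.select(
  df.columns.map(c => count(when(col(c).isNull, c)).alias(c)): _*
).first()

// Keep a column only if its null count does not exceed the threshold
val nullThresh = df.count() * 0.9 // drop columns that are more than 90% null
val toKeep = df.columns.filter(c => nullCounts.getAs[Long](c) <= nullThresh)
df.select(toKeep.head, toKeep.tail: _*).show
```

This computes the same to_keep set as the loop above but with a single job instead of one per column.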
Answer 1 (score: 2)
org.apache.spark.sql.functions.array
and udf
will help here. (Note that this approach filters out rows that are mostly null; the question asks about columns, for which the first answer's approach applies.)
import spark.implicits._
import org.apache.spark.sql.functions._
val df = sc.parallelize[(String, String, String, String, String, String, String, String, String, String)](
  Seq(
    ("a", null, null, null, null, null, null, null, null, null), // 90% null
    ("b", null, null, null, null, null, null, null, null, ""),   // 80% null
    ("c", null, null, null, null, null, null, null, "", "")      // 70% null
  )
).toDF("c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10")

// count the nulls in each row and check whether they reach 90% of the columns
val check_90_null = udf { xs: Seq[String] =>
  xs.count(_ == null) >= (xs.length * 0.9)
}

// gather all columns into a single array column
val columns = array(df.columns.map(col): _*)

// keep only the rows that are NOT at least 90% null
df.where(not(check_90_null(columns)))
  .show()
which shows
+---+----+----+----+----+----+----+----+----+---+
| c1|  c2|  c3|  c4|  c5|  c6|  c7|  c8|  c9|c10|
+---+----+----+----+----+----+----+----+----+---+
|  b|null|null|null|null|null|null|null|null|   |
|  c|null|null|null|null|null|null|null|    |   |
+---+----+----+----+----+----+----+----+----+---+
The "a" row has been excluded.
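Since the sample data mixes null and "" values, you may want the predicate to treat empty strings as missing too. That is an assumption about what "missing" means for your data; a sketch of the widened predicate as a plain function (which could be wrapped in the same udf):

```scala
// Treat both null and "" as missing values (assumption: an empty string
// carries no information for this dataset)
def mostlyMissing(xs: Seq[String], threshold: Double = 0.9): Boolean =
  xs.count(x => x == null || x.isEmpty) >= xs.length * threshold

// Spark wiring would look like:
// val checkMostlyMissing = udf { xs: Seq[String] => mostlyMissing(xs) }
// df.where(!checkMostlyMissing(array(df.columns.map(col): _*)))
```

Note that under this wider definition the "b" and "c" sample rows each have 9 of 10 values missing and would also be dropped at the 90% threshold, so choose the threshold and the definition of "missing" deliberately.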