Question

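I have a DataFrame with 100 columns in Spark / Scala. Many of the columns contain a lot of null values. I want to find the columns that are more than 90% null and drop them from my DataFrame. How can I do this in Spark / Scala?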

Answer 1

Suppose you have a DataFrame like this:

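import spark.implicits._   // needed for toDF; `spark` is predefined in spark-shell

// None (rather than null) marks a missing value in an Option column
val df = Seq((Some(1.0), Some(2), Some("a")),
             (None, Some(3), None),
             (Some(2.0), Some(4), Some("b")),
             (None, None, Some("c"))
            ).toDF("A", "B", "C")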

df.show
+----+----+----+
|   A|   B|   C|
+----+----+----+
| 1.0|   2|   a|
|null|   3|null|
| 2.0|   4|   b|
|null|null|   c|
+----+----+----+

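Use the agg function to count the nulls in each column, then filter the columns by comparing that count against a threshold, which is set to 1 here: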

val null_thresh = 1                 // if you want to use percentage 
                                    // val null_thresh = df.count() * 0.9

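import org.apache.spark.sql.functions.{sum, when}

// keep the columns whose null count is within the threshold
// (this runs one aggregation job per column)
val to_keep = df.columns.filter(
    c => df.agg(
        sum(when(df(c).isNull, 1).otherwise(0)).alias(c)
    ).first().getLong(0) <= null_thresh
)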

df.select(to_keep.head, to_keep.tail: _*).show

This gives:

+----+----+
|   B|   C|
+----+----+
|   2|   a|
|   3|null|
|   4|   b|
|null|   c|
+----+----+
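
The filter above launches one aggregation per column, so a 100-column DataFrame is scanned 100 times. Below is a minimal sketch of a single-pass variant that uses the percentage threshold from the question. It assumes a local SparkSession, and the names `nullFraction`, `aggExprs`, `stats`, `toDrop` and `cleaned` are illustrative rather than part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, lit, when}

// Assumption: a local session for illustration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("null-columns").getOrCreate()
import spark.implicits._

val df = Seq[(Option[Double], Option[Int], Option[String])](
  (Some(1.0), Some(2), Some("a")),
  (None, Some(3), None),
  (Some(2.0), Some(4), Some("b")),
  (None, None, Some("c"))
).toDF("A", "B", "C")

// Hypothetical threshold: drop columns whose fraction of nulls exceeds this value.
val nullFraction = 0.9

// A single aggregation returns the row count followed by the null count of every column.
// count(when(isNull, 1)) counts only the rows where the column is null.
val aggExprs = count(lit(1)).alias("__total") +:
  df.columns.map(c => count(when(col(c).isNull, 1)).alias(c))
val stats = df.agg(aggExprs.head, aggExprs.tail: _*).first()

val total = stats.getLong(0)
val toDrop = df.columns.zipWithIndex.collect {
  case (c, i) if stats.getLong(i + 1) > total * nullFraction => c
}

val cleaned = df.drop(toDrop: _*)
```

On this four-row sample no column exceeds 90% nulls, so `toDrop` is empty and `cleaned` keeps all three columns. Lowering `nullFraction` to 0.4 would drop `A`, which has 2 nulls out of 4 rows.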

Answer 2

org.apache.spark.sql.functions.array and udf can help here.

import spark.implicits._
import org.apache.spark.sql.functions._

val df = sc.parallelize[(String, String, String, String, String, String, String, String, String, String)](
  Seq(
    ("a", null, null, null, null, null, null, null, null, null), // 90%
    ("b", null, null, null, null, null, null, null, null, ""), // 80%
    ("c", null, null, null, null, null, null, null, "", "") // 70%
  )
).toDF("c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9","c10")

// count nulls then check the condition
val check_90_null = udf { xs: Seq[String] =>
  xs.count(_ == null) >= (xs.length * 0.9)
}

// all columns as array
val columns = array(df.columns.map(col): _*)

// filter out
df.where(not(check_90_null(columns)))
  .show()

This displays:

+---+----+----+----+----+----+----+----+----+---+
| c1|  c2|  c3|  c4|  c5|  c6|  c7|  c8|  c9|c10|
+---+----+----+----+----+----+----+----+----+---+
|  b|null|null|null|null|null|null|null|null|   |
|  c|null|null|null|null|null|null|null|    |   |
+---+----+----+----+----+----+----+----+----+---+

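Row "a" is excluded. Note that this approach filters out rows that are at least 90% null; it does not drop columns.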
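
The same row filter can probably be written without a UDF. The sketch below sums a 0/1 null indicator across all columns using built-in column expressions. Unlike `array`, this does not require every column to have the same type. It assumes a local SparkSession, and the names `nullsPerRow` and `kept` are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

// Assumption: a local session for illustration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("null-rows").getOrCreate()
import spark.implicits._

val df = Seq[(String, String, String, String, String, String, String, String, String, String)](
  ("a", null, null, null, null, null, null, null, null, null), // 90% null
  ("b", null, null, null, null, null, null, null, null, ""),   // 80% null
  ("c", null, null, null, null, null, null, null, "", "")      // 70% null
).toDF("c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10")

// For each row: 1 for every null cell, 0 otherwise, added up across all columns.
val nullsPerRow = df.columns
  .map(c => when(col(c).isNull, 1).otherwise(0))
  .reduce(_ + _)

// Keep the rows whose null count is below 90% of the column count.
val kept = df.where(nullsPerRow < df.columns.length * 0.9)
```

Here `kept` contains rows "b" and "c", matching the UDF version.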

How to find columns with many nulls in Spark / Scala
