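I have a large JSON file with 432 key-value pairs per record and many rows of such data. The data loads fine, but when I display 20 rows with df.show(), I see a lot of nulls. The file is very sparse, which makes it hard to work with. It would be better to drop the columns that contain only nulls in those 20 rows, but with this many key-value pairs that is hard to do by hand. Is there a way to detect which columns of a Spark DataFrame contain only null values and drop them?
Answer 0: (score: 1)
You can try the following; for more information, see referred_question.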
scala> val df = Seq((1,2,null),(3,4,null),(5,6,null),(7,8,"9")).toDF("a","b","c")
scala> df.show
+---+---+----+
|  a|  b|   c|
+---+---+----+
|  1|  2|null|
|  3|  4|null|
|  5|  6|null|
|  7|  8|   9|
+---+---+----+
scala> val dfl = df.limit(3) //limiting the number of rows you need, in your case it is 20
scala> val col_names = dfl.select(dfl.columns.map(x => count(col(x)).alias(x)):_*).first.toSeq.zipWithIndex.filter(x => x._1.toString.toInt > 0).map(_._2).map(x => dfl.columns(x)).map(x => col(x)) // count() skips nulls, so this keeps only the columns that have at least one non-null value
col_names: Seq[org.apache.spark.sql.Column] = ArrayBuffer(a, b)
scala> dfl.select(col_names : _*).show
+---+---+
|  a|  b|
+---+---+
|  1|  2|
|  3|  4|
|  5|  6|
+---+---+
Let me know if it works for you.
Answer 1: (score: 1)
Similar to Sathiyan's idea, but carrying the column name inside the count() result itself.
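The one-liner above can be wrapped in a reusable function. The following is only a minimal sketch: the name dropAllNullColumns and its sampleRows parameter are made up for illustration and do not come from the answer. It assumes column names without dots or other characters that col() would interpret specially, and a DataFrame with at least one column.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count}

// Hypothetical helper (name and signature are illustrative only).
// Drops every column that has no non-null value in the inspected rows.
// sampleRows = Some(n) inspects only the first n rows (via limit);
// None inspects the whole DataFrame.
def dropAllNullColumns(df: DataFrame, sampleRows: Option[Int] = None): DataFrame = {
  val sample = sampleRows.map(n => df.limit(n)).getOrElse(df)
  // count(column) ignores nulls, so a count of 0 means "all null".
  // A single select computes the counts for all columns in one pass.
  val counts = sample.select(sample.columns.map(c => count(col(c)).alias(c)): _*).first()
  val keep = df.columns.zipWithIndex.collect {
    case (name, i) if counts.getLong(i) > 0 => col(name)
  }
  // The selection is applied to the full DataFrame, not just the sample.
  df.select(keep: _*)
}
```

With the df from this answer, dropAllNullColumns(df, Some(3)) should keep only a and b, while dropAllNullColumns(df) keeps c as well because the fourth row has a value there. Unlike the one-liner, this sketch returns all rows of the original DataFrame; add .limit(n) to the result if only the sample is wanted.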
scala> val df = Seq((1,2,null),(3,4,null),(5,6,null)).toDF("a","b","c")
df: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
scala> df.show
+---+---+----+
|  a|  b|   c|
+---+---+----+
|  1|  2|null|
|  3|  4|null|
|  5|  6|null|
+---+---+----+
scala> val notnull_cols = df.select(df.columns.map(x=>concat_ws("=",first(lit(x)),count(col(x)))):_*).first.toSeq.map(_.toString).filter(!_.contains("=0")).map( x=>col(x.split("=")(0)) )
notnull_cols: Seq[org.apache.spark.sql.Column] = ArrayBuffer(a, b)
scala> df.select(notnull_cols:_*).show
+---+---+
|  a|  b|
+---+---+
|  1|  2|
|  3|  4|
|  5|  6|
+---+---+
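The intermediate result shows each column name together with its count. Despite the _nullcount alias, count() returns the number of non-null values, so a count of 0 marks an all-null column: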
scala> df.select(df.columns.map(x=>concat_ws("=",first(lit(x)),count(col(x))).as(x+"_nullcount")):_*).show
+-----------+-----------+-----------+
|a_nullcount|b_nullcount|c_nullcount|
+-----------+-----------+-----------+
|        a=3|        b=3|        c=0|
+-----------+-----------+-----------+
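Encoding the name and count into one "name=count" string works for the example, but it relies on string parsing and would misbehave if a column name itself contained "=". As a hedged alternative, the counts can be collected into a Map keyed by column name. The helper name nonNullCounts below is hypothetical, and the sketch assumes unique column names that col() can resolve directly.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count}

// Hypothetical helper: maps each column name to its number of non-null values.
def nonNullCounts(df: DataFrame): Map[String, Long] = {
  // One aggregation row holding a count per column, in column order.
  val row = df.select(df.columns.map(c => count(col(c))): _*).first()
  df.columns.zipWithIndex.map { case (name, i) => name -> row.getLong(i) }.toMap
}
```

For the df in this answer, val counts = nonNullCounts(df) should give 3 for a and b and 0 for c, matching the intermediate result above. The non-null columns can then be selected with df.select(df.columns.filter(counts(_) > 0).map(col): _*).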