How to display only the relevant columns of a Spark DataFrame?

Posted: 2018-12-07 03:37:45

Tags: scala apache-spark-sql

I have a large JSON file with 432 key-value pairs and many such rows of data. The data loads fine, but when I call df.show() to display 20 rows, I see a pile of nulls. The file is very sparse, which makes it hard to do anything useful with. Ideally I would drop every column that contains only nulls across those 20 rows, but with this many key-value pairs that is hard to do by hand. Is there a way to detect which columns of a Spark DataFrame contain only nulls and drop them?

2 Answers:

Answer 0 (score: 1)

You can try the following; for more background, see referred_question.

scala> val df = Seq((1,2,null),(3,4,null),(5,6,null),(7,8,"9")).toDF("a","b","c")

scala> df.show
+---+---+----+
|  a|  b|   c|
+---+---+----+
|  1|  2|null|
|  3|  4|null|
|  5|  6|null|
|  7|  8|   9|
+---+---+----+

scala> val dfl = df.limit(3) // limit to the number of rows you need; in your case it is 20

scala> val col_names = dfl.select(dfl.columns.map(x => count(col(x)).alias(x)):_*).first.toSeq.zipWithIndex.filter(x => x._1.toString.toInt > 0).map(_._2).map(x => dfl.columns(x)).map(x => col(x)) // collects the names of columns that contain at least one non-null value
col_names: Seq[org.apache.spark.sql.Column] = ArrayBuffer(a, b)

scala> dfl.select(col_names : _*).show
+---+---+
|  a|  b|
+---+---+
|  1|  2|
|  3|  4|
|  5|  6|
+---+---+
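The heavy method chaining makes that one-liner hard to follow. As a hypothetical, Spark-free sketch of its core step, where `columns` stands in for `dfl.columns` and `nonNullCounts` for the first row of the `count(col(x))` aggregation (both values below are made up to mirror the sample data):

```scala
// Plain-Scala sketch of the filtering step: keep only the columns
// whose non-null count is positive. No Spark required.
val columns = Seq("a", "b", "c")
val nonNullCounts = Seq(3L, 3L, 0L) // "c" is entirely null in the sample
val keep = columns.zip(nonNullCounts).collect { case (name, n) if n > 0 => name }
println(keep.mkString(",")) // prints "a,b"
```

The surviving names can then be fed back into a projection, e.g. `dfl.select(keep.map(col): _*)`, exactly as the answer does.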

Let me know if it works for you.

Answer 1 (score: 1)

Similar to Sathiyan's idea, but embedding the column name in the count() result itself.

scala>  val df = Seq((1,2,null),(3,4,null),(5,6,null)).toDF("a","b","c")
df: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]

scala> df.show
+---+---+----+
|  a|  b|   c|
+---+---+----+
|  1|  2|null|
|  3|  4|null|
|  5|  6|null|
+---+---+----+


scala> val notnull_cols = df.select(df.columns.map(x=>concat_ws("=",first(lit(x)),count(col(x)))):_*).first.toSeq.map(_.toString).filter(!_.contains("=0")).map( x=>col(x.split("=")(0)) )
notnull_cols: Seq[org.apache.spark.sql.Column] = ArrayBuffer(a, b)

scala> df.select(notnull_cols:_*).show
+---+---+
|  a|  b|
+---+---+
|  1|  2|
|  3|  4|
|  5|  6|
+---+---+

The intermediate result shows each column name paired with its non-null count (despite the `_nullcount` alias, the number is the count of non-null values):

scala> df.select(df.columns.map(x=>concat_ws("=",first(lit(x)),count(col(x))).as(x+"_nullcount")):_*).show
+-----------+-----------+-----------+
|a_nullcount|b_nullcount|c_nullcount|
+-----------+-----------+-----------+
|        a=3|        b=3|        c=0|
+-----------+-----------+-----------+
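The `filter(!_.contains("=0"))` step works because a positive count never begins with 0, so "=0" can only match a zero count; note, though, that a column whose own name contains "=0" would be dropped by mistake. A hypothetical, Spark-free sketch of that parsing step (the input strings mirror the table above):

```scala
// Plain-Scala sketch of the "name=count" parsing used by notnull_cols:
// drop entries with a zero count, then split off the column name.
val pairs = Seq("a=3", "b=3", "c=0") // first row of the concat_ws aggregation
val notnull = pairs.filter(!_.contains("=0")).map(_.split("=")(0))
println(notnull.mkString(",")) // prints "a,b"
```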

