From a DataFrame, I want to get the names of the columns that contain at least one null value.
Consider the following DataFrame:
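// Option is used for the nullable columns: mixing a bare null with Int values
// would make the element type Any, for which Spark cannot derive a schema.
val dataset = sparkSession.createDataFrame(Seq[(Int, Option[String], Option[Int], Double)](
  (7, None, Some(18), 1.0),
  (8, Some("CA"), None, 0.0),
  (9, Some("NZ"), Some(15), 0.0)
)).toDF("id", "country", "hour", "clicked")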
I expect to get the column names "country" and "hour".
id  country  hour  clicked
7   null     18    1
8   "CA"     null  0
9   "NZ"     15    0
Answer 0 (score: 2)
Here is one solution, but it is a bit awkward; I hope there is a simpler way:
import org.apache.spark.sql.functions.{col, sum, when}

val cols = dataset.columns
val columnsToSelect = dataset
  // count null values per column (sum 1 for each null, 0 otherwise)
  .select(cols.map(c => (sum(when(col(c).isNull, 1).otherwise(0)) > 0).alias(c)): _*)
  .head() // collect the result of the aggregation
  .getValuesMap[Boolean](cols) // map of column name -> "has nulls" flag
  .filter { case (_, hasNulls) => hasNulls } // keep the columns flagged true
  .keys.toSeq // and get the names of those columns

dataset
  .select(columnsToSelect.head, columnsToSelect.tail: _*)
  .show()
+-------+----+
|country|hour|
+-------+----+
|   null|  18|
|     CA|null|
|     NZ|  15|
+-------+----+
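If only the column names are needed, the same aggregation can be wrapped in a small helper. This is a minimal sketch rather than a definitive implementation: the function name `columnsWithNulls` is hypothetical, and it assumes column names contain no dots (which `col` would otherwise interpret as nested field access).

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count, when}

// Hypothetical helper (the name is illustrative, not part of Spark):
// returns the names of the columns that contain at least one null.
def columnsWithNulls(df: DataFrame): Seq[String] = {
  val cols = df.columns
  if (cols.isEmpty) Seq.empty
  else {
    // when(...) without otherwise(...) yields null for non-matching rows and
    // count(...) skips nulls, so each aggregate is that column's null count.
    val nullCounts = df
      .select(cols.map(c => count(when(col(c).isNull, true)).alias(c)): _*)
      .head()
    cols.zipWithIndex.collect {
      case (name, i) if nullCounts.getLong(i) > 0 => name
    }.toSeq
  }
}
```

With the DataFrame from the question, `columnsWithNulls(dataset)` should return the names country and hour, in the DataFrame's column order.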
Answer 1 (score: 1)
A slight modification of this answer, comparing each column's non-null count with the number of rows:
import org.apache.spark.sql.functions.{count,col}
// Get number of rows
val nr_rows = dataset.count
// Get column indices
val col_inds = dataset.select(dataset.columns.map(c => count(col(c)).alias(c)): _*)
.collect()(0)
.toSeq.zipWithIndex
.filter(_._1 != nr_rows).map(_._2)
// Subset column names using the indices
col_inds.map(i => dataset.columns.apply(i))
// Seq[String] = ArrayBuffer(country, hour)
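The approach above runs two Spark jobs (one for `dataset.count`, one for the per-column counts). As a hedged variation, the row count can be computed in the same aggregation and the result matched against the column names directly instead of going through indices. The function name `columnsWithNullsByCount` and the alias `__rows` are assumptions made for this sketch; like the original, it assumes column names without dots.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count, lit}

// Hypothetical helper (the name is illustrative): a column has nulls when its
// non-null count differs from the total number of rows.
def columnsWithNullsByCount(df: DataFrame): Seq[String] = {
  val names = df.columns
  // count(lit(1)) counts every row; count(col(c)) counts only non-null values.
  // "__rows" is an arbitrary alias; values are read by position, not by name.
  val aggs = count(lit(1)).alias("__rows") +: names.map(c => count(col(c)).alias(c))
  val row = df.select(aggs: _*).head()
  val totalRows = row.getLong(0)
  // The count for names(i) sits at position i + 1 because the row count comes first.
  names.zipWithIndex.collect {
    case (name, i) if row.getLong(i + 1) != totalRows => name
  }.toSeq
}
```

Calling `columnsWithNullsByCount(dataset)` on the DataFrame from the question should yield country and hour.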