I want to know the count of missing values in each column of a DataFrame in Spark Scala.
Sample output:
Header: col1missigcount:2, col2misscount:1, col3misscount:2
My code:
val samplesqlDF = spark.sql("SELECT * FROM sampletable")
samplesqlDF.show()
Sample table data:
+---------+-----+--------+
| name    | age | degree |
+---------+-----+--------+
| ram     |     | MCA    |
|         | 25  |        |
|         | 26  | BE     |
| Suganya | 24  |        |
+---------+-----+--------+
Answer 0 (score: 0)
If you are not casting the whitespace in the string columns to null, you can use the following approach:
scala> val df = Seq(("ram"," ","MCA"),("","25",""),("","26","BE"),("Suganya","24","")).toDF("name","age","degree")
df: org.apache.spark.sql.DataFrame = [name: string, age: string ... 1 more field]
scala> val df2 = df.withColumn("age",'age.cast("int"))
df2: org.apache.spark.sql.DataFrame = [name: string, age: int ... 1 more field]
scala> df2.show
+-------+----+------+
| name| age|degree|
+-------+----+------+
| ram|null| MCA|
| | 25| |
| | 26| BE|
|Suganya| 24| |
+-------+----+------+
scala> df2.agg(sum(when('age.isNull,1).otherwise(0)).as("agec"), sum(when('name==="",1).otherwise(0)).as("namec"),sum(when('degree==="",1).otherwise(0)).as("degreec")).show
+----+-----+-------+
|agec|namec|degreec|
+----+-----+-------+
| 1| 2| 2|
+----+-----+-------+
scala>
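An alternative to writing one `when` condition per column is to first normalize blank or whitespace-only strings to null, and then count nulls uniformly across all columns. The sketch below is an assumption-laden variant (it reuses the `df2` from the transcript above and is not part of the original answer):

```scala
// Hedged sketch: normalize blank/whitespace-only values to null first,
// then count nulls with a single uniform expression per column.
// Assumes a SparkSession `spark` and the df2 DataFrame from above.
import org.apache.spark.sql.functions._

// Replace any value whose string form trims to "" with null.
val normalized = df2.columns.foldLeft(df2) { (acc, c) =>
  acc.withColumn(c,
    when(trim(col(c).cast("string")) === "", lit(null)).otherwise(col(c)))
}

// Now every "missing" value is a real null, so one expression fits all columns.
val counts = normalized.select(normalized.columns.map(c =>
  sum(when(col(c).isNull, 1).otherwise(0)).as(c + "_missing")): _*)
counts.show()
```

This trades a second pass over the columns for simpler counting logic; for a handful of columns the difference is negligible.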
Answer 1 (score: 0)
Use df.columns to get the columns of the dataframe, then use dataframe functions such as col(), agg(), and sum().
import org.apache.spark.sql.functions._
scala> val df = Seq(("ram"," ","MCA"),("","25",""),("","26","BE"),("Suganya","24","")).toDF("name","age","degree")
df: org.apache.spark.sql.DataFrame = [name: string, age: string ... 1 more field]
// You can get all columns names in an array
scala> df.columns
res12: Array[String] = Array(name, age, degree)
// Now map over all column names, creating a sum-expression for each column.
scala> val aggCols = df.columns.map(colName =>
// Create a sum column, with conditions as per your requirement.
sum(when(col(colName).isNull
|| col(colName) === ""
|| col(colName) === " ",1).otherwise(0)
// Alias each column by appending "_c"
).as(colName + "_c"))
aggCols: Array[org.apache.spark.sql.Column] = Array(sum(CASE WHEN (((name IS NULL) OR (name = )) OR (name = )) THEN 1 ELSE 0 END) AS `name_c`, sum(CASE WHEN (((age IS NULL) OR (age = )) OR (age = )) THEN 1 ELSE 0 END) AS `age_c`, sum(CASE WHEN (((degree IS NULL) OR (degree = )) OR (degree = )) THEN 1 ELSE 0 END) AS `degree_c`)
// Use agg function and apply the array of sum-expressions.
scala> df.agg(aggCols.head, aggCols.tail: _*).show
+------+-----+--------+
|name_c|age_c|degree_c|
+------+-----+--------+
| 2| 1| 2|
+------+-----+--------+
You may also note that df.schema carries more metadata than df.columns.
scala> df.schema
res14: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(age,StringType,true), StructField(degree,StringType,true))
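That extra metadata can be put to use: since df.schema exposes each column's type, the empty-string check can be applied only to string columns, while other types are checked for null alone. A minimal sketch, assuming the same `df` as above (this refinement is not in the original answer):

```scala
// Hedged sketch: build per-column missing-value conditions from df.schema,
// applying the ""/" " checks only where the column is actually a string.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType

val aggCols = df.schema.fields.map { f =>
  val missing = f.dataType match {
    // String columns: null, empty, or whitespace-only counts as missing.
    case StringType => col(f.name).isNull || trim(col(f.name)) === ""
    // Other types: only null counts as missing.
    case _          => col(f.name).isNull
  }
  sum(when(missing, 1).otherwise(0)).as(f.name + "_c")
}

df.agg(aggCols.head, aggCols.tail: _*).show()
```

This avoids silently comparing numeric columns against the empty string, which otherwise relies on implicit casts.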