How to count null, NA and NaN values in each column of a PySpark DataFrame

Date: 2019-05-12 01:04:43

Tags: apache-spark pyspark apache-spark-sql

The DataFrame contains NA, NaN and null values. Schema: (Name: string, Rol.No: integer, Dept: string). Example:

Name  Rol.No  Dept
priya  345     cse
James  NA       Nan
Null   567      NULL

Expected output: each column name together with its combined count of null, NA and NaN values:

Name 1
Rol.No 1
Dept 2

1 Answer:

Answer 0 (score: 2):

Use when() inside count():
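The trick is that count() ignores nulls, while when() returns null whenever its condition is false, so wrapping a condition in when() turns count() into a conditional counter. A minimal sketch of the pattern on a single column (assuming df is the question's example DataFrame and F is pyspark.sql.functions):

# Rows where "Name" is not null become null inside when() and are
# therefore skipped by count(), leaving only the null-count for "Name".
df.select(F.count(F.when(F.col("Name").isNull(), "Name")).alias("Name"))

The full example below applies this same pattern to every column at once.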


spark.version
'2.3.2'

import numpy as np
import pyspark.sql.functions as F
import pyspark.sql.types as T

schema = T.StructType([
    T.StructField("Name", T.StringType(), True),
    # "Rol.No" is renamed to "RolNo" (dots in column names need backtick-quoting
    # in Spark) and stored as a string so it can hold the literal "NA"
    T.StructField("RolNo", T.StringType(), True),
    T.StructField("Dept", T.StringType(), True),
])


rows = sc.parallelize([
    ("priy", "345", "cse"),
    ("james", "NA", np.nan),  # np.nan is stored as the string "NaN" (see show() below)
    (None, "567", "NULL"),
])

myDF = spark.createDataFrame(rows, schema)

myDF.show()
+-----+-----+----+
| Name|RolNo|Dept|
+-----+-----+----+
| priy|  345| cse|
|james|   NA| NaN|
| null|  567|NULL|
+-----+-----+----+

# For every column, count the rows that are NaN, that contain the text
# "NA" or "NULL", or that are null
myDF = myDF.select([F.count(F.when(F.isnan(i) |
                                   F.col(i).contains('NA') |
                                   F.col(i).contains('NULL') |
                                   F.col(i).isNull(), i)).alias(i)
                    for i in myDF.columns])

myDF.show()
+----+-----+----+
|Name|RolNo|Dept|
+----+-----+----+
|   1|    1|   2|
+----+-----+----+
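
If you want the column-per-line listing from the question rather than a one-row DataFrame, a small follow-up sketch that collects the single result row to the driver:

# Collect the one-row result and print one "column count" pair per line
counts = myDF.collect()[0].asDict()
for col_name, n in counts.items():
    print(col_name, n)
# Name 1
# RolNo 1
# Dept 2

Note that contains('NA') is a substring match and would also flag a value such as "NASA"; if only the exact markers should count, F.col(i).isin('NA', 'NULL') is stricter.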