How to count missing data per row in a DataFrame

Asked: 2017-07-11 16:57:30

Tags: python pyspark

I have this sample DataFrame:

from pyspark.sql.types import *

schema = StructType([
    StructField("ClientId", IntegerType(), True),
    StructField("m_ant21", IntegerType(), True),
    StructField("m_ant22", IntegerType(), True),
    StructField("m_ant23", IntegerType(), True),
    StructField("m_ant24", IntegerType(), True)])

df = sqlContext.createDataFrame(
    data=[(0, None, None, None, None),
          (1, 23, 13, 17, 99),
          (2, 0, 0, 0, 1),
          (3, 0, None, 1, 0),
          (4, None, None, None, None)],
    schema=schema)

That gives me this DataFrame:

 +--------+-------+-------+-------+-------+
 |ClientId|m_ant21|m_ant22|m_ant23|m_ant24|
 +--------+-------+-------+-------+-------+
 |       0|   null|   null|   null|   null|
 |       1|     23|     13|     17|     99|
 |       2|      0|      0|      0|      1|
 |       3|      0|   null|      1|      0|
 |       4|   null|   null|   null|   null|
 +--------+-------+-------+-------+-------+

I need to solve this problem: I want to create a new variable that counts how many null values each row has. For example:

  • ClientId 0 should be 4
  • ClientId 1 should be 0
  • ClientId 3 should be 1

Note that df is a pyspark.sql.dataframe.DataFrame.

1 Answer:

Answer 0 (score: 2):

Here is one option:

from pyspark.sql import Row

# add the new count column to the original schema
schema.add(StructField("count_null", IntegerType(), True))

# convert the DataFrame to an RDD and append the null count to each row
df.rdd.map(lambda row: row + Row(sum(x is None for x in row))).toDF(schema).show()

+--------+-------+-------+-------+-------+----------+
|ClientId|m_ant21|m_ant22|m_ant23|m_ant24|count_null|
+--------+-------+-------+-------+-------+----------+
|       0|   null|   null|   null|   null|         4|
|       1|     23|     13|     17|     99|         0|
|       2|      0|      0|      0|      1|         0|
|       3|      0|   null|      1|      0|         1|
|       4|   null|   null|   null|   null|         4|
+--------+-------+-------+-------+-------+----------+
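
As a side note, the lambda above counts None values across every field in the row, including ClientId. If you only want to count the m_ant* columns, a minimal variant of the same RDD approach could look like the sketch below (the column filter is my own addition, not part of the original answer, and it reuses the schema extended above):

from pyspark.sql import Row

# columns whose nulls we actually want to count (assumption: skip ClientId)
ant_cols = [c for c in df.columns if c.startswith("m_ant")]

# same pattern as above, but the count is restricted to the selected columns
df.rdd.map(
    lambda row: row + Row(sum(row[c] is None for c in ant_cols))
).toDF(schema).show()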

If you don't want to deal with the schema, here is another option:

from pyspark.sql.functions import col, when

df.withColumn(
    "count_null",
    sum([when(col(x).isNull(), 1).otherwise(0) for x in df.columns])
).show()

+--------+-------+-------+-------+-------+----------+
|ClientId|m_ant21|m_ant22|m_ant23|m_ant24|count_null|
+--------+-------+-------+-------+-------+----------+
|       0|   null|   null|   null|   null|         4|
|       1|     23|     13|     17|     99|         0|
|       2|      0|      0|      0|      1|         0|
|       3|      0|   null|      1|      0|         1|
|       4|   null|   null|   null|   null|         4|
+--------+-------+-------+-------+-------+----------+
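
Once the count column exists, it can be used like any other column. For example, here is a small sketch (my own illustration, not part of the original answer) that reuses the same counting expression and keeps only the clients with at least one missing value:

from pyspark.sql.functions import col, when

# build the counting expression once so it can be reused
null_count = sum([when(col(x).isNull(), 1).otherwise(0) for x in df.columns])

# keep only the rows that actually have missing data
df.withColumn("count_null", null_count).filter(col("count_null") > 0).show()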