How do I drop rows that are null in all columns of a PySpark dataframe?

Asked: 2018-01-12 15:05:48

Tags: python apache-spark pyspark apache-spark-sql pyspark-sql

Given a dataframe, before:

+----+----+----+
|  ID|TYPE|CODE|
+----+----+----+
|   1|   B|  X1|
|null|null|null|
|null|   B|  X1|
+----+----+----+

I want it to look like this after:

+----+----+----+
|  ID|TYPE|CODE|
+----+----+----+
|   1|   B|  X1|
|null|   B|  X1|
+----+----+----+

I would prefer a generic approach that still works when df.columns is very long. Thanks!

3 Answers:

Answer 0 (score: 13)

Providing a strategy to na.drop is all you need:

df = spark.createDataFrame([
    (1, "B", "X1"), (None, None, None), (None, "B", "X1"), (None, "C", None)],
    ("ID", "TYPE", "CODE")
)

df.na.drop(how="all").show()
+----+----+----+
|  ID|TYPE|CODE|
+----+----+----+
|   1|   B|  X1|
|null|   B|  X1|
|null|   C|null|
+----+----+----+
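
For contrast, a quick sketch of the default strategy: how="any" drops every row that contains at least one null, which here would keep only the first row:

df.na.drop(how="any").show()
+---+----+----+
| ID|TYPE|CODE|
+---+----+----+
|  1|   B|  X1|
+---+----+----+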

An alternative formulation can be achieved with thresh (the minimum number of NOT NULL values required to keep a row):

df.na.drop(thresh=1).show()
+----+----+----+
|  ID|TYPE|CODE|
+----+----+----+
|   1|   B|  X1|
|null|   B|  X1|
|null|   C|null|
+----+----+----+
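
To make the thresh semantics concrete, here is a quick sketch on the same df with thresh=2, which keeps only rows that have at least two non-null values, so the (null, C, null) row is dropped as well:

df.na.drop(thresh=2).show()
+----+----+----+
|  ID|TYPE|CODE|
+----+----+----+
|   1|   B|  X1|
|null|   B|  X1|
+----+----+----+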

Answer 1 (score: 4)

One option is to use functools.reduce to build the condition:

from functools import reduce
df.filter(~reduce(lambda x, y: x & y, [df[c].isNull() for c in df.columns])).show()
+----+----+----+
|  ID|TYPE|CODE|
+----+----+----+
|   1|   B|  X1|
|null|   B|  X1|
+----+----+----+

where reduce generates a query as follows:

~reduce(lambda x, y: x & y, [df[c].isNull() for c in df.columns])
# Column<b'(NOT (((ID IS NULL) AND (TYPE IS NULL)) AND (CODE IS NULL)))'>
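
As a related sketch, applying De Morgan's law gives an equivalent condition that keeps rows where at least one column is not null, avoiding the outer negation:

from functools import reduce
df.filter(reduce(lambda x, y: x | y, [df[c].isNotNull() for c in df.columns])).show()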

Answer 2 (score: 0)

You can try this:

df = df.dropna(how='all')
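
As a side note, dropna also accepts a subset parameter, so a sketch like the following (the column choice here is just for illustration) would consider only some columns when deciding whether a row is entirely null:

# Drop rows where both TYPE and CODE are null, ignoring ID.
df = df.dropna(how='all', subset=['TYPE', 'CODE'])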