So here is what I have, basically:
C1 C2 C3 C4
a 0 1 null 4
b 0 1 3 4
c 0 1 4 4
d 0 null 5 4
As far as dropping goes, I've already done this and it works:
sub=['C2','C3']
df = df.na.drop(subset=sub)
C1 C2 C3 C4
b 0 1 3 4
c 0 1 4 4
But now I'd really like to save the rows that contain nulls into another dataframe, so I can add them back later with some function.
Dataframe_of_nulls:
C1 C2 C3 C4
a 0 1 null 4
d 0 null 5 4
Feel free to ignore the indexes; they're just there to make the example less confusing.
Answer 0 (score: 1):
You can filter once for each case:
from pyspark.sql.functions import col, lit
from operator import or_
from functools import reduce

def split_on_null(df, subset):
    # Combine the per-column isNull() checks into a single boolean condition
    any_null = reduce(or_, (col(c).isNull() for c in subset), lit(False))
    # First dataframe: rows with a null in any subset column; second: the rest
    return df.where(any_null), df.where(~any_null)
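The reduce over or_ simply chains the isNull() checks with |, so for the subset in the question the condition is equivalent to col('C2').isNull() | col('C3').isNull(); the lit(False) initializer means an empty subset sends every row to the no-nulls dataframe.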
Usage:
# sub = ['C2', 'C3'] as defined in the question above
df = spark.createDataFrame([
    (0, 1, None, 4), (0, 1, 3, 4), (0, 1, 4, 4), (0, None, 5, 4),
    (0, 1, 3, 4), (0, None, 5, 4)]
).toDF("c1", "c2", "c3", "c4")

with_nulls, without_nulls = split_on_null(df, sub)
with_nulls.show()
+---+----+----+---+
| c1| c2| c3| c4|
+---+----+----+---+
| 0| 1|null| 4|
| 0|null| 5| 4|
| 0|null| 5| 4|
+---+----+----+---+
without_nulls.show()
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
| 0| 1| 3| 4|
| 0| 1| 4| 4|
| 0| 1| 3| 4|
+---+---+---+---+
An alternative solution is subtract:
without_nulls_ = df.na.drop(subset=sub)
with_nulls_ = df.subtract(without_nulls_)
but it is more expensive and does not preserve duplicates:
without_nulls_.show()
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
| 0| 1| 3| 4|
| 0| 1| 4| 4|
| 0| 1| 3| 4|
+---+---+---+---+
with_nulls_.show()
+---+----+----+---+
| c1| c2| c3| c4|
+---+----+----+---+
| 0|null| 5| 4|
| 0| 1|null| 4|
+---+----+----+---+
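If duplicates matter and you are on Spark 2.4 or later, a sketch of another option (not part of the original answer) is exceptAll, which behaves like subtract but keeps duplicate rows, so the two halves together still add up to the original dataframe:

# assumes Spark 2.4+, where DataFrame.exceptAll is available
without_nulls_ = df.na.drop(subset=sub)
with_nulls_ = df.exceptAll(without_nulls_)   # keeps the duplicated (0, null, 5, 4) row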