Pyspark: drop rows with nulls in a subset of columns, save them, then add them back later

Time: 2018-01-02 19:07:36

Tags: apache-spark pyspark

So basically I have something like this:

   C1   C2   C3    C4
a   0    1    null  4
b   0    1    3     4
c   0    1    4     4
d   0    null 5     4

As far as dropping goes, I've already done something like this, and it works:

sub=['C2','C3']
df = df.na.drop(subset=sub)

   C1   C2   C3   C4
b   0    1    3    4
c   0    1    4    4

But now I'd really like to keep those rows with null values in a separate dataframe, so I can apply some function to them and add them back later.

Dataframe_of_nulls:
   C1   C2   C3   C4
a   0    1    null 4
d   0    null 5    4

Feel free to ignore the indices; they are only there to make the example less confusing.

1 answer:

Answer 0 (score: 1)

You can filter for each case:

from pyspark.sql.functions import col, lit
from operator import or_ 
from functools import reduce


def split_on_null(df, subset):
    # Build a single boolean column that is True when any of the given
    # columns is null, then split the DataFrame on that predicate.
    any_null = reduce(or_, (col(c).isNull() for c in subset), lit(False))
    return df.where(any_null), df.where(~any_null)

Usage:

df = spark.createDataFrame([
    (0, 1, None, 4), (0, 1, 3, 4), (0, 1, 4, 4), (0, None, 5, 4), 
    (0, 1, 3, 4), (0, None, 5, 4)]
).toDF("c1", "c2", "c3", "c4")

with_nulls, without_nulls = split_on_null(df, sub)
with_nulls.show()
+---+----+----+---+
| c1|  c2|  c3| c4|
+---+----+----+---+
|  0|   1|null|  4|
|  0|null|   5|  4|
|  0|null|   5|  4|
+---+----+----+---+
without_nulls.show()
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
|  0|  1|  3|  4|
|  0|  1|  4|  4|
|  0|  1|  3|  4|
+---+---+---+---+

An alternative solution is subtract:

without_nulls_ = df.na.drop(subset=sub)
with_nulls_ = df.subtract(without_nulls_)

but it is more expensive and does not preserve duplicates:

without_nulls_.show()
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
|  0|  1|  3|  4|
|  0|  1|  4|  4|
|  0|  1|  3|  4|
+---+---+---+---+
with_nulls_.show()
+---+----+----+---+                                                             
| c1|  c2|  c3| c4|
+---+----+----+---+
|  0|null|   5|  4|
|  0|   1|null|  4|
+---+----+----+---+
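
To add the saved rows back later, as the question asks, a plain union is enough, since both halves keep the original schema. Below is a minimal sketch; fillna(0) is only a stand-in for whatever repair function you actually intend to apply:

# Hypothetical "fix" step: fill the nulls in the saved rows
# (replace fillna with your own transformation).
repaired = with_nulls.fillna(0, subset=sub)

# Both halves come from the same source DataFrame, so the columns line up
# and union restores the full row set. On Spark 2.3+ you can use
# unionByName instead if the column order might differ.
restored = without_nulls.union(repaired)
restored.show()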