Question

I have just started using pyspark. I have a two-column dataframe in which one column contains some null values, for example:

df1
A             B
1a3b          7
0d4s          12
6w2r          null
6w2r          null
1p4e          null

and another dataframe that holds the correct mapping, i.e.

df2
A             B
1a3b          7
0d4s          12
6w2r          0
1p4e          3

So I want to fill in the null values in df1 using df2, such that the result is:

A             B
1a3b          7
0d4s          12
6w2r          0
6w2r          0
1p4e          3

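In pandas, I would first build a lookup dictionary from df2 and then use apply on df1 to fill in the nulls. But I am not sure which function to use in pyspark. Most of the null-replacement examples I have seen are based on simple conditions, for example filling all nulls in a column with a single constant value.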

What I have tried is:

from pyspark.sql.functions import when, col

df1.withColumn('B', when(df.B.isNull(), df2.where(df2.B== df1.B).select('A')))

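However, I get AttributeError: 'DataFrame' object has no attribute '_get_object_id'. The idea was to first filter for the null values and then replace them with the values of column B from df2, but I think df.B.isNull() evaluates the whole column rather than a single value, so this is probably not the right way to do it. Any suggestions?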

Answer 1

A left join on the common column A, followed by selecting the appropriate columns, gives the desired output:

df1.join(df2, df1.A == df2.A, 'left').select(df1.A, df2.B).show(truncate=False)

which should give you

+----+---+
|A   |B  |
+----+---+
|6w2r|0  |
|6w2r|0  |
|1a3b|7  |
|1p4e|3  |
|0d4s|12 |
+----+---+
