Question

There are two DataFrames, df1 and df2, with the same schema. ID is the primary key.

I need to merge df1 and df2. A union can do this, but there is one special requirement: if a row with the same ID exists in both df1 and df2, I need to keep the one from df1.

df1:

ID col1 col2
1  AA   2019
2  B    2018

df2:

ID col1 col2
1  A    2019
3  C    2017

I need the following output:

df1:

ID col1 col2
1  AA   2019
2  B    2018
3  C    2017

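How can I do this? Thanks. I think I could register two temp tables, do a full join, and use coalesce. But I don't like that approach, because there are actually about 40 columns rather than the 3 in the example above.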

Answer 1

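Given that the two DataFrames have the same schema, you can union df1 with the left_anti join of df2 against df1, which keeps only the df2 rows whose ID does not appear in df1: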

df1.union(df2.join(df1, "ID", "left_anti"))

Answer 2

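One way to do this is to union the DataFrames after adding an identifier column that records which DataFrame each row came from, and then use that column to prioritize rows from df1 with a function like row_number.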

A PySpark SQL solution is shown here.

from pyspark.sql import Window
from pyspark.sql.functions import lit, row_number, when

# Tag each row with the DataFrame it came from
df1_with_identifier = df1.withColumn('identifier', lit('df1'))
df2_with_identifier = df2.withColumn('identifier', lit('df2'))
merged_df = df1_with_identifier.union(df2_with_identifier)
# Define the window with the desired ordering: within each ID, df1 rows sort first
w = Window.partitionBy('ID').orderBy(when(merged_df.identifier == 'df1', 1).otherwise(2))
result = merged_df.withColumn('rownum', row_number().over(w))
# Keep only the top-priority row per ID, then drop the helper columns
result.filter(result.rownum == 1).drop('identifier', 'rownum').show()

A SQL solution with a left join on df1 could be much simpler, except that you have to write multiple coalesce expressions.

Spark: merge two DataFrames; if an ID is duplicated across the two DataFrames, the row in df1 overrides the row in df2
