Question

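I need to merge two DataFrames without creating duplicate columns. The first DataFrame (dfa) has missing values. The second DataFrame (dfb) has unique values. This would be the same as a VLOOKUP in Excel.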

dfa looks like this:

postcode  lat  lon ...plus 32 more columns
M20       2.3  0.2
LS1       NaN  NaN
LS1       NaN  NaN
LS2       NaN  NaN
M21       2.4  0.3

dfb contains only unique postcodes, with lat and lon values for the rows where they are NaN in dfa. It looks like this:

postcode  lat  lon 
LS1       1.4  0.1
LS2       1.5  0.2

My desired output is:

postcode  lat  lon ...plus 32 more columns
M20       2.3  0.2
LS1       1.4  0.1
LS1       1.4  0.1
LS2       1.5  0.2
M21       2.4  0.3

I tried using pd.merge like this:

outputdf = pd.merge(dfa, dfb, on='postcode', how='left')

This results in duplicate columns being created:

postcode  lat_x  lon_x  lat_y  lon_y ...plus 32 more columns
M20       2.3    0.2    NaN    NaN
LS1       NaN    NaN    1.4    0.1
LS1       NaN    NaN    1.4    0.1
LS2       NaN    NaN    1.5    0.2
M21       2.4    0.3    NaN    NaN

Following this answer, I tried:

output = dfa
for df in [dfa, dfb]:
    output.update(df.set_index('postcode'))

But I received "ValueError: cannot reindex from a duplicate axis".

Also based on the answer above, this did not work:

output.merge(pd.concat([dfa, dfb]), how='left')

There are no duplicate columns, but the values in 'lat' and 'lon' are still blank.

Is there a way to merge on 'postcode' without creating duplicate columns, effectively performing a VLOOKUP using pandas?

Answer 1

Use DataFrame.combine_first with postcode set as the index in both DataFrames, then, if necessary, add DataFrame.reindex to restore the same column order as the original df1:

print (df1)
  postcode  lat  lon  plus  32  more  columns
0      M20  2.3  0.2   NaN NaN   NaN      NaN
1      LS1  NaN  NaN   NaN NaN   NaN      NaN
2      LS1  NaN  NaN   NaN NaN   NaN      NaN
3      LS2  NaN  NaN   NaN NaN   NaN      NaN
4      M21  2.4  0.3   NaN NaN   NaN      NaN

df1 = df1.set_index('postcode')
df2 = df2.set_index('postcode')

df3 = df1.combine_first(df2).reindex(df1.columns, axis=1)
print (df3)
          lat  lon  plus  32  more  columns
postcode                                   
LS1       1.4  0.1   NaN NaN   NaN      NaN
LS1       1.4  0.1   NaN NaN   NaN      NaN
LS2       1.5  0.2   NaN NaN   NaN      NaN
M20       2.3  0.2   NaN NaN   NaN      NaN
M21       2.4  0.3   NaN NaN   NaN      NaN
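
For anyone who wants to try this end to end, here is a minimal, self-contained sketch of the same approach. The sample frames are reconstructed from the question with the 32 extra columns left out, and the names left, right and filled are only illustrative. Note that combine_first aligns on the index, so the rows come back sorted by postcode rather than in the original order.

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the question (extra columns omitted).
df1 = pd.DataFrame({
    "postcode": ["M20", "LS1", "LS1", "LS2", "M21"],
    "lat": [2.3, np.nan, np.nan, np.nan, 2.4],
    "lon": [0.2, np.nan, np.nan, np.nan, 0.3],
})
df2 = pd.DataFrame({
    "postcode": ["LS1", "LS2"],
    "lat": [1.4, 1.5],
    "lon": [0.1, 0.2],
})

# Index both frames by the lookup key so combine_first can align on it.
left = df1.set_index("postcode")
right = df2.set_index("postcode")

# Keep values from `left` where present, fill NaN from `right`,
# then restore the column order of the original frame.
filled = left.combine_first(right).reindex(left.columns, axis=1)
print(filled)
```

If postcode is needed as a regular column again, filled.reset_index() brings it back.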

Answer 2

DataFrame.combine_first(self, other) seems to be the best solution.

If you just need a one-liner and don't want to modify the input DataFrames:

df1.set_index('postcode').combine_first(df2.set_index('postcode'))

And if you need to keep the index from df1:

df1.reset_index().set_index('postcode').combine_first(df2.set_index('postcode')).reset_index().set_index('index').sort_index()

It's not elegant, but it works.
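
As a possible alternative that keeps df1's row order and index without the reset_index round trip, the lookup can also be written with Series.map and fillna. This is not taken from either answer; it is a sketch that assumes the postcodes in df2 are unique (as the question states), and the names lookup and result are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the question (extra columns omitted).
df1 = pd.DataFrame({
    "postcode": ["M20", "LS1", "LS1", "LS2", "M21"],
    "lat": [2.3, np.nan, np.nan, np.nan, 2.4],
    "lon": [0.2, np.nan, np.nan, np.nan, 0.3],
})
df2 = pd.DataFrame({
    "postcode": ["LS1", "LS2"],
    "lat": [1.4, 1.5],
    "lon": [0.1, 0.2],
})

# Lookup table keyed by postcode; assumes each postcode appears once in df2.
lookup = df2.set_index("postcode")

result = df1.copy()  # leave the input frame untouched
for col in ["lat", "lon"]:
    # Map each row's postcode to the lookup value, and use it only
    # where the existing value is NaN.
    result[col] = result[col].fillna(result["postcode"].map(lookup[col]))

print(result)
```

Because nothing is re-indexed, the rows stay in their original order, which is closer to how VLOOKUP behaves in a spreadsheet.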

Pandas merge without duplicating columns
