Question

I want to merge two DataFrames that look like this:

In[14]: test1=pd.DataFrame({'col1':[1,2,3,
                                    6,4,5],
                            'col2':['First','Second','Third',
                                    'Sixth','Fourth','Fifth']})
test1
Out[14]:

   col1    col2
0     1   First
1     2  Second
2     3   Third
3     6   Sixth
4     4  Fourth
5     5   Fifth

and

In[15]: test2=pd.DataFrame({'col1':[1,7,2,
                                    3,4,5],
                            'col2':['First','Seventh','Second',
                                    'Third','Fourth','Fifth']})
test2
Out[15]: 

   col1     col2
0     1    First
1     7  Seventh
2     2   Second
3     3    Third
4     4   Fourth
5     5    Fifth

As you may notice, these DataFrames are almost identical, but each has one extra row that the other lacks (row 3, `6 Sixth`, in test1 and row 1, `7 Seventh`, in test2).

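I would like to merge these DataFrames so that any extra row from one DataFrame is inserted into the other as close to its original position as possible. This is the result I am hoping for: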

   col1     col2
0     1    First
1     7  Seventh
2     2   Second
3     3    Third
4     6    Sixth
5     4   Fourth
6     5    Fifth

I tried using

In[16]: pd.merge(test1, test2, how='outer', sort=False)

which outputs

Out[16]: 

   col1     col2
0     1    First
1     2   Second
2     3    Third
3     6    Sixth
4     4   Fourth
5     5    Fifth
6     7  Seventh

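As you can see, the second row of test2 now sits at the bottom. Calling pd.merge(test2, test1, how='outer', sort=False) gives a similar result, except that the extra row from test1 ends up at the bottom. Preserving the order of the entries in both DataFrames is crucial for me, so this is not what I want.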

I also tried update(), combine_first() and replace(), but they give inner or left joins.

How can I get pandas to do what I want?

Answer 1

You can use concat, followed by drop_duplicates and sort_index:

df = pd.concat([test2, test1]).drop_duplicates().sort_index()

The resulting output:

   col1     col2
0     1    First
1     7  Seventh
2     2   Second
3     3    Third
3     6    Sixth
4     4   Fourth
5     5    Fifth

If you want the index of the new DataFrame to be unique, call reset_index at the end:

df = pd.concat([test2, test1]).drop_duplicates().sort_index().reset_index(drop=True)

This gives a unique index, 0 through 6, with the rows in the same order as above.
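One caveat: after the concat, index label 3 appears twice, and sort_index defaults to kind='quicksort', which is not documented as a stable sort. The following is a minimal sketch, assuming a pandas version that accepts kind='stable' in sort_index, that makes the order of the tied rows explicit. The variable name combined is only illustrative.

```python
import pandas as pd

test1 = pd.DataFrame({'col1': [1, 2, 3, 6, 4, 5],
                      'col2': ['First', 'Second', 'Third', 'Sixth', 'Fourth', 'Fifth']})
test2 = pd.DataFrame({'col1': [1, 7, 2, 3, 4, 5],
                      'col2': ['First', 'Seventh', 'Second', 'Third', 'Fourth', 'Fifth']})

# test2 goes first, so drop_duplicates (which keeps the first copy)
# retains test2's rows and only adds the rows unique to test1.
combined = pd.concat([test2, test1]).drop_duplicates()

# A stable sort keeps rows that share an index label in their concat order,
# so the test2 row at label 3 stays ahead of the test1 row at label 3.
df = combined.sort_index(kind='stable').reset_index(drop=True)
print(df)
```

Note that this approach places a row by its index label only, so it works best when the two frames differ by a few rows, as in the question.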

Answer 2

You can simply create a fake index column in each DataFrame and sort the merged DataFrame by that index:

test1['index_fake'] = test1.index
test2['index_fake'] = test2.index

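# Merge on the data columns only. Without `on`, index_fake would also be a join key,
# and rows that sit at different positions in the two frames would not match.
full_df = pd.merge(test1, test2, on=['col1', 'col2'], how='outer', sort=False, suffixes=('_1', '_2'))

# Use the position from test2 where available, otherwise the position from test1.
full_df['index_fake'] = full_df['index_fake_2'].fillna(full_df['index_fake_1'])
full_df = full_df.sort_values(by='index_fake', kind='stable')[['col1', 'col2']].reset_index(drop=True)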
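The fake-index idea can also be wrapped in a small helper that leaves the input frames untouched. This is only a sketch under stated assumptions: the function name merge_keep_order and the temporary columns pos_a, pos_b and pos are hypothetical, and each combination of key values is assumed to occur at most once per frame.

```python
import pandas as pd

def merge_keep_order(first, second, keys):
    """Outer-merge two frames on `keys`, ordering rows by original position."""
    # Record each row's position in a temporary column (inputs are not modified).
    a = first[keys].assign(pos_a=range(len(first)))
    b = second[keys].assign(pos_b=range(len(second)))
    merged = pd.merge(a, b, on=keys, how='outer', sort=False)
    # Prefer the position in `second`; rows found only in `first` keep their own.
    merged['pos'] = merged['pos_b'].fillna(merged['pos_a'])
    # Break ties on `pos` by the position in `first`, so the result does not
    # depend on the row order that merge happens to return.
    merged = merged.sort_values(['pos', 'pos_a'])
    return merged[keys].reset_index(drop=True)

test1 = pd.DataFrame({'col1': [1, 2, 3, 6, 4, 5],
                      'col2': ['First', 'Second', 'Third', 'Sixth', 'Fourth', 'Fifth']})
test2 = pd.DataFrame({'col1': [1, 7, 2, 3, 4, 5],
                      'col2': ['First', 'Seventh', 'Second', 'Third', 'Fourth', 'Fifth']})

result = merge_keep_order(test1, test2, ['col1', 'col2'])
print(result)
```

Merging on the data columns only matters here: if the position column were part of the join key, a row such as `2 Second`, which sits at position 1 in test1 and position 2 in test2, would appear twice in the result.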

Answer 3

How about changing the column name in test2?

test2=pd.DataFrame({'col1':[1,7,2,3,4,5],
                    'col2a':['First','Seventh','Second',
                    'Third','Fourth','Fifth']})

Then perform the merge you showed in the question

test3 = pd.merge(test1, test2, how='outer', sort=False)

But now you can fill in the missing data and drop the extra column

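# Assign the result back instead of calling fillna(inplace=True) on test3.col2,
# which may operate on a temporary copy and leave test3 unchanged.
test3['col2'] = test3['col2'].fillna(test3['col2a'])
test3 = test3.drop(columns='col2a')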

Here is the result

   col1     col2
0   1.0    First
1   2.0   Second
2   3.0    Third
3   6.0    Sixth
4   4.0   Fourth
5   5.0    Fifth
6   7.0  Seventh

Pandas outer merge of two versions of the same DataFrame
