Question

I have two DataFrames that I want to merge on the "Id" column.

df1:

Id   Reputation
 1     10
 3     5
 4     40

df2:

Id   Reputation
 1     10
 2     5
 3     5
 6     55

I want the output to be:

dfOutput:

Id    Reputation
1       10
2       5
3       5
4       40
6       55

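I want to keep all the values from both DataFrames, but collapse the repeated values into one. I know I have to use the merge() function, but I don't know which arguments to pass.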

Answer 1

You can concatenate the DataFrames, group by Id, and then aggregate by taking the first item in each group.

In [62]: pd.concat([df1,df2]).groupby('Id').first()
Out[62]: 
    Reputation
Id            
1           10
2            5
3            5
4           40
6           55

[5 rows x 1 columns]

Or, to keep Id as a column rather than as the index, use as_index=False:

In [68]: pd.concat([df1,df2]).groupby('Id', as_index=False).first()
Out[68]: 
   Id  Reputation
0   1          10
1   2           5
2   3           5
3   4          40
4   6          55

[5 rows x 2 columns]
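
For anyone who wants to run the concat/groupby approach outside an IPython session, here is a minimal self-contained sketch. The frames are rebuilt from the tables in the question; the drop_duplicates variant is an alternative I am adding for comparison, not part of the original answer.

```python
import pandas as pd

# Sample frames rebuilt from the tables in the question.
df1 = pd.DataFrame({"Id": [1, 3, 4], "Reputation": [10, 5, 40]})
df2 = pd.DataFrame({"Id": [1, 2, 3, 6], "Reputation": [10, 5, 5, 55]})

# Stack the rows of both frames. df1's rows come first, so df1 wins
# whenever both frames contain the same Id.
stacked = pd.concat([df1, df2], ignore_index=True)

# groupby sorts by Id by default; first() takes the first non-null
# value per column within each group.
by_group = stacked.groupby("Id", as_index=False).first()

# Alternative (assumption, not from the answer): keep the first row
# seen for each Id, then sort to match the groupby output.
by_dedup = (
    stacked.drop_duplicates(subset="Id", keep="first")
    .sort_values("Id")
    .reset_index(drop=True)
)

print(by_group)
print(by_dedup)
```

Both variants should give the same five rows for this data. They can differ when a column contains NaN, because first() skips nulls while drop_duplicates keeps the whole first row.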

KarlD. suggests a good idea, using combine_first:

In [99]: df1.set_index('Id').combine_first(df2.set_index('Id')).reset_index()
Out[99]: 
   Id  Reputation
0   1          10
1   2           5
2   3           5
3   4          40
4   6          55

[5 rows x 2 columns]
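
It may help to see which frame wins when the two disagree. The sketch below reuses the question's data but changes the Reputation for Id 3 in df2 to 99; that change is a hypothetical variation for illustration only. The cast back to int is a precaution, since the result dtype of combine_first can vary between pandas versions.

```python
import pandas as pd

df1 = pd.DataFrame({"Id": [1, 3, 4], "Reputation": [10, 5, 40]})
# Hypothetical variation: Id 3 carries a different value here (99)
# than in df1 (5), to show which side takes precedence.
df2 = pd.DataFrame({"Id": [1, 2, 3, 6], "Reputation": [10, 5, 99, 55]})

left = df1.set_index("Id")
right = df2.set_index("Id")

# combine_first keeps the caller's (left) values and only fills in
# entries that are missing from it with values from the other frame.
combined = left.combine_first(right).reset_index()

# Precaution: the column may come back as float on some pandas
# versions; there are no NaNs left, so casting to int is safe.
combined["Reputation"] = combined["Reputation"].astype(int)

print(combined)
```

Here Id 3 keeps the value 5 from df1, while Ids 2 and 6 are filled in from df2.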

For large DataFrames, the combine_first solution appears to be faster:

import pandas as pd
import numpy as np

N = 10**6
df1 = pd.DataFrame({'Id':np.arange(N), 'Reputation': np.random.randint(5, size=N)})
df2 = pd.DataFrame({'Id':np.arange(10, 10+N), 'Reputation':np.random.randint(5, size=N)})

In [95]: %timeit df1.set_index('Id').combine_first(df2.set_index('Id')).reset_index()
10 loops, best of 3: 174 ms per loop

In [96]: %timeit pd.concat([df1,df2]).groupby('Id', as_index=False).first()
1 loops, best of 3: 221 ms per loop
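
Since the question asks specifically about merge(), here is a sketch of how an outer merge could produce the same result. This is not part of the answer above. It assumes that rows sharing an Id also share the same Reputation in both frames, as they do in the question's data.

```python
import pandas as pd

df1 = pd.DataFrame({"Id": [1, 3, 4], "Reputation": [10, 5, 40]})
df2 = pd.DataFrame({"Id": [1, 2, 3, 6], "Reputation": [10, 5, 5, 55]})

# Outer merge on BOTH columns: rows that are identical in the two
# frames match each other and appear once; rows found in only one
# frame are kept. Sorting is explicit so the row order does not
# depend on the pandas version.
merged = (
    pd.merge(df1, df2, on=["Id", "Reputation"], how="outer")
    .sort_values("Id")
    .reset_index(drop=True)
)

print(merged)
```

If an Id had different Reputation values in the two frames, this merge would keep both rows, so the concat/groupby or combine_first approaches are safer when that can happen.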
