I have two DataFrames in pandas, defined below. EmpID is the primary key in both.
I want to join these two DataFrames on EmpID.
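I am using the code below to do this.
import numpy as np
import pandas as pd

# np.nan marks missing values (the np.NaN alias was removed in NumPy 2.0)
df_first = pd.DataFrame(
    [[1, 'A', 1000], [2, 'B', np.nan], [3, np.nan, 3000], [4, 'D', 8000], [5, 'E', 6000]],
    columns=['EmpID', 'Name', 'Salary'])
df_second = pd.DataFrame(
    [[1, 'A', 'HR', 'Delhi'], [8, 'B', 'Admin', 'Mumbai'], [3, 'C', 'Finance', np.nan],
     [9, 'D', 'Ops', 'Banglore'], [5, 'E', 'Programming', np.nan], [10, 'K', 'Analytics', 'Mumbai']],
    columns=['EmpID', 'Name', 'Department', 'Location'])

# outer join on the key column
merged_df = pd.merge(df_first, df_second, how='outer', on=['EmpID'])
But this code gives me duplicate columns that I don't want, so I merged using only the columns that are unique to each table.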
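Now I no longer get duplicate columns, but I also don't get values in the rows where the key matches.
I would really appreciate it if someone could help me.
Regards, Kailash Negi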
Answer 0 (score: 1)
It seems you need combine_first together with set_index, so that the two DataFrames are matched on the index created from the EmpID column:
df = df_first.set_index('EmpID').combine_first(df_second.set_index('EmpID')).reset_index()
print(df)
   EmpID   Department  Location  Name  Salary
0      1           HR     Delhi     A  1000.0
1      2          NaN       NaN     B     NaN
2      3      Finance       NaN     C  3000.0
3      4          NaN       NaN     D  8000.0
4      5  Programming       NaN     E  6000.0
5      8        Admin    Mumbai     B     NaN
6      9          Ops  Banglore     D     NaN
7     10    Analytics    Mumbai     K     NaN
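To see why this gives the desired result, here is a minimal sketch on two toy frames (the names left and right and their values are made up for illustration, with the index standing in for EmpID). As far as I understand combine_first, it aligns both frames on the index, keeps the caller's non-null values, fills the caller's NaN from the other frame, and adds rows and columns that exist only in the other frame.

```python
import numpy as np
import pandas as pd

# Toy frames with hypothetical data; the index plays the role of EmpID
left = pd.DataFrame({'Name': ['A', np.nan], 'Salary': [1000.0, 3000.0]},
                    index=[1, 3])
right = pd.DataFrame({'Name': ['A', 'C', 'K'],
                      'Location': ['Delhi', np.nan, 'Mumbai']},
                     index=[1, 3, 10])

# Values from left win; gaps in left are filled from right;
# key 10 and the Location column come only from right
out = left.combine_first(right)
print(out)
```

Here the missing Name for key 3 is taken from right, while key 10 is appended with NaN in Salary because left has no such row.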
Edit:
For a specific column order, use reindex:
# concatenate all column names together and remove duplicates
ColNames = pd.Index(np.concatenate([df_second.columns, df_first.columns])).drop_duplicates()
print(ColNames)
Index(['EmpID', 'Name', 'Department', 'Location', 'Salary'], dtype='object')
df = (df_first.set_index('EmpID')
              .combine_first(df_second.set_index('EmpID'))
              .reset_index()
              .reindex(columns=ColNames))
print(df)
   EmpID  Name   Department  Location  Salary
0      1     A           HR     Delhi  1000.0
1      2     B          NaN       NaN     NaN
2      3     C      Finance       NaN  3000.0
3      4     D          NaN       NaN  8000.0
4      5     E  Programming       NaN  6000.0
5      8     B        Admin    Mumbai     NaN
6      9     D          Ops  Banglore     NaN
7     10     K    Analytics    Mumbai     NaN
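If you need this in more than one place, the steps can be wrapped in a small function. This is only a sketch: combine_on_key is a hypothetical helper name, not a pandas API, and it assumes the key column is unique within each frame. The column order follows the one used above (columns of the second frame first, then the remaining columns of the first).

```python
import numpy as np
import pandas as pd

def combine_on_key(first, second, key):
    """Hypothetical helper: fill gaps in `first` from `second`, matching rows on `key`."""
    # Column order: second's columns, then any columns found only in first
    cols = second.columns.append(first.columns).drop_duplicates()
    return (first.set_index(key)
                 .combine_first(second.set_index(key))
                 .reset_index()
                 .reindex(columns=cols))

# The frames from the question
df_first = pd.DataFrame(
    [[1, 'A', 1000], [2, 'B', np.nan], [3, np.nan, 3000], [4, 'D', 8000], [5, 'E', 6000]],
    columns=['EmpID', 'Name', 'Salary'])
df_second = pd.DataFrame(
    [[1, 'A', 'HR', 'Delhi'], [8, 'B', 'Admin', 'Mumbai'], [3, 'C', 'Finance', np.nan],
     [9, 'D', 'Ops', 'Banglore'], [5, 'E', 'Programming', np.nan], [10, 'K', 'Analytics', 'Mumbai']],
    columns=['EmpID', 'Name', 'Department', 'Location'])

result = combine_on_key(df_first, df_second, 'EmpID')
print(result)
```

Swapping the first two arguments changes which frame takes priority wherever both have a non-null value for the same key and column.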