I have two pandas DataFrames:
import pandas as pd
df1 = pd.DataFrame({'user_id':['0','0','1','1','2','3','3'],
                    'friend_id':['1','2','3','2','4','4','5'],
                    'date_sent':['01-01-2020','01-01-2020','01-02-2020','01-03-2020','01-02-2020','01-03-2020','01-02-2020'],
                    'date_accepted':['01-01-2020','01-01-2020','01-02-2020',None,'01-10-2020',None,'01-21-2020']})
df2 = pd.DataFrame({'user_id':['1','1','2','2','3','3'],
                    'page_liked':['A','B','A','C','B','D']})
grouped1 = df1.groupby(['user_id','friend_id']).count()
grouped2 = df2.groupby(['user_id','page_liked']).count()
print(grouped1)
output >>>
                   date_sent  date_accepted
user_id friend_id
0       1                  1              1
        2                  1              1
1       2                  1              0
        3                  1              1
2       4                  1              1
3       4                  1              0
        5                  1              1
grouped2
output >>>
user_id page_liked
1       A
        B
2       A
        C
3       B
        D
I am trying to merge grouped1.friend_id with grouped2.user_id. The goal is to get the following table:
user_id friend_id page_liked
0       1         A
                  B
        2         A
                  C
1       2         A
                  C
        3         B
                  D
2       4         NaN
3       4         NaN
        5         NaN
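Because the index is multi-level, I tried several approaches with merge, but had no luck. I also tried grouped1.combine_first(grouped2), but that seems to work only when the index levels are the same, so now I am stuck.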
Answer 0 (score: 0)
See the comments in the code for the key steps: use reset_index(), rename the column, and then do another groupby.
import pandas as pd
df1 = pd.DataFrame({'user_id':['0','0','1','1','2','3','3'],
'friend_id':['1','2','3','2','4','4','5'],
'date_sent':['01-01-2020','01-01-2020','01-02-2020','01-03-2020','01-02-2020','01-03-2020','01-02-2020'],
'date_accepted':['01-01-2020','01-01-2020','01-02-2020',None,'01-10-2020',None,'01-21-2020']})
df2 = pd.DataFrame({'user_id':['1','1','2','2','3','3'],
'page_liked':['A','B','A','C','B','D']})
#Use reset_index() to change indexes to columns and for group 2 rename the column to match the column you want to merge with
grouped1 = df1.groupby(['user_id','friend_id']).count().reset_index()
grouped2 = df2.groupby(['user_id','page_liked']).count().reset_index().rename(columns={'user_id':'friend_id'})
#merge, drop the unnecessary columns, then set the index (or do another groupby) if you want to re-index.
grouped3 = (pd.merge(grouped1, grouped2, how='left', on=['friend_id'])
              .drop(['date_sent', 'date_accepted'], axis=1)
              .set_index(['user_id', 'friend_id']))
grouped3
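If the counts from the first groupby are not needed, the same table can probably be built straight from the raw frames. The following is only a sketch under that assumption: the names pairs, likes and result are made up for illustration, and the date columns are left out because they play no part in the merge.

```python
import pandas as pd

df1 = pd.DataFrame({'user_id': ['0', '0', '1', '1', '2', '3', '3'],
                    'friend_id': ['1', '2', '3', '2', '4', '4', '5']})
df2 = pd.DataFrame({'user_id': ['1', '1', '2', '2', '3', '3'],
                    'page_liked': ['A', 'B', 'A', 'C', 'B', 'D']})

# One row per (user_id, friend_id) pair; no groupby needed for this.
pairs = df1[['user_id', 'friend_id']].drop_duplicates()

# Rename the key in df2 so that it lines up with friend_id in pairs.
likes = df2.rename(columns={'user_id': 'friend_id'})

# A left merge keeps friends that liked no page; their page_liked is NaN.
result = (pairs.merge(likes, how='left', on='friend_id')
               .sort_values(['user_id', 'friend_id', 'page_liked'])
               .set_index(['user_id', 'friend_id']))
print(result)
```

Friends 4 and 5 do not appear in df2, so their rows survive the left merge with a missing page_liked, which is what the target table asks for.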
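Answer 1 (score: 0)
Use join. It supports joining MultiIndex DataFrames on their MultiIndex. You need to change the index level names of grouped2 to match the index level names of grouped1. Since you are matching on a single index level, you only need to rename one level. So on grouped2, rename the level user_id to friend_id. Finally, join, swap the index levels, then reset_index and slice: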
df_out = grouped1.join(grouped2.rename_axis(['friend_id', 'page_liked']),
how='left').swaplevel(0,1).reset_index(level=-1)[['page_liked']]
Out[82]:
                  page_liked
user_id friend_id
0       1                  A
        1                  B
        2                  A
        2                  C
1       2                  A
        2                  C
        3                  B
        3                  D
2       4                NaN
3       4                NaN
        5                NaN
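The positional swaplevel(0,1) and reset_index(level=-1) above rely on the order in which join returns the index levels, and that order may differ between pandas versions. A variant that refers to the levels by name should not depend on it. This is only a sketch: the names joined and df_out_by_name are made up for illustration, and the final sort_index() is an extra step added here to get a stable row order.

```python
import pandas as pd

df1 = pd.DataFrame({'user_id': ['0', '0', '1', '1', '2', '3', '3'],
                    'friend_id': ['1', '2', '3', '2', '4', '4', '5'],
                    'date_sent': ['01-01-2020', '01-01-2020', '01-02-2020', '01-03-2020',
                                  '01-02-2020', '01-03-2020', '01-02-2020'],
                    'date_accepted': ['01-01-2020', '01-01-2020', '01-02-2020', None,
                                      '01-10-2020', None, '01-21-2020']})
df2 = pd.DataFrame({'user_id': ['1', '1', '2', '2', '3', '3'],
                    'page_liked': ['A', 'B', 'A', 'C', 'B', 'D']})

grouped1 = df1.groupby(['user_id', 'friend_id']).count()
grouped2 = df2.groupby(['user_id', 'page_liked']).count()

# Rename the first level of grouped2 so the two indexes share the name friend_id.
joined = grouped1.join(grouped2.rename_axis(['friend_id', 'page_liked']), how='left')
print(joined.index.names)  # inspect the level order your pandas version produces

# Address levels by name instead of by position.
df_out_by_name = (joined.reset_index('page_liked')[['page_liked']]
                        .reorder_levels(['user_id', 'friend_id'])
                        .sort_index())
print(df_out_by_name)
```

Printing joined.index.names first is a cheap way to see whether the positional version would still put user_id in front on a given installation.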