Merging multiple DataFrames where some rows differ

Time: 2018-03-16 11:04:07

Tags: python pandas dataframe jupyter-notebook

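So I have 5 dataframes of the top 80 players from FIFA 13-17, each containing the player's name, overall rating and club. My end goal is to merge all of these datasets together so that I get one rating per player per year, or a null value where there is none. Obviously, some players are not in the top 80 every year, e.g. because they retired. Below are snippets of two of the dataframes.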

FIFA18

                 Name  Overall               Club
0   Cristiano Ronaldo       94     Real Madrid CF
1            L. Messi       93       FC Barcelona
2              Neymar       92       FC Barcelona
3           L. Suárez       92       FC Barcelona
4            M. Neuer       92   FC Bayern Munich
5              De Gea       90  Manchester United
6      R. Lewandowski       90   FC Bayern Munich
7          J. Boateng       90   FC Bayern Munich
8             G. Bale       90     Real Madrid CF
9      Z. Ibrahimović       90  Manchester United
10        T. Courtois       89            Chelsea

FIFA13

                 Name  Overall                 Club
0            L. Messi       94         FC Barcelona
1   Cristiano Ronaldo       92       Real Madrid CF
2           F. Ribéry       90     FC Bayern Munich
3                Xavi       90         FC Barcelona
4             Iniesta       90         FC Barcelona
5            N. Vidić       89    Manchester United
6           W. Rooney       89    Manchester United
7            Casillas       89       Real Madrid CF
8         David Silva       88      Manchester City
9              Falcao       88      Atlético Madrid
10     Z. Ibrahimović       88  Paris Saint-Germain

An example of why this happens would be N. Vidić, who has retired.

My target table is this:

                 Name  FIFA17  FIFA13               Club
0   Cristiano Ronaldo      94      92     Real Madrid CF
1            L. Messi      93      94       FC Barcelona
2              Neymar      92      83       FC Barcelona
3           L. Suárez      92      86       FC Barcelona
4            M. Neuer      92      87   FC Bayern Munich
5              De Gea      90      82  Manchester United
6      R. Lewandowski      90      80   FC Bayern Munich
7          J. Boateng      90      84   FC Bayern Munich
8             G. Bale      90      86     Real Madrid CF
9      Z. Ibrahimović      90      88  Manchester United
10        T. Courtois      89      83            Chelsea
11          F. Ribéry      86      90   FC Bayern Munich
12               Xavi       0      90       FC Barcelona
13            Iniesta      88      90       FC Barcelona
14           N. Vidić       0      89  Manchester United
15          W. Rooney       0      89  Manchester United
16           Casillas       0      89     Real Madrid CF
17        David Silva      87      88    Manchester City
18             Falcao       0      88    Atlético Madrid

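I'm new to Python and pandas, but I've tried using join and merge, and it always seems to use each table's index rather than the unique names.

Any help is much appreciated!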

2 Answers:

Answer 0: (score: 3)

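Here is one way via pd.concat and pivot_table. It assumes you can put your dataframes in a dictionary, which may be of arbitrary length.

The solution also handles players with multiple clubs, keeping only the most recent one.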

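import numpy as np
import pandas as pd

# Map each edition's year to its dataframe
dfs = {13: df13, 18: df18}

# Stack all editions, tagging each row with its year
df = pd.concat([dfs[k].assign(Year=k) for k in dfs])

# For each player, keep the club from the most recent edition
club_map = df.sort_values('Year', ascending=False)\
             .drop_duplicates('Name')\
             .set_index('Name')['Club']

df['Club'] = df['Name'].map(club_map)

# One row per player, one column per year; missing ratings become 0
res = df.pivot_table(index=['Name', 'Club'], columns='Year',
                     values='Overall', aggfunc=np.sum, fill_value=0)\
        .reset_index().rename_axis(None, axis='columns')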

Result

                 Name               Club  13  18
0            Casillas     Real Madrid CF  89   0
1   Cristiano Ronaldo     Real Madrid CF  92  94
2         David Silva    Manchester City  88   0
3              De Gea  Manchester United   0  90
4           F. Ribéry   FC Bayern Munich  90   0
5              Falcao    Atlético Madrid  88   0
6             G. Bale     Real Madrid CF   0  90
7             Iniesta       FC Barcelona  90   0
8          J. Boateng   FC Bayern Munich   0  90
9            L. Messi       FC Barcelona  94  93
10          L. Suárez       FC Barcelona   0  92
11           M. Neuer   FC Bayern Munich   0  92
12           N. Vidić  Manchester United  89   0
13             Neymar       FC Barcelona   0  92
14     R. Lewandowski   FC Bayern Munich   0  90
15        T. Courtois            Chelsea   0  89
16          W. Rooney  Manchester United  89   0
17               Xavi       FC Barcelona  90   0
18     Z. Ibrahimović  Manchester United  88  90
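
The question asks for a null value, rather than 0, where a player has no rating in a given edition. A minimal sketch of the same concat-then-pivot idea that leaves those gaps as NaN might look like the following. The two small frames are made-up stand-ins built from rows shown in the question, the FIFA prefix on the column names is my own choice, and pivot assumes each player appears at most once per edition.

```python
import pandas as pd

# Tiny stand-in frames (assumption); the real ones would hold 80 rows each.
df13 = pd.DataFrame({
    "Name": ["L. Messi", "N. Vidić", "Z. Ibrahimović"],
    "Overall": [94, 89, 88],
    "Club": ["FC Barcelona", "Manchester United", "Paris Saint-Germain"],
})
df18 = pd.DataFrame({
    "Name": ["L. Messi", "De Gea", "Z. Ibrahimović"],
    "Overall": [93, 90, 90],
    "Club": ["FC Barcelona", "Manchester United", "Manchester United"],
})

dfs = {13: df13, 18: df18}

# Stack the frames, tagging each row with its year.
stacked = pd.concat(
    [frame.assign(Year=year) for year, frame in dfs.items()],
    ignore_index=True,
)

# Most recent club per player.
club_map = (stacked.sort_values("Year", ascending=False)
                   .drop_duplicates("Name")
                   .set_index("Name")["Club"])

# One column per year; no fill value, so absent ratings stay NaN.
wide = stacked.pivot(index="Name", columns="Year", values="Overall")
wide = wide.add_prefix("FIFA")  # 13 -> "FIFA13", 18 -> "FIFA18"

res = wide.reset_index().rename_axis(None, axis="columns")
res.insert(1, "Club", res["Name"].map(club_map))
print(res)
```

Because of the NaN values the rating columns come out as floats. If 0 is preferred after all, calling fillna(0) on those columns and casting back to int recovers the layout shown above.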

Answer 1: (score: 2)

Use concat with set_index to build a MultiIndex from the Name and Club columns, then replace NaN with 0 using fillna, cast to integer, and finally convert the MultiIndex back to columns with reset_index:

# df1 holds the FIFA18 data, df2 the FIFA13 data
s1 = df1.drop_duplicates(['Name','Club']).set_index(['Name','Club'])['Overall']
s2 = df2.drop_duplicates(['Name','Club']).set_index(['Name','Club'])['Overall']

df = pd.concat([s2, s1], axis=1, keys=('FIFA13','FIFA18')).fillna(0).astype(int).reset_index()
print (df)

                 Name                 Club  FIFA13  FIFA18
0            Casillas       Real Madrid CF      89       0
1   Cristiano Ronaldo       Real Madrid CF      92      94
2         David Silva      Manchester City      88       0
3              De Gea    Manchester United       0      90
4           F. Ribéry     FC Bayern Munich      90       0
5              Falcao      Atlético Madrid      88       0
6             G. Bale       Real Madrid CF       0      90
7             Iniesta         FC Barcelona      90       0
8          J. Boateng     FC Bayern Munich       0      90
9            L. Messi         FC Barcelona      94      93
10          L. Suárez         FC Barcelona       0      92
11           M. Neuer     FC Bayern Munich       0      92
12           N. Vidić    Manchester United      89       0
13             Neymar         FC Barcelona       0      92
14     R. Lewandowski     FC Bayern Munich       0      90
15        T. Courtois              Chelsea       0      89
16          W. Rooney    Manchester United      89       0
17               Xavi         FC Barcelona      90       0
18     Z. Ibrahimović    Manchester United       0      90
19     Z. Ibrahimović  Paris Saint-Germain      88       0

If order matters, the solution is similar: get only the unique pairs of Name and Club, concatenate them and remove duplicates, then reindex by the result:

# df1 holds the FIFA18 data, df2 the FIFA13 data
s1 = df1.drop_duplicates(['Name','Club']).set_index(['Name','Club'])['Overall']
s2 = df2.drop_duplicates(['Name','Club']).set_index(['Name','Club'])['Overall']

df = pd.concat([s2, s1], axis=1, keys=('FIFA13','FIFA18')).fillna(0).astype(int)

# FIFA13 pairs first, then FIFA18, so keep='last' retains the most recent club
idx = pd.concat([df2[['Name','Club']], df1[['Name','Club']]]).drop_duplicates()
# Build an explicit MultiIndex from the (Name, Club) pairs before reindexing
df = df.reindex(pd.MultiIndex.from_frame(idx)).reset_index().drop_duplicates('Name', keep='last')
print (df)

                 Name               Club  FIFA13  FIFA18
0            L. Messi       FC Barcelona      94      93
1   Cristiano Ronaldo     Real Madrid CF      92      94
2           F. Ribéry   FC Bayern Munich      90       0
3                Xavi       FC Barcelona      90       0
4             Iniesta       FC Barcelona      90       0
5            N. Vidić  Manchester United      89       0
6           W. Rooney  Manchester United      89       0
7            Casillas     Real Madrid CF      89       0
8         David Silva    Manchester City      88       0
9              Falcao    Atlético Madrid      88       0
11             Neymar       FC Barcelona       0      92
12          L. Suárez       FC Barcelona       0      92
13           M. Neuer   FC Bayern Munich       0      92
14             De Gea  Manchester United       0      90
15     R. Lewandowski   FC Bayern Munich       0      90
16         J. Boateng   FC Bayern Munich       0      90
17            G. Bale     Real Madrid CF       0      90
18     Z. Ibrahimović  Manchester United       0      90
19        T. Courtois            Chelsea       0      89

For a general solution, use list comprehensions:

# Frames in chronological order, with a matching label for each
dfs = [df2, df1]
names = ['FIFA13','FIFA18']

s = [x.drop_duplicates(['Name','Club']).set_index(['Name','Club'])['Overall'] for x in dfs]
df = pd.concat(s, axis=1, keys=(names)).fillna(0).astype(int)

s1 = [x[['Name','Club']] for x in dfs]
idx = pd.concat(s1).drop_duplicates()
# Build an explicit MultiIndex from the (Name, Club) pairs before reindexing
df = df.reindex(pd.MultiIndex.from_frame(idx)).reset_index().drop_duplicates('Name', keep='last')
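
Since the question mentions that join and merge seemed to match on the index, here is a hedged sketch of an outer merge keyed explicitly on the Name column, folded over any number of frames with functools.reduce. The sample frames and the FIFA13/FIFA18 labels are illustrative stand-ins, Club is left out for brevity, and the approach assumes each name occurs only once per frame.

```python
from functools import reduce

import pandas as pd

# Tiny stand-in frames (assumption); the real ones would hold 80 rows each.
df13 = pd.DataFrame({
    "Name": ["L. Messi", "N. Vidić", "Z. Ibrahimović"],
    "Overall": [94, 89, 88],
    "Club": ["FC Barcelona", "Manchester United", "Paris Saint-Germain"],
})
df18 = pd.DataFrame({
    "Name": ["L. Messi", "De Gea", "Z. Ibrahimović"],
    "Overall": [93, 90, 90],
    "Club": ["FC Barcelona", "Manchester United", "Manchester United"],
})

frames = {"FIFA13": df13, "FIFA18": df18}

# Keep only Name plus the rating, renamed to the edition label.
ratings = [
    frame[["Name", "Overall"]].rename(columns={"Overall": label})
    for label, frame in frames.items()
]

# Outer-merge on Name so a player present in any edition is kept;
# editions where the player is absent are left as NaN.
merged = reduce(
    lambda left, right: left.merge(right, on="Name", how="outer"),
    ratings,
)
print(merged)
```

Passing on="Name" is what makes merge match on the player name rather than on the row index or on every shared column.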
