Pandas: computing overlap with pivot_table or crosstab

Date: 2016-02-19 13:19:09

Tags: python pandas pivot-table

I am trying to compute overlaps between some of the data in a DataFrame. Here is a simple example:

import pandas as pd

df = pd.DataFrame({
    'player': ['A', 'B', 'C', 'D', 'A', 'C', 'B'],
    'game': ['gameA', 'gameB', 'gameC', 'gameC', 'gameB', 'gameD', 'gameA']})

df:

    game player
0  gameA      A
1  gameB      B
2  gameC      C
3  gameC      D
4  gameB      A
5  gameD      C
6  gameA      B

What I want to do is count, for every pair of games, the number of players who played in both games.

For example, the result should look like this:

   game1 game2   overlap
  gameA  gameB        2 # because 2 players (A and B) played in both gameA and gameB
  gameA  gameC        0
  gameA  gameD        0
  gameB  gameA        2         
  gameB  gameC        0
  gameB  gameD        0          
  ...

I could do this with a dictionary and a for-each loop, but is there a simple way to do it with pivot_table or crosstab?

Thanks a lot.

1 Answer:

Answer 0: (score: 0)

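You can use pd.merge to build game_table by merging df with itself on player, so that every game a player took part in is paired with every game of that same player: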

game_table = pd.merge(df, df, how='left', on=['player'])
#    game_x player game_y
# 0   gameA      A  gameA
# 1   gameA      A  gameB
# 2   gameB      B  gameB
# 3   gameB      B  gameA
# 4   gameC      C  gameC
# 5   gameC      C  gameD
# 6   gameC      D  gameC
# 7   gameB      A  gameA
# 8   gameB      A  gameB
# 9   gameD      C  gameC
# 10  gameD      C  gameD
# 11  gameA      B  gameB
# 12  gameA      B  gameA
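
As a quick sanity check on the self-merge: a player who appears in n rows of df contributes n * n rows to game_table, which is why 7 input rows become 13 here. A minimal sketch, assuming the same df as in the question (games_per_player and expected_rows are illustrative names, not part of the answer):

```python
import pandas as pd

df = pd.DataFrame({
    'player': ['A', 'B', 'C', 'D', 'A', 'C', 'B'],
    'game': ['gameA', 'gameB', 'gameC', 'gameC', 'gameB', 'gameD', 'gameA']})

game_table = pd.merge(df, df, how='left', on=['player'])

# Number of rows each player has in df (A, B, C: 2 each; D: 1).
games_per_player = df.groupby('player').size()

# Each player with n rows is paired with itself n * n times by the merge.
expected_rows = int((games_per_player ** 2).sum())

print(len(game_table), expected_rows)
```

Both numbers should agree, which confirms that the merge only pairs rows belonging to the same player.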

Then apply pd.crosstab to game_table to count how often each (game_x, game_y) pair occurs:

freq = pd.crosstab(game_table['game_x'], game_table['game_y'])
# game_y  gameA  gameB  gameC  gameD
# game_x                            
# gameA       2      2      0      0
# gameB       2      2      0      0
# gameC       0      0      2      1
# gameD       0      0      1      1
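
The same co-occurrence table can also be reached without the merge, by crossing players against games and multiplying that table by its own transpose. This is a sketch of an alternative route, not the method used in this answer; membership and freq_alt are names made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'player': ['A', 'B', 'C', 'D', 'A', 'C', 'B'],
    'game': ['gameA', 'gameB', 'gameC', 'gameC', 'gameB', 'gameD', 'gameA']})

# Rows are players, columns are games; each cell counts appearances.
membership = pd.crosstab(df['player'], df['game'])

# (games x players) dot (players x games) gives, for each pair of games,
# the sum over players of the product of their appearance counts.
freq_alt = membership.T.dot(membership)

# Give the two axes distinct names so the table can be stacked and
# reset_index'ed later without a name clash.
freq_alt = freq_alt.rename_axis(index='game_x', columns='game_y')

print(freq_alt)
```

For this data the values match the crosstab of game_table shown above. The diagonal holds the number of player rows per game, and the off-diagonal cells hold the overlaps.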

stack followed by reset_index reshapes the DataFrame into the desired long format:

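result = freq.stack().reset_index()

Putting it all together, and also naming the count column and dropping the rows that pair a game with itself:
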
import pandas as pd
df = pd.DataFrame(
    {'player':['A', 'B', 'C', 'D', 'A', 'C', 'B'], 
     'game':['gameA', 'gameB', 'gameC', 'gameC', 'gameB', 'gameD', 'gameA']})

game_table = pd.merge(df, df, how='left', on=['player'])
freq = pd.crosstab(game_table['game_x'], game_table['game_y'])
result = freq.stack()
result.name = 'overlap'
result = result.reset_index()
mask = (result['game_x'] != result['game_y'])
result = result.loc[mask]
print(result)

yields

   game_x game_y  overlap
1   gameA  gameB        2  # Because both A and B played in gameA and gameB
2   gameA  gameC        0
3   gameA  gameD        0
4   gameB  gameA        2
6   gameB  gameC        0
7   gameB  gameD        0
8   gameC  gameA        0
9   gameC  gameB        0
11  gameC  gameD        1
12  gameD  gameA        0
13  gameD  gameB        0
14  gameD  gameC        1
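
Two variations may be useful in practice. If the same (player, game) row can occur more than once, the counts above would be inflated, and if each pair of games should be listed only once, half of the rows are redundant. The following is a minimal sketch under those assumptions; game_overlap and its unique_pairs flag are hypothetical names, not part of pandas or of the answer above:

```python
import pandas as pd

def game_overlap(df, unique_pairs=False):
    """Count players shared by each pair of distinct games (illustrative helper)."""
    # Assumption: a repeated (player, game) row should count only once.
    pairs = df[['player', 'game']].drop_duplicates()
    game_table = pd.merge(pairs, pairs, how='left', on=['player'])
    freq = pd.crosstab(game_table['game_x'], game_table['game_y'])
    result = freq.stack()
    result.name = 'overlap'
    result = result.reset_index()
    if unique_pairs:
        # Keep each unordered pair once, e.g. (gameA, gameB) but not (gameB, gameA).
        mask = result['game_x'] < result['game_y']
    else:
        # Same filter as in the answer: only drop a game paired with itself.
        mask = result['game_x'] != result['game_y']
    return result.loc[mask].reset_index(drop=True)

df = pd.DataFrame({
    'player': ['A', 'B', 'C', 'D', 'A', 'C', 'B'],
    'game': ['gameA', 'gameB', 'gameC', 'gameC', 'gameB', 'gameD', 'gameA']})

res = game_overlap(df, unique_pairs=True)
print(res)
```

With unique_pairs=True the example data gives 6 rows instead of 12, one per unordered pair of games.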