Create a matrix from two DataFrames - pandas?

Asked: 2019-05-06 08:05:19

Tags: python pandas numpy matrix pandas-groupby

I have two DataFrames. The first one has these columns:

   df1 =   
           ID   As        Hs        Ts
           A    A_1       A_6       A_7
           B    B_1  
           C    C_1                 C_10
           D    D_1  
           E    E_1,E_2   E_5       E_4
           F              F_1,F_4

The second one holds a score for each pair:

df2 = 
          ID1   1         ID2      2       SCORE
          A     A_1       B        B_1     1
          A     A_6       B        B_1     0.5
          A     A_7       B        B_1     0.3
          A     A_1       C        C_1     1
          A     A_6       C        C_1     0.4
          A     A_7       C        C_1     0.3
          A     A_1       C        C_10    0.3
          A     A_6       C        C_10    0.5
          A     A_7       C        C_10    0.3
          A     A_1       D        D_1     1
          A     A_6       D        D_1     0.2
          A     A_7       D        D_1     0.3
          A     A_1       E        E_1     1
          A     A_6       E        E_1     0.5
          A     A_7       E        E_1     0.4
          A     A_1       E        E_2     0.8
          A     A_6       E        E_2     0.2
          A     A_7       E        E_2     0.5
          A     A_1       E        E_5     0.3
          A     A_6       E        E_5     0.3
          A     A_7       E        E_5     0.6
          A     A_1       E        E_4     0.1
          A     A_6       E        E_4     0.4
          A     A_7       E        E_4     0.6
          A     A_1       F        F_1     0.3
          A     A_6       F        F_1     0.3
          A     A_7       F        F_1     0.6
          A     A_1       F        F_4     0.1
          A     A_6       F        F_4     0.4
          A     A_7       F        F_4     0.6
          B     B_1       C        C_1     0.6
          B     B_1       C        C_10    0.1
          B     B_1       D        D_1     0.4
          B     B_1       E        E_1     0.6
          B     B_1       E        E_2     0.2
          B     B_1       E        E_5     0.3
          B     B_1       E        E_4     0.6
          B     B_1       F        F_1     0.4
          B     B_1       F        F_4     0.9
          C     C_1       D        D_1     0.8
          C     C_1       E        E_1     0.6
          C     C_1       E        E_2     0.4
          C     C_1       E        E_4     0.3
          C     C_1       E        E_5     0.2
          C     C_1       F        F_1     0.3
          C     C_1       F        F_4     0.4
          C     C_10      D        D_1     0.2
          C     C_10      E        E_1     0.3
          C     C_10      E        E_2     0.4
          C     C_10      E        E_5     0.3
          C     C_10      E        E_4     0.4
          C     C_10      F        F_1     0.3
          C     C_10      F        F_4     0.2
          D     D_1       F        F_4     1
          D     D_1       E        E_2     0.5
          D     D_1       E        E_5     0.3
          D     D_1       E        E_4     0.2
          D     D_1       F        F_1     0.5
          D     D_1       F        F_4     0.2
          E     E_1       F        F_1     0.9
          E     E_1       F        F_4     0.2
          E     E_2       F        F_1     0.3
          E     E_2       F        F_4     0.2
          E     E_5       F        F_1     0.5
          E     E_5       F        F_4     0.3
          E     E_4       F        F_1     0.6
          E     E_4       F        F_4     0.3

The matrix output I want is:

          As                         Hs                Ts
          A_1 B_1 C_1 D_1 E_1 E_2    A_6 E_5 F_1 F_4   A_7 C_10 E_4
As   A_1      1   1   1   1   0.8        0.3 0.3 0.1        0.3 0.1
     B_1  1       0.6 0.4 0.6 0.2    0.5 0.3 0.4 0.9   0.3  0.1 0.6
     C_1  1   0.6     0.8 0.6 0.4    0.4 0.2 0.3 0.4   0.3      0.3
     D_1  1   0.4 0.8     1   0.5    0.2 0.3 0.5 0.2   0.3  0.2 0.2
     E_1  1   0.6 0.6 1              0.5         0.2   0.4  0.3 
     E_2  0.8 0.2 0.4 1              0.2         0.2   0.5  0.4

Hs   A_6      0.5 0.4 0.2 0.5 0.2        0.3 0.3 0.4        0.5 0.4
     E_5  0.3 0.3 0.2 0.3            0.3               0.6  0.3 
     F_1  0.3 0.4 0.3 0.5 0.9 0.3    0.3               0.6  0.3 0.6
     F_4  0.1 0.9 0.4 0.2 0.2 0.2    0.4               0.6  0.2 0.3

Ts   A_7      0.3 0.3 0.3 0.4 0.5        0.6 0.6 0.6        0.3 0.6
     C_10 0.3 0.1                    0.5               0.3      0.4
     E_4  0.1 0.6 0.3 0.2            0.4               0.6  0.4 

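Note that pairs without a score should be empty in the output matrix.

Should I try pd.crosstab? df.pivot_table? groupby and unstack?

How can I get the desired output? Any suggestions would be greatly appreciated. Thanks.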

1 answer:

Answer 0: (score: 0)

Here is an example solution. The hard part is ordering the data the way you want. I used a different, smaller sample:

import pandas as pd
import numpy as np
idx ="""
grp  id   
As  A_1
As  B_1
As  C_1 
As  D_1
As  E_1
As  E_2
Hs  A_6
Hs  E_5
Hs  F_1
Hs  F_4
Ts  A_7
Ts  C_10
Ts  E_4 
"""
data="""
      ID1   1        ID2        2       SCORE
      A     A_1       B        B_1     1
      A     F_1       B        B_1     1          
      A     A_6       B        E_2     0.5
      A     A_7       B        B_1     0.3
      A     A_1       C        C_1     1
      A     A_6       C        C_1     0.4
      A     A_7       C        E_5     0.3
      A     A_1       C        C_10    0.3
      A     A_6       C        C_10    0.5
      A     A_7       C        C_10    0.3
      A     A_1       D        D_1     1
      A     A_6       D        D_1     0.2
      A     A_7       D        D_1     0.3
      A     A_7       E        E_4     0.6
      A     A_1       F        E_1     0.3
      A     E_5       F        F_1     0.3
      A     A_7       F        F_1     0.6
      A     A_1       F        F_4     0.1
      A     A_6       F        F_4     0.4
       """

# pd.compat.StringIO no longer exists in current pandas; io.StringIO replaces it
from io import StringIO

df = pd.read_csv(StringIO(data), sep=r'\s+')
ix = pd.read_csv(StringIO(idx), sep=r'\s+')
df = df.drop(['ID1', 'ID2'], axis=1)

# Mirror the table: take a copy with columns '1' and '2' swapped and stack it
# under the original, so that every pair appears in both directions.
df1 = df.copy(deep=True)
df1.columns = ['2', '1', 'SCORE']
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([df, df1], sort=False)

# Look up the group name (As, Hs, Ts) of each id in column '1'
# and fold it into the label
df = pd.merge(df, ix, left_on='1', right_on='id')
df['1'] = '(' + df['grp'].map(str) + ', ' + df['1'].map(str) + ')'
df = df.drop(['grp', 'id'], axis=1)

# Same for column '2'
df = pd.merge(df, ix, left_on='2', right_on='id')
df['2'] = '(' + df['grp'].map(str) + ', ' + df['2'].map(str) + ')'
df = df.drop(['grp', 'id'], axis=1)

# Group by the two labels and unstack to build the cross table;
# pairs without a score are shown as blanks
df = df.groupby(['1', '2']).SCORE.max().unstack().fillna(' ')

print(df)

Result:

2          (As, A_1) (As, B_1) (As, C_1)  ... (Ts, A_7) (Ts, C_10) (Ts, E_4)
1                                         ...                               
(As, A_1)                    1         1  ...                  0.3          
(As, B_1)          1                      ...       0.3                     
(As, C_1)          1                      ...                               
(As, D_1)          1                      ...       0.3                     
(As, E_1)        0.3                      ...                               
(As, E_2)                                 ...                               
(Hs, A_6)                            0.4  ...                  0.5          
(Hs, E_5)                                 ...       0.3                     
(Hs, F_1)                    1            ...       0.6                     
(Hs, F_4)        0.1                      ...                               
(Ts, A_7)                  0.3            ...                  0.3       0.6
(Ts, C_10)       0.3                      ...       0.3                     
(Ts, E_4)                                 ...       0.6                     
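
For comparison, the mirror-then-aggregate idea can also be written with pivot_table, which the question asked about. This is only a minimal sketch, not part of the original answer: it uses three rows copied from the question's df2, the variable names pairs, mirrored, both and matrix are made up for illustration, and it leaves missing pairs as NaN instead of filling them with blanks.

```python
import pandas as pd

# Three rows taken from the question's df2 (ID1/ID2 columns omitted).
pairs = pd.DataFrame({
    "1": ["A_1", "A_6", "A_7"],
    "2": ["B_1", "B_1", "B_1"],
    "SCORE": [1.0, 0.5, 0.3],
})

# Mirror every pair so that (a, b) and (b, a) are both present.
mirrored = pairs.rename(columns={"1": "2", "2": "1"})
both = pd.concat([pairs, mirrored], ignore_index=True, sort=False)

# pivot_table with aggfunc="max" plays the role of
# groupby(['1', '2']).SCORE.max().unstack(); unscored pairs stay NaN.
matrix = both.pivot_table(index="1", columns="2", values="SCORE", aggfunc="max")
print(matrix)
```

Keeping NaN rather than ' ' leaves the columns numeric, which may be preferable if the matrix is used for further computation; fillna(' ') can still be applied just before printing.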

Another solution, using a MultiIndex for the column header:

# data and idx are the strings defined above
from io import StringIO

df = pd.read_csv(StringIO(data), sep=r'\s+')
ix = pd.read_csv(StringIO(idx), sep=r'\s+')
df = df.drop(['ID1', 'ID2'], axis=1)
df1 = df.copy(deep=True)
df1.columns = ['2', '1', 'SCORE']

# Desired column order within each group
As = ['A_1', 'B_1', 'C_1', 'D_1', 'E_1', 'E_2']
Hs = ['A_6', 'E_5', 'F_1', 'F_4']
Ts = ['A_7', 'C_10', 'E_4']

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([df, df1], sort=False)

# The first merge adds the group of column '1'; after the second merge the two
# group columns are named grp_x (for '1') and grp_y (for '2')
df = pd.merge(df, ix, left_on='1', right_on='id')
df = df.drop(['id'], axis=1)
df = pd.merge(df, ix, left_on='2', right_on='id')
df = df.drop(['id'], axis=1)

df = df.groupby(['grp_x', '1', '2']).SCORE.max().unstack().fillna(' ')
df = df[As + Hs + Ts]
# One group label per column, in the same order as the columns
header = ['As'] * len(As) + ['Hs'] * len(Hs) + ['Ts'] * len(Ts)
df.columns = pd.MultiIndex.from_tuples(list(zip(header, df.columns)))
print(df)

Result:

             As                            Hs                  Ts          
            A_1  B_1  C_1  D_1  E_1  E_2  A_6  E_5  F_1  F_4  A_7 C_10  E_4
grp_x 1                                                                    
As    A_1          1    1    1  0.3                      0.1       0.3     
      B_1     1                                       1       0.3          
      C_1     1                           0.4                              
      D_1     1                           0.2                 0.3          
      E_1   0.3                                                            
      E_2                                 0.5                              
Hs    A_6             0.4  0.2       0.5                 0.4       0.5     
      E_5                                           0.3       0.3          
      F_1          1                           0.3            0.6          
      F_4   0.1                           0.4                              
Ts    A_7        0.3       0.3                 0.3  0.6            0.3  0.6
      C_10  0.3                           0.5                 0.3          
      E_4                                                     0.6  

With your sample data, the result is:

             As                            Hs                  Ts          
            A_1  B_1  C_1  D_1  E_1  E_2  A_6  E_5  F_1  F_4  A_7 C_10  E_4
grp_x 1                                                                    
As    A_1          1    1    1    1  0.8       0.3  0.3  0.1       0.3  0.1
      B_1     1       0.6  0.4  0.6  0.2  0.5  0.3  0.4  0.9  0.3  0.1  0.6
      C_1     1  0.6       0.8  0.6  0.4  0.4  0.2  0.3  0.4  0.3       0.3
      D_1     1  0.4  0.8            0.5  0.2  0.3  0.5    1  0.3  0.2  0.2
      E_1     1  0.6  0.6                 0.5       0.9  0.2  0.4  0.3     
      E_2   0.8  0.2  0.4  0.5            0.2       0.3  0.2  0.5  0.4     
Hs    A_6        0.5  0.4  0.2  0.5  0.2       0.3  0.3  0.4       0.5  0.4
      E_5   0.3  0.3  0.2  0.3            0.3       0.5  0.3  0.6  0.3     
      F_1   0.3  0.4  0.3  0.5  0.9  0.3  0.3  0.5            0.6  0.3  0.6
      F_4   0.1  0.9  0.4    1  0.2  0.2  0.4  0.3            0.6  0.2  0.3
Ts    A_7        0.3  0.3  0.3  0.4  0.5       0.6  0.6  0.6       0.3  0.6
      C_10  0.3  0.1       0.2  0.3  0.4  0.5  0.3  0.3  0.2  0.3       0.4
      E_4   0.1  0.6  0.3  0.2            0.4       0.6  0.3  0.6  0.4
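
The idx lookup used in the answer was typed by hand. As a rough sketch, it could probably be derived from a table shaped like the question's df1 instead. The snippet below assumes that each cell holds comma-separated ids and that empty cells are missing values; it uses only rows A, B and E of df1, and the name long is hypothetical. Series.explode needs pandas 0.25 or later.

```python
import pandas as pd

# Rows A, B and E of the question's df1; empty cells are represented as None.
df1 = pd.DataFrame({
    "ID": ["A", "B", "E"],
    "As": ["A_1", "B_1", "E_1,E_2"],
    "Hs": ["A_6", None, "E_5"],
    "Ts": ["A_7", None, "E_4"],
})

# Wide -> long: one row per (ID, group) cell, dropping the empty cells.
long = df1.melt(id_vars="ID", var_name="grp", value_name="id").dropna(subset=["id"])

# Split "E_1,E_2" into a list, then give each id its own row.
long["id"] = long["id"].str.split(",")
ix = long.explode("id")[["grp", "id"]].reset_index(drop=True)
ix["id"] = ix["id"].str.strip()

print(ix)
```

The resulting ix has the same two columns (grp, id) as the hand-written idx table, so it should be usable in the merge calls above without further changes.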