Python: how to compute collaboration between pairs in a pandas DataFrame?

Asked: 2016-04-29 08:47:01

Tags: python pandas group-by unique

I have a DataFrame like this:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Item': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
                   'Name': ['Tom', 'John', 'Paul', 'Tom', 'Frank', 'Tom', 'John', 'Richard', 'James'],
                   'Weight': [2, 2, 2, 3, 3, 5, 5, 5, 5]})
df 
Item Name  Weight
A    Tom     4
A    John    4
A    Paul    4
B    Tom     3
B    Frank   3
C    Tom     5
C    John    5
C    Richard 5
C    James   5 

For each person I want the list of people who share an item with them, with Times given by the sum of 1/Weight over the shared items:
df1 
Name              People                          Times
Tom     [John, Paul, Frank, Richard, James]       [(1/4+1/5),1/4,1/3,1/5,1/5]
John    [Tom, Richard, James]                     [(1/4+1/5),1/5,1/5]
Paul    [Tom, John]                               [1/4,1/4]
Frank   [Tom]                                     [1/3]
Richard [Tom, John, James]                        [1/5,1/5,1/5]
James   [Tom, John, Richard]                      [1/5,1/5,1/5]

To count the collaborations without taking the weight into account, I did:

#merge M:N by column Item
df1 = pd.merge(df, df, on=['Item'])

#remove duplicity - column Name_x == Name_y
df1 = df1[~(df1['Name_x'] == df1['Name_y'])]
#print df1

#create lists
df1 = df1.groupby('Name_x')['Name_y'].apply(lambda x: x.tolist()).reset_index()
print df1
    Name_x                                     Name_y
0    Frank                                      [Tom]
1    James                       [Tom, John, Richard]
2     John           [Tom, Paul, Tom, Richard, James]
3     Paul                                [Tom, John]
4  Richard                         [Tom, John, James]
5      Tom  [John, Paul, Frank, John, Richard, James]


#get count by np.unique
df1['People'] = df1['Name_y'].apply(lambda a: np.unique(a, return_counts=True)[0])
df1['times'] = df1['Name_y'].apply(lambda a: np.unique(a, return_counts=True)[1])
#remove column Name_y
df1 = df1.drop('Name_y', axis=1).rename(columns={'Name_x':'Name'})
print df1
      Name                               People            times
0    Frank                                [Tom]              [1]
1    James                 [John, Richard, Tom]        [1, 1, 1]
2     John          [James, Paul, Richard, Tom]     [1, 1, 1, 2]
3     Paul                          [John, Tom]           [1, 1]
4  Richard                   [James, John, Tom]        [1, 1, 1]
5      Tom  [Frank, James, John, Paul, Richard]  [1, 1, 2, 1, 1]

In this last DataFrame I have the collaboration counts between all pairs, but I would like the weighted collaboration (summing 1/Weight) instead.

1 answer:

Answer 0 (score: 0)

Starting from:

df = pd.DataFrame({'Item': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
                   'Name': ['Tom', 'John', 'Paul', 'Tom', 'Frank', 'Tom', 'John', 'Richard', 'James'],
                   'Weight': [2, 2, 2, 3, 3, 5, 5, 5, 5]})

df1 = pd.merge(df, df, on=['Item'])
df1 = df1[~(df1['Name_x'] == df1['Name_y'])].set_index(['Name_x', 'Name_y']).drop(['Item', 'Weight_y'], axis=1)

You can use .apply() to create the values and .unstack() to pivot to wide format:

collab = df1.groupby(level=['Name_x', 'Name_y']).apply(lambda x: np.sum(1/x)).unstack().loc[:, 'Weight_x']

Name_y      Frank  James  John  Paul  Richard       Tom
Name_x                                                 
Frank         NaN    NaN   NaN   NaN      NaN  0.333333
James         NaN    NaN   0.2   NaN      0.2  0.200000
John          NaN    0.2   NaN   0.5      0.2  0.700000
Paul          NaN    NaN   0.5   NaN      NaN  0.500000
Richard       NaN    0.2   0.2   NaN      NaN  0.200000
Tom      0.333333    0.2   0.7   0.5      0.2       NaN
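As a side note, the same wide table can be built without .apply() by taking the reciprocal of the weight column first, so a plain groupby-sum does the work. This is just an equivalent sketch (the variable name pairs is mine, not from the answer):

```python
import pandas as pd

df = pd.DataFrame({'Item': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
                   'Name': ['Tom', 'John', 'Paul', 'Tom', 'Frank', 'Tom', 'John', 'Richard', 'James'],
                   'Weight': [2, 2, 2, 3, 3, 5, 5, 5, 5]})

# Self-join on Item to pair every co-worker, then drop self-pairs.
pairs = pd.merge(df, df, on='Item')
pairs = pairs[pairs['Name_x'] != pairs['Name_y']]

# Reciprocal first, then groupby-sum replaces the apply(lambda x: np.sum(1/x)).
collab = (1 / pairs['Weight_x']).groupby([pairs['Name_x'], pairs['Name_y']]).sum().unstack()
```

The result matches the table above, e.g. collab.loc['John', 'Tom'] is 0.7 (1/2 + 1/5).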

Then iterate over the rows and convert to lists:

df = pd.DataFrame(columns=['People', 'Times'])
for p, data in collab.iterrows():
    s = data.dropna()
    df.loc[p] = [s.index.tolist(), s.values]

                                      People  \
Frank                                  [Tom]   
James                   [John, Richard, Tom]   
John             [James, Paul, Richard, Tom]   
Paul                             [John, Tom]   
Richard                   [James, John, Tom]   
Tom      [Frank, James, John, Paul, Richard]   

                                        Times  
Frank                        [0.333333333333]  
James                         [0.2, 0.2, 0.2]  
John                     [0.2, 0.5, 0.2, 0.7]  
Paul                               [0.5, 0.5]  
Richard                       [0.2, 0.2, 0.2]  
Tom      [0.333333333333, 0.2, 0.7, 0.5, 0.2]
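The whole pipeline can also be written without the wide intermediate table, collecting the lists directly with a second groupby. A sketch under the assumption of a newer pandas (named aggregation requires pandas >= 0.25); partners come out sorted alphabetically per person:

```python
import pandas as pd

df = pd.DataFrame({'Item': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
                   'Name': ['Tom', 'John', 'Paul', 'Tom', 'Frank', 'Tom', 'John', 'Richard', 'James'],
                   'Weight': [2, 2, 2, 3, 3, 5, 5, 5, 5]})

# Pair every two people sharing an item, then drop self-pairs.
pairs = pd.merge(df, df, on='Item')
pairs = pairs[pairs['Name_x'] != pairs['Name_y']].copy()

# Weighted collaboration of one shared item is 1/Weight; sum over all shared items.
pairs['Times'] = 1 / pairs['Weight_x']
weighted = pairs.groupby(['Name_x', 'Name_y'])['Times'].sum()

# Collapse to one row per person: list of partners and their weighted counts.
result = weighted.reset_index().groupby('Name_x').agg(
    People=('Name_y', list), Times=('Times', list))
```

For example, result.loc['John', 'People'] is ['James', 'Paul', 'Richard', 'Tom'] with Times [0.2, 0.5, 0.2, 0.7], matching the printout above.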