I have a data frame like this:
df = pd.DataFrame({'Item': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
                   'Name': ['Tom', 'John', 'Paul', 'Tom', 'Frank', 'Tom', 'John', 'Richard', 'James'],
                   'Weight': [2, 2, 2, 3, 3, 5, 5, 5, 5]})
df
Item  Name     Weight
A     Tom      4
A     John     4
A     Paul     4
B     Tom      3
B     Frank    3
C     Tom      5
C     John     5
C     Richard  5
C     James    5
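For each person, I want the list of people they collaborated with, together with the weighted collaboration for each of them (1/Weight, summed over the items they share):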
df1
Name     People                               Times
Tom      [John, Paul, Frank, Richard, James]  [(1/4+1/5), 1/4, 1/3, 1/5, 1/5]
John     [Tom, Richard, James]                [(1/4+1/5), 1/5, 1/5]
Paul     [Tom, John]                          [1/4, 1/4]
Frank    [Tom]                                [1/3]
Richard  [Tom, John, James]                   [1/5, 1/5, 1/5]
James    [Tom, John, Richard]                 [1/5, 1/5, 1/5]
To count the collaborations without taking the weight into account, I did:
#merge M:N by column Item
df1 = pd.merge(df, df, on=['Item'])
#remove duplicity - column Name_x == Name_y
df1 = df1[~(df1['Name_x'] == df1['Name_y'])]
#print df1
#create lists
df1 = df1.groupby('Name_x')['Name_y'].apply(lambda x: x.tolist()).reset_index()
print(df1)
   Name_x   Name_y
0  Frank    [Tom]
1  James    [Tom, John, Richard]
2  John     [Tom, Paul, Tom, Richard, James]
3  Paul     [Tom, John]
4  Richard  [Tom, John, James]
5  Tom      [John, Paul, Frank, John, Richard, James]
#get count by np.unique
df1['People'] = df1['Name_y'].apply(lambda a: np.unique((a), return_counts =True)[0])
df1['times'] = df1['Name_y'].apply(lambda a: np.unique((a), return_counts =True)[1])
#remove column Name_y
df1 = df1.drop('Name_y', axis=1).rename(columns={'Name_x':'Name'})
print(df1)
   Name     People                               times
0  Frank    [Tom]                                [1]
1  James    [John, Richard, Tom]                 [1, 1, 1]
2  John     [James, Paul, Richard, Tom]          [1, 1, 1, 2]
3  Paul     [John, Tom]                          [1, 1]
4  Richard  [James, John, Tom]                   [1, 1, 1]
5  Tom      [Frank, James, John, Paul, Richard]  [1, 1, 2, 1, 1]
The last data frame gives me the collaboration count for every pair, but I want the collaboration count to be weighted.
Answer 0 (score: 0)
Starting from:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Item': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
                   'Name': ['Tom', 'John', 'Paul', 'Tom', 'Frank', 'Tom', 'John', 'Richard', 'James'],
                   'Weight': [2, 2, 2, 3, 3, 5, 5, 5, 5]})
df1 = pd.merge(df, df, on=['Item'])
df1 = df1[~(df1['Name_x'] == df1['Name_y'])].set_index(['Name_x', 'Name_y']).drop(['Item', 'Weight_y'], axis=1)
You can use .apply() to compute the values and .unstack() to reshape them into wide format:
collab = df1.groupby(level=['Name_x', 'Name_y']).apply(lambda x: np.sum(1/x)).unstack().loc[:, 'Weight_x']
Name_y      Frank  James  John  Paul  Richard       Tom
Name_x
Frank         NaN    NaN   NaN   NaN      NaN  0.333333
James         NaN    NaN   0.2   NaN      0.2  0.200000
John          NaN    0.2   NaN   0.5      0.2  0.700000
Paul          NaN    NaN   0.5   NaN      NaN  0.500000
Richard       NaN    0.2   0.2   NaN      NaN  0.200000
Tom      0.333333    0.2   0.7   0.5      0.2       NaN
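As a cross-check on the table above, here is a minimal sketch of the same weighted sum written without .apply(). It assumes the data frame from this answer (Weight 2 for item A); the names `pairs`, `contrib` and `collab_alt` are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'Item': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
    'Name': ['Tom', 'John', 'Paul', 'Tom', 'Frank', 'Tom', 'John', 'Richard', 'James'],
    'Weight': [2, 2, 2, 3, 3, 5, 5, 5, 5],
})

# Self-join on Item: one row per ordered pair of names sharing an item.
pairs = df.merge(df, on='Item', suffixes=('_x', '_y'))

# Drop the rows that pair a person with themselves.
pairs = pairs[pairs['Name_x'] != pairs['Name_y']].copy()

# Each shared item contributes 1/Weight to the pair (illustrative column name).
pairs['contrib'] = 1.0 / pairs['Weight_x']

# Sum the contributions per ordered pair, then pivot Name_y into columns.
collab_alt = pairs.groupby(['Name_x', 'Name_y'])['contrib'].sum().unstack()
print(collab_alt.round(6))
```

For example, Tom and John share item A (weight 2) and item C (weight 5), so their entry is 1/2 + 1/5 = 0.7, which matches the table. Pairs that never share an item stay NaN.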
Then iterate over the rows and convert them to lists:
df = pd.DataFrame(columns=['People', 'Times'])
for p, data in collab.iterrows():
    # keep only the people this person actually collaborated with
    s = data.dropna()
    df.loc[p] = [s.index.tolist(), s.values]
         People                               Times
Frank    [Tom]                                [0.333333333333]
James    [John, Richard, Tom]                 [0.2, 0.2, 0.2]
John     [James, Paul, Richard, Tom]          [0.2, 0.5, 0.2, 0.7]
Paul     [John, Tom]                          [0.5, 0.5]
Richard  [James, John, Tom]                   [0.2, 0.2, 0.2]
Tom      [Frank, James, John, Paul, Richard]  [0.333333333333, 0.2, 0.7, 0.5, 0.2]
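The row loop can probably be avoided by staying in long format and aggregating into lists. The following is a sketch under the same assumptions as the data above, not the original answer's method; the names `pairs`, `long`, `contrib` and `result` are made up for illustration, and named aggregation needs a reasonably recent pandas (0.25 or later).

```python
import pandas as pd

df = pd.DataFrame({
    'Item': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
    'Name': ['Tom', 'John', 'Paul', 'Tom', 'Frank', 'Tom', 'John', 'Richard', 'James'],
    'Weight': [2, 2, 2, 3, 3, 5, 5, 5, 5],
})

# Ordered pairs of distinct people that share an item.
pairs = df.merge(df, on='Item', suffixes=('_x', '_y'))
pairs = pairs[pairs['Name_x'] != pairs['Name_y']]

# One row per ordered pair with the summed 1/Weight contributions.
long = (pairs.assign(contrib=1.0 / pairs['Weight_x'])
             .groupby(['Name_x', 'Name_y'], as_index=False)['contrib'].sum())

# Collect partners and weights into parallel lists, one row per person.
result = (long.groupby('Name_x')
              .agg(People=('Name_y', list), Times=('contrib', list))
              .rename_axis('Name'))
print(result)
```

Because groupby sorts its keys by default, the partners in each `People` list come out in alphabetical order, and each entry of `Times` lines up with the partner at the same position.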