您好,我有一个像这样的数据框
df = pd.DataFrame( {'Item':['A','A','A','B','B','C','C','C','C'],
'b':[Tom,John,Paul,Tom,Frank,Tom, John, Richard, James]})
df
Item Name
A Tom
A John
A Paul
B Tom
B Frank
C Tom
C John
C Richard
C James
对于每个人,我想要具有相同项目和时间的人员列表
df1
Name People Times
Tom [John, Paul, Frank, Richard, James] [2,1,1,1,1]
John [Tom, Richard, James] [2,1,1]
Paul [Tom, John] [1,1]
Frank [Tom] [1]
Richard [Tom, John, James] [1,1,1]
James [Tom, John, Richard] [1,1,1]
到目前为止,我已经尝试过这个来计算不同人物的不同物品
df.groupby("Item").agg({ "Name": pd.Series.nunique})
Name
Item
A 3
B 2
C 4
和
df.groupby("Name").agg({ "Item": pd.Series.nunique})
Item
Name
Frank 1
James 1
John 2
Paul 1
Richard 1
Tom 3
答案 0 :(得分:1)
您可以将numpy.unique
用于列表中的计数项目:
print df
Item Name
0 A Tom
1 A John
2 A Paul
3 B Tom
4 B Frank
5 C Tom
6 C John
7 C Richard
8 C James
#merge M:N by column Item
df1 = pd.merge(df, df, on=['Item'])
#remove duplicity - column Name_x == Name_y
df1 = df1[~(df1['Name_x'] == df1['Name_y'])]
#print df1
#create lists
df1 = df1.groupby('Name_x')['Name_y'].apply(lambda x: x.tolist()).reset_index()
print df1
Name_x Name_y
0 Frank [Tom]
1 James [Tom, John, Richard]
2 John [Tom, Paul, Tom, Richard, James]
3 Paul [Tom, John]
4 Richard [Tom, John, James]
5 Tom [John, Paul, Frank, John, Richard, James]
#get count by np.unique
df1['People'] = df1['Name_y'].apply(lambda a: np.unique((a), return_counts =True)[0])
df1['times'] = df1['Name_y'].apply(lambda a: np.unique((a), return_counts =True)[1])
#remove column Name_y
df1 = df1.drop('Name_y', axis=1).rename(columns={'Name_x':'Name'})
print df1
Name People times
0 Frank [Tom] [1]
1 James [John, Richard, Tom] [1, 1, 1]
2 John [James, Paul, Richard, Tom] [1, 1, 1, 2]
3 Paul [John, Tom] [1, 1]
4 Richard [James, John, Tom] [1, 1, 1]
5 Tom [Frank, James, John, Paul, Richard] [1, 1, 2, 1, 1]
答案 1 :(得分:1)
ct = pd.crosstab(df.Name, df.Item)
d = {Name: [(name, val)
for name, val in ct.loc[ct.index != Name, ct.ix[Name] == 1]
.sum(axis=1).iteritems() if val]
for Name in df.Name.unique()}
>>> pd.DataFrame({'Name': d.keys(),
'People': [[t[0] for t in d[name]] for name in d],
'times': [[t[1] for t in d[name]] for name in d]})
Name People times
0 Richard [James, John, Tom] [1, 1, 1]
1 James [John, Richard, Tom] [1, 1, 1]
2 Tom [Frank, James, John, Paul, Richard] [1, 1, 2, 1, 1]
3 Frank [Tom] [1]
4 Paul [John, Tom] [1, 1]
5 John [James, Paul, Richard, Tom] [1, 1, 1, 2]
<强>说明强>
交叉表按项目类型为您提供每个名称的位置。
>>> ct
Item A B C
Name
Frank 0 1 0
James 0 0 1
John 1 0 1
Paul 1 0 0
Richard 0 0 1
Tom 1 1 1
然后缩小此表。对于正在构建密钥的每个名称,将从表中删除该名称,并且仅选择显示该名称的列。
使用&#39; John&#39;举个例子:
>>> ct.loc[ct.index != 'John', ct.ix['John'] == 1]
Item A C
Name
Frank 0 0
James 0 1
Paul 1 0
Richard 0 1
Tom 1 1
然后将这个结果沿着行求和,得到John的结果:
Name
Frank 0
James 1
Paul 1
Richard 1
Tom 2
dtype: int64
然后迭代这些结果以将它们打包成元组对并移除值为零的情况(例如上面的Frank)。
>>> [(name, val) for name, val in
ct.loc[ct.index != 'John', ct.ix['John'] == 1].sum(axis=1).iteritems() if val]
[('James', 1), ('Paul', 1), ('Richard', 1), ('Tom', 2)]
使用词典理解为每个名称执行此操作。
>>> d
{'Frank': [('Tom', 1)],
'James': [('John', 1), ('Richard', 1), ('Tom', 1)],
'John': [('James', 1), ('Paul', 1), ('Richard', 1), ('Tom', 2)],
'Paul': [('John', 1), ('Tom', 1)],
'Richard': [('James', 1), ('John', 1), ('Tom', 1)],
'Tom': [('Frank', 1), ('James', 1), ('John', 2), ('Paul', 1), ('Richard', 1)]}
然后使用这个字典创建所需的数据帧,使用嵌套列表解析来解包元组对。