My data frame looks like this:
project_id member_id
1 A
1 B
1 C
2 A
2 D
2 B
I want to find everyone who has worked together on at least one project. The resulting data frame should therefore look like this:
member_id co_member_id
A B
A C
A D
B A
B C
B D
C A
C B
D A
D B
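One approach I can think of is df.groupby('project_id'), but then I would have to compute the pairwise permutations of the unique members within each project_id and drop any duplicate pairs from the resulting df.
I would like to know whether there is a more efficient way to do this.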
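For reference, the groupby idea described above can be sketched as follows. This is only a baseline under the assumption that the data sits in a frame named df with the two columns shown; the names pairs and result are illustrative and not part of the original question.

```python
from itertools import permutations

import pandas as pd

df = pd.DataFrame({
    'project_id': [1, 1, 1, 2, 2, 2],
    'member_id': ['A', 'B', 'C', 'A', 'D', 'B'],
})

# For each project, list every ordered pair of distinct members
pairs = [
    pair
    for _, members in df.groupby('project_id')['member_id']
    for pair in permutations(members.unique(), 2)
]

# Build the result frame and drop pairs that occur in more than one project
result = (
    pd.DataFrame(pairs, columns=['member_id', 'co_member_id'])
    .drop_duplicates()
    .sort_values(['member_id', 'co_member_id'])
    .reset_index(drop=True)
)
print(result)
```

The pair (A, B) is produced by both projects, which is why the drop_duplicates step is needed.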
Answer 0 (score: 3)
Here is one approach that does not rely on pandas:
from itertools import permutations
from collections import defaultdict
project_id = [1, 1, 1, 2, 2, 2]
member_id = ['A', 'B', 'C', 'A', 'D', 'B']
d = defaultdict(list)
# create dictionary of project -> members
for i, j in zip(project_id, member_id):
    d[i].append(j)
# permute pairs and get union
set.union(*(set(permutations(v, 2)) for v in d.values()))
# {('A', 'B'),
# ('A', 'C'),
# ('A', 'D'),
# ('B', 'A'),
# ('B', 'C'),
# ('B', 'D'),
# ('C', 'A'),
# ('C', 'B'),
# ('D', 'A'),
# ('D', 'B')}
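If the result is needed as a data frame in the shape the question asks for, the set of pairs can be handed to pandas afterwards. A minimal sketch, assuming pandas is available for this last step only; the names pairs and result are illustrative.

```python
from collections import defaultdict
from itertools import permutations

import pandas as pd

project_id = [1, 1, 1, 2, 2, 2]
member_id = ['A', 'B', 'C', 'A', 'D', 'B']

# Map each project to the list of its members
d = defaultdict(list)
for i, j in zip(project_id, member_id):
    d[i].append(j)

# Union of the ordered pairs from every project
pairs = set.union(*(set(permutations(v, 2)) for v in d.values()))

# Sets are unordered, so sort the pairs to get a stable row order
result = pd.DataFrame(sorted(pairs), columns=['member_id', 'co_member_id'])
print(result)
```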
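Answer 1 (score: 2)
A nice answer above from jp_data_analysis. However, it discards the project information, which you may or may not need. The code below returns all of the information in three lines, without any explicit loops.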
import pandas as pd
# Create data frame
project_id = [1, 1, 1, 2, 2, 2]
member_id = ['A', 'B', 'C', 'A', 'D', 'B']
df = pd.DataFrame({'project_id': project_id, 'member_id': member_id})
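# New data frame with co_member_id
df1 = pd.merge(df, df, how='inner', on=['project_id'])
df1 = df1[df1['member_id_x'] != df1['member_id_y']]
# Rename by name and set the column order explicitly; assigning df1.columns
# positionally would mislabel columns whenever the merge orders them differently
df1 = df1.rename(columns={'member_id_x': 'member_id', 'member_id_y': 'co_member_id'})[['member_id', 'project_id', 'co_member_id']]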
print(df1)
   member_id  project_id co_member_id
1          A           1            B
2          A           1            C
3          B           1            A
5          B           1            C
6          C           1            A
7          C           1            B
10         A           2            D
11         A           2            B
12         D           2            A
14         D           2            B
15         B           2            A
16         B           2            D
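The table above still lists (A, B) and (B, A) once per shared project. To get the exact frame the question asks for, one option is to drop project_id and deduplicate. A minimal sketch, not part of the original answer; the name pairs is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'project_id': [1, 1, 1, 2, 2, 2],
    'member_id': ['A', 'B', 'C', 'A', 'D', 'B'],
})

# Self-merge on project_id, then drop rows pairing a member with themselves
df1 = pd.merge(df, df, how='inner', on='project_id')
df1 = df1[df1['member_id_x'] != df1['member_id_y']]
df1 = df1.rename(columns={'member_id_x': 'member_id', 'member_id_y': 'co_member_id'})

# Keep only the pair columns and remove pairs repeated across projects
pairs = (
    df1[['member_id', 'co_member_id']]
    .drop_duplicates()
    .sort_values(['member_id', 'co_member_id'])
    .reset_index(drop=True)
)
print(pairs)
```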
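A multi-index and groupby give you very concise results:
# Index the pair table by (member_id, co_member_id)
df3 = df1.set_index(['member_id', 'co_member_id'])
# Note: this reassigns df3, so the multi-indexed frame above is discarded.
# The printed result is the original df grouped by project, with sum() concatenating the member IDs.
df3 = df.groupby('project_id').sum()
print(df3)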
           member_id
project_id
1                ABC
2                ADB
Answer 2 (score: 0)
You can try something like this:
import itertools

project_id = [1, 1, 1, 2, 2, 2]
member_id = ['A', 'B', 'C', 'A', 'D', 'B']

# Map each project to the list of its members
track = {}
for project, member in zip(project_id, member_id):
    track.setdefault(project, []).append(member)

# Collect each ordered pair once, skipping pairs already seen in another project
combination = []
for members in track.values():
    for pair in itertools.permutations(members, r=2):
        if pair not in combination:
            combination.append(pair)

# Group the sorted pairs by their first member
print({m: list(l) for m, l in itertools.groupby(sorted(combination), lambda x: x[0])})
output:
{'A': [('A', 'B'), ('A', 'C'), ('A', 'D')], 'B': [('B', 'A'), ('B', 'C'), ('B', 'D')], 'C': [('C', 'A'), ('C', 'B')], 'D': [('D', 'A'), ('D', 'B')]}