Generate every possible permutation

Time: 2018-02-01 10:03:46

Tags: python pandas dataframe

My DataFrame looks like this:

project_id    member_id
1             A
1             B
1             C
2             A
2             D
2             B

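I want to find every pair of people who have worked together on at least one project. The resulting DataFrame should therefore look like this: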

member_id    co_member_id
A            B
A            C
A            D
B            A
B            C
B            D
C            A
C            B
D            A
D            B

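One approach I can think of is df.groupby('project_id'), but then I would have to compute the pairwise permutations of the unique values within each project_id and drop any duplicate pairs from the resulting df.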
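For reference, the groupby approach described here could be sketched roughly as follows. This is only an illustration built on the sample data from the question; the names `pairs` and `result` are made up for the example.

```python
from itertools import permutations

import pandas as pd

df = pd.DataFrame({
    "project_id": [1, 1, 1, 2, 2, 2],
    "member_id": ["A", "B", "C", "A", "D", "B"],
})

# Collect ordered pairs of distinct members within each project.
pairs = [
    pair
    for _, members in df.groupby("project_id")["member_id"]
    for pair in permutations(members.unique(), 2)
]

# Drop pairs that occur in more than one project, then sort for readability.
result = (
    pd.DataFrame(pairs, columns=["member_id", "co_member_id"])
    .drop_duplicates()
    .sort_values(["member_id", "co_member_id"])
    .reset_index(drop=True)
)
print(result)
```

The pair (A, B) is produced by both projects, which is why the `drop_duplicates` step is needed.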

I would like to know whether there is a more efficient way to do this.

3 answers:

Answer 0: (score: 3)

Here is a functional approach that does not rely on pandas:

from itertools import permutations
from collections import defaultdict

project_id = [1, 1, 1, 2, 2, 2]
member_id = ['A', 'B', 'C', 'A', 'D', 'B']

d = defaultdict(list)

# create dictionary of project -> members
for i, j in zip(project_id, member_id):
    d[i].append(j)

# permute pairs and get union
set.union(*(set(permutations(v, 2)) for v in d.values()))

# {('A', 'B'),
#  ('A', 'C'),
#  ('A', 'D'),
#  ('B', 'A'),
#  ('B', 'C'),
#  ('B', 'D'),
#  ('C', 'A'),
#  ('C', 'B'),
#  ('D', 'A'),
#  ('D', 'B')}
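To get from this set back to the two-column frame the question asks for, something like the following should work. It is a sketch, not part of the original answer: it uses `set().union(...)` rather than `set.union(*...)` so that an empty input yields an empty set instead of raising, and the name `result` is only illustrative.

```python
from collections import defaultdict
from itertools import permutations

import pandas as pd

project_id = [1, 1, 1, 2, 2, 2]
member_id = ["A", "B", "C", "A", "D", "B"]

# Map each project to the list of its members.
d = defaultdict(list)
for project, member in zip(project_id, member_id):
    d[project].append(member)

# Union of all ordered pairs; starting from an empty set tolerates empty input.
pairs = set().union(*(set(permutations(v, 2)) for v in d.values()))

# Sets are unordered, so sort before building the frame.
result = pd.DataFrame(sorted(pairs), columns=["member_id", "co_member_id"])
print(result)
```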

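Answer 1: (score: 2)

A nice answer from jp_data_analysis above. However, it discards the information about which project each pair shares, which you may or may not need. The code below keeps all of that information in three lines, without any explicit loops.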

import pandas as pd

# Create data frame
project_id = [1, 1, 1, 2, 2, 2]
member_id = ['A', 'B', 'C', 'A', 'D', 'B']
df = pd.DataFrame({'project_id': project_id, 'member_id': member_id})

# New data frame with co_member_id
df1 = pd.merge(df, df, how='inner', on=['project_id'])
df1 = df1[df1['member_id_x'] != df1['member_id_y']]
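# Rename by name, not by position: the merged column order depends on the pandas version
df1 = df1.rename(columns={'member_id_x': 'member_id', 'member_id_y': 'co_member_id'})[['member_id', 'project_id', 'co_member_id']]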

print(df1)

   member_id  project_id co_member_id
1          A           1            B
2          A           1            C
3          B           1            A
5          B           1            C
6          C           1            A
7          C           1            B
10         A           2            D
11         A           2            B
12         D           2            A
14         D           2            B
15         B           2            A
16         B           2            D
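The frame above still lists A and B twice, once per shared project. One possible way to collapse it to the distinct pairs the question asks for, while keeping a count of shared projects, is sketched below. The column name `n_projects` and the variable names are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "project_id": [1, 1, 1, 2, 2, 2],
    "member_id": ["A", "B", "C", "A", "D", "B"],
})

# Self-merge on project_id, renaming the suffixed columns by name.
merged = df.merge(df, on="project_id").rename(
    columns={"member_id_x": "member_id", "member_id_y": "co_member_id"}
)

# Keep only rows that pair a member with someone else.
distinct = merged[merged["member_id"] != merged["co_member_id"]]

# One row per ordered pair, with the number of projects the pair shares.
counts = (
    distinct.groupby(["member_id", "co_member_id"])
    .size()
    .reset_index(name="n_projects")
)
print(counts)
```

Dropping the `n_projects` column leaves exactly the ten rows requested in the question.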

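A MultiIndex and groupby give you very concise results:

df2 = df1.set_index(['member_id', 'co_member_id'])  # index the pairs by (member, co-member)
df3 = df.groupby('project_id').sum()  # concatenates the member_id strings per project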
print(df3)

           member_id
project_id          
1                ABC
2                ADB

Answer 2: (score: 0)

You can try something like this:

project_id = [1, 1, 1, 2, 2, 2]
member_id = ['A', 'B', 'C', 'A', 'D', 'B']

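import itertools

# Map each project to the list of its members
track = {}
for project, member in zip(project_id, member_id):
    track.setdefault(project, []).append(member)

# Collect ordered pairs within each project, skipping pairs already seen
combination = []
for members in track.values():
    for pair in itertools.permutations(members, r=2):
        if pair not in combination:
            combination.append(pair)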


print({m:list(l) for m,l in itertools.groupby(sorted(combination),lambda x:x[0])})

output:

{'A': [('A', 'B'), ('A', 'C'), ('A', 'D')], 'B': [('B', 'A'), ('B', 'C'), ('B', 'D')], 'C': [('C', 'A'), ('C', 'B')], 'D': [('D', 'A'), ('D', 'B')]}
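If only the co-members themselves are needed rather than the full pairs, a dictionary of sets avoids the repeated list membership checks. The following is a minimal sketch using only the standard library; it assumes each member appears at most once per project, and the names `members_by_project` and `co_members` are illustrative.

```python
from collections import defaultdict
from itertools import permutations

project_id = [1, 1, 1, 2, 2, 2]
member_id = ["A", "B", "C", "A", "D", "B"]

# Map each project to the list of its members.
members_by_project = defaultdict(list)
for project, member in zip(project_id, member_id):
    members_by_project[project].append(member)

# For every ordered pair within a project, record the second as a co-member of the first.
co_members = defaultdict(set)
for members in members_by_project.values():
    for a, b in permutations(members, 2):
        co_members[a].add(b)

# Sort keys and values so the printed result is deterministic.
result = {member: sorted(others) for member, others in sorted(co_members.items())}
print(result)
# {'A': ['B', 'C', 'D'], 'B': ['A', 'C', 'D'], 'C': ['A', 'B'], 'D': ['A', 'B']}
```

Adding to a set is a constant-time operation on average, whereas `pair not in combination` scans the whole list each time.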