I have a DataFrame:
import pandas as pd

df = pd.DataFrame({
    'exam': [
        'French', 'English', 'German', 'Russian', 'Russian',
        'German', 'German', 'French', 'English', 'French'
    ],
    'student': [
        'john', 'ted', 'jason', 'marc', 'peter', 'bob',
        'robert', 'david', 'nik', 'kevin'
    ]
})
print(df)
exam student
0 French john
1 English ted
2 German jason
3 Russian marc
4 Russian peter
5 German bob
6 German robert
7 French david
8 English nik
9 French kevin
Does anyone know how to create a new DataFrame with two columns, 'student' and 'shared_exam_with'?
I should get something like:
student shared_exam_with
0 john david
1 john kevin
2 ted nik
3 jason bob
4 jason robert
5 marc peter
6 peter marc
7 bob jason
8 bob robert
9 robert jason
10 robert bob
11 david john
12 david kevin
13 nik ted
14 kevin john
15 kevin david
For example: john takes French... and so do david and kevin!
Any ideas? Thanks in advance!
Answer 0 (score: 6)
Self merge
df.merge(
df, on='exam',
suffixes=['', '_shared_with']
).query('student != student_shared_with')
exam student student_shared_with
1 French john david
2 French john kevin
3 French david john
5 French david kevin
6 French kevin john
7 French kevin david
10 English ted nik
11 English nik ted
14 German jason bob
15 German jason robert
16 German bob jason
18 German bob robert
19 German robert jason
20 German robert bob
23 Russian marc peter
24 Russian peter marc
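To get exactly the two columns asked for in the question, the merged frame can be trimmed and renamed. This is only a sketch built on the self merge above; the `pairs` and `result` names are mine, and the row order may differ between pandas versions, so I make no claim about it.

```python
import pandas as pd

df = pd.DataFrame({
    'exam': ['French', 'English', 'German', 'Russian', 'Russian',
             'German', 'German', 'French', 'English', 'French'],
    'student': ['john', 'ted', 'jason', 'marc', 'peter', 'bob',
                'robert', 'david', 'nik', 'kevin'],
})

# Self merge on 'exam', then drop rows that pair a student with themselves.
pairs = (
    df.merge(df, on='exam', suffixes=('', '_shared_with'))
      .query('student != student_shared_with')
)

# Keep only the two requested columns and rename to match the question.
result = (
    pairs[['student', 'student_shared_with']]
    .rename(columns={'student_shared_with': 'shared_exam_with'})
    .reset_index(drop=True)
)
print(result)
```

The `reset_index(drop=True)` is there only to replace the gappy index left behind by `query` with a clean 0..n-1 one.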
Self join
d1 = df.set_index('exam')
d1.join(
d1, rsuffix='_shared_with'
).query('student != student_shared_with')
student student_shared_with
exam
English ted nik
English nik ted
French john david
French john kevin
French david john
French david kevin
French kevin john
French kevin david
German jason bob
German jason robert
German bob jason
German bob robert
German robert jason
German robert bob
Russian marc peter
Russian peter marc
itertools.permutations + groupby
from itertools import permutations as perm
cols = ['student', 'student_shared_with']
df.groupby('exam').student.apply(
lambda x: pd.DataFrame(list(perm(x, 2)), columns=cols)
).reset_index(drop=True)
student student_shared_with
0 ted nik
1 nik ted
2 john david
3 john kevin
4 david john
5 david kevin
6 kevin john
7 kevin david
8 jason bob
9 jason robert
10 bob jason
11 bob robert
12 robert jason
13 robert bob
14 marc peter
15 peter marc
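If the `apply` with a lambda feels opaque, the same idea can be spelled out with a plain loop over the groups. This is a hedged variant, not a replacement for the code above; `rows` and `result` are names I made up for the illustration.

```python
from itertools import permutations

import pandas as pd

df = pd.DataFrame({
    'exam': ['French', 'English', 'German', 'Russian', 'Russian',
             'German', 'German', 'French', 'English', 'French'],
    'student': ['john', 'ted', 'jason', 'marc', 'peter', 'bob',
                'robert', 'david', 'nik', 'kevin'],
})

# For each exam, permutations(students, 2) yields every ordered pair of
# distinct students in that group, so (john, david) and (david, john)
# both appear.
rows = [
    pair
    for _, students in df.groupby('exam')['student']
    for pair in permutations(students, 2)
]

result = pd.DataFrame(rows, columns=['student', 'shared_exam_with'])
print(result)
```

Swapping `permutations` for `itertools.combinations` would list each pair only once, if that is what you need.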
Answer 1 (score: 4)
One approach is:
import numpy as np
import pandas as pd

cross = pd.crosstab(df['student'], df['exam'])
res = cross.dot(cross.T)
res.where(np.triu(res, k=1).astype('bool')).stack()
Out:
student student
bob jason 1.0
robert 1.0
david john 1.0
kevin 1.0
jason robert 1.0
john kevin 1.0
marc peter 1.0
nik ted 1.0
dtype: float64
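The dot product produces a binary co-occurrence matrix (binary here because each student sits exactly one exam; in general the entries count the exams two students share). To avoid listing the same pair twice, I filter with where and stack. The index of the resulting Series gives the pairs of students who share an exam.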
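To make the co-occurrence idea concrete, here is a small sketch that builds the matrix and reads the pairs straight off its upper triangle with NumPy, rather than relying on `stack` to drop the masked cells. The names `co`, `upper` and `pairs` are mine, and this is one way to do it, not necessarily what the answer intended.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'exam': ['French', 'English', 'German', 'Russian', 'Russian',
             'German', 'German', 'French', 'English', 'French'],
    'student': ['john', 'ted', 'jason', 'marc', 'peter', 'bob',
                'robert', 'david', 'nik', 'kevin'],
})

# Indicator matrix: one row per student, one column per exam.
cross = pd.crosstab(df['student'], df['exam'])

# students x students matrix; entry (a, b) counts the exams a and b share.
co = cross.dot(cross.T)

# Zero out the diagonal and the lower triangle so each pair appears once.
upper = np.triu(co.to_numpy(), k=1)

# Positions of the remaining non-zero entries are the sharing pairs.
i, j = np.nonzero(upper)
pairs = pd.DataFrame({
    'student': co.index[i],
    'shared_exam_with': co.columns[j],
})
print(pairs)
```

Note that this gives each pair once (bob/jason but not jason/bob), the same as the `where` + `stack` version, so it is not the symmetric listing the question shows.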
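Answer 2 (score: 2)
This would be a single step in SQL, but here it takes two: (1) merge the DataFrame with itself (on exam), and (2) drop the rows where student == student_shared_with (because a student doesn't 'share' an exam with themselves).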
df2 = pd.merge(
df, df, how='outer', on='exam', suffixes=['', '_shared_with']).drop('exam', axis=1)
df2 = df2.loc[df2.student != df2.student_shared_with]
student student_shared_with
1 john david
2 john kevin
3 david john
5 david kevin
6 kevin john
7 kevin david
10 ted nik
11 nik ted
14 jason bob
15 jason robert
16 bob jason
18 bob robert
19 robert jason
20 robert bob
23 marc peter
24 peter marc
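For comparison, the one-step SQL version mentioned above might look roughly like this. It is a sketch using Python's built-in `sqlite3` with an in-memory database; the table name `exams` is hypothetical and only there for the illustration.

```python
import sqlite3

rows = [
    ('French', 'john'), ('English', 'ted'), ('German', 'jason'),
    ('Russian', 'marc'), ('Russian', 'peter'), ('German', 'bob'),
    ('German', 'robert'), ('French', 'david'), ('English', 'nik'),
    ('French', 'kevin'),
]

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE exams (exam TEXT, student TEXT)')  # hypothetical table
conn.executemany('INSERT INTO exams VALUES (?, ?)', rows)

# The self join and the "not with themselves" filter fit in one statement.
query = """
SELECT a.student, b.student AS shared_exam_with
FROM exams AS a
JOIN exams AS b
  ON a.exam = b.exam AND a.student <> b.student
"""
result = conn.execute(query).fetchall()
conn.close()

for student, shared_exam_with in result:
    print(student, shared_exam_with)
```

The join condition carries both parts of the pandas version: `a.exam = b.exam` is the merge, and `a.student <> b.student` is the row filter.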