DataFrame

Time: 2017-04-12 19:58:45

Tags: python pandas group-by pycharm pandas-groupby

I have a DataFrame:

import pandas as pd

df = pd.DataFrame({
    'exam': [
        'French', 'English', 'German', 'Russian', 'Russian',
        'German', 'German', 'French', 'English', 'French'
    ],
    'student': [
        'john', 'ted', 'jason', 'marc', 'peter', 'bob',
        'robert', 'david', 'nik', 'kevin'
    ]
})

print(df)

              exam   student   
    0       French    john     
    1       English   ted        
    2       German    jason         
    3       Russian   marc         
    4       Russian   peter         
    5       German    bob         
    6       German    robert         
    7       French    david         
    8       English   nik          
    9       French    kevin         

Does anyone know how to create a new DataFrame with two columns, "student" and "shared_exam_with"?

I should get something like this:

                student   shared_exam_with      
        0       john       david                   
        1       john       kevin            
        2       ted        nik                    
        3       jason      bob                 
        4       jason      robert                   
        5       marc       peter              
        6       peter      marc             
        7       bob        jason                    
        8       bob        robert                    
        9       robert     jason                      
       10       robert     bob                   
       11       david      john             
       12       david      kevin                      
       13       nik        ted                     
       14       kevin      john                     
       15       kevin      david                   

For example: john takes French... and so do david and kevin!

Any ideas? Thanks in advance!

3 answers:

Answer 0 (score: 6)

Self merge

df.merge(
    df, on='exam',
    suffixes=['', '_shared_with']
).query('student != student_shared_with')

       exam student student_shared_with
1    French    john               david
2    French    john               kevin
3    French   david                john
5    French   david               kevin
6    French   kevin                john
7    French   kevin               david
10  English     ted                 nik
11  English     nik                 ted
14   German   jason                 bob
15   German   jason              robert
16   German     bob               jason
18   German     bob              robert
19   German  robert               jason
20   German  robert                 bob
23  Russian    marc               peter
24  Russian   peter                marc
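
The merge result above still carries the exam column and the merge's row labels. As a minimal sketch, assuming the goal is exactly the two-column layout from the question, the same self merge can be trimmed and renamed. The variable name pairs is only illustrative, and the row order may differ from the question's listing depending on the pandas version:

```python
import pandas as pd

df = pd.DataFrame({
    'exam': ['French', 'English', 'German', 'Russian', 'Russian',
             'German', 'German', 'French', 'English', 'French'],
    'student': ['john', 'ted', 'jason', 'marc', 'peter', 'bob',
                'robert', 'david', 'nik', 'kevin'],
})

pairs = (
    df.merge(df, on='exam', suffixes=('', '_shared_with'))
      # drop the rows that pair a student with themselves
      .query('student != student_shared_with')
      # keep only the two student columns
      .drop(columns='exam')
      # use the column name requested in the question
      .rename(columns={'student_shared_with': 'shared_exam_with'})
      # renumber the rows 0..n-1
      .reset_index(drop=True)
)

print(pairs)
```

Each pair appears in both directions (john/david and david/john), which matches the expected output in the question.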

Self join

d1 = df.set_index('exam')
d1.join(
    d1, rsuffix='_shared_with'
).query('student != student_shared_with')

        student student_shared_with
exam                               
English     ted                 nik
English     nik                 ted
French     john               david
French     john               kevin
French    david                john
French    david               kevin
French    kevin                john
French    kevin               david
German    jason                 bob
German    jason              robert
German      bob               jason
German      bob              robert
German   robert               jason
German   robert                 bob
Russian    marc               peter
Russian   peter                marc

itertools.permutations + groupby

from itertools import permutations as perm

cols = ['student', 'student_shared_with']
df.groupby('exam').student.apply(
    lambda x: pd.DataFrame(list(perm(x, 2)), columns=cols)
).reset_index(drop=True)

   student student_shared_with
0      ted                 nik
1      nik                 ted
2     john               david
3     john               kevin
4    david                john
5    david               kevin
6    kevin                john
7    kevin               david
8    jason                 bob
9    jason              robert
10     bob               jason
11     bob              robert
12  robert               jason
13  robert                 bob
14    marc               peter
15   peter                marc

Answer 1 (score: 4)

One way would be:

import numpy as np

cross = pd.crosstab(df['student'], df['exam'])
res = cross.dot(cross.T)
res.where(np.triu(res, k=1).astype('bool')).stack()
Out: 
student  student
bob      jason      1.0
         robert     1.0
david    john       1.0
         kevin      1.0
jason    robert     1.0
john     kevin      1.0
marc     peter      1.0
nik      ted        1.0
dtype: float64

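The dot product produces a co-occurrence matrix, which is binary here because each student takes exactly one exam. To avoid repeating the same pair, I keep only the entries above the diagonal (np.triu with k=1) and filter with where and stack. The index of the resulting Series holds the pairs of students who share an exam.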
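
To make the co-occurrence idea concrete, here is a rough sketch that builds the same matrix and then reads the pairs off it in both directions, as the question's expected output lists them. It swaps where/stack for a boolean mask with np.nonzero, which is a choice made for this sketch and not the answer's own method; the names co, mask, and pairs are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'exam': ['French', 'English', 'German', 'Russian', 'Russian',
             'German', 'German', 'French', 'English', 'French'],
    'student': ['john', 'ted', 'jason', 'marc', 'peter', 'bob',
                'robert', 'david', 'nik', 'kevin'],
})

# 0/1 indicator table: one row per student, one column per exam
cross = pd.crosstab(df['student'], df['exam'])

# students x students matrix: entry (a, b) counts the exams a and b share
co = cross.dot(cross.T)

# keep positive entries, excluding the diagonal (a student with themselves)
mask = (co.to_numpy() > 0) & ~np.eye(len(co), dtype=bool)
rows, cols = np.nonzero(mask)

pairs = pd.DataFrame({
    'student': co.index[rows],
    'shared_exam_with': co.columns[cols],
})

print(pairs)
```

Because the full off-diagonal mask is used instead of the upper triangle, every pair shows up twice, once per direction.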

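Answer 2 (score: 2)

This would be one step in SQL, but here it takes two: (1) merge the DataFrame with itself (on exam), and (2) drop the rows where student == student_shared_with (because a student doesn't share an exam with themselves).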

df2 = pd.merge(
    df, df, how='outer', on='exam', suffixes=['', '_shared_with']).drop('exam', axis=1)
df2 = df2.loc[df2.student != df2.student_shared_with]

   student student_shared_with
1     john               david
2     john               kevin
3    david                john
5    david               kevin
6    kevin                john
7    kevin               david
10     ted                 nik
11     nik                 ted
14   jason                 bob
15   jason              robert
16     bob               jason
18     bob              robert
19  robert               jason
20  robert                 bob
23    marc               peter
24   peter                marc
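
For comparison, the "one step in SQL" remark could look roughly like the sketch below, which runs a self join in an in-memory SQLite database through Python's standard sqlite3 module. The table name exams and the aliases a and b are assumptions made for this illustration, not something the answer specifies:

```python
import sqlite3

# same data as the question's DataFrame, as (exam, student) rows
rows = [
    ('French', 'john'), ('English', 'ted'), ('German', 'jason'),
    ('Russian', 'marc'), ('Russian', 'peter'), ('German', 'bob'),
    ('German', 'robert'), ('French', 'david'), ('English', 'nik'),
    ('French', 'kevin'),
]

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE exams (exam TEXT, student TEXT)')
conn.executemany('INSERT INTO exams VALUES (?, ?)', rows)

# the join condition does both steps at once:
# same exam, and not the same student
sql = """
SELECT a.student, b.student AS shared_exam_with
FROM exams AS a
JOIN exams AS b
  ON a.exam = b.exam AND a.student <> b.student
"""
pairs = conn.execute(sql).fetchall()
conn.close()

for student, shared_exam_with in pairs:
    print(student, shared_exam_with)
```

The query has no ORDER BY, so the row order is left to the database.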