Pandas DataFrame slicing based on a logical condition?

Time: 2016-09-22 18:24:31

Tags: python mysql pandas

I have this DataFrame named data:

       Subjects  Professor  StudentID
8     Chemistry       Jane        999
1     Chemistry       Jane       3455
0     Chemistry     Joseph       1234
2       History       Jane       3455
6       History      Smith        323
7       History      Smith        999
3   Mathematics        Doe      56767
10  Mathematics   Einstein       3455
5       Physics   Einstein       2834
9       Physics      Smith        323
4       Physics      Smith        999

I want to run this query: "professors who teach at least two classes that have two or more of the same students". The expected output:

Smith: Physics, History, 323, 999

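I am familiar with SQL and could do this easily there, but I am still a beginner in Python. How can I produce this output in Python? Another idea is to convert this DataFrame into a SQL database and run the query through a SQL interface from Python. Is there a way to do that?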
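On the SQL route mentioned above, here is a minimal sketch, assuming an in-memory SQLite database is acceptable. It writes the frame with DataFrame.to_sql and reads the result back with pandas.read_sql_query. The table name data, the column aliases, and the self-join query are illustrative choices, not something taken from the thread.

```python
import sqlite3

import pandas as pd

# Rebuild the frame from the question.
data = pd.DataFrame(
    {
        "Subjects": ["Chemistry", "Chemistry", "Chemistry", "History",
                     "History", "History", "Mathematics", "Mathematics",
                     "Physics", "Physics", "Physics"],
        "Professor": ["Jane", "Jane", "Joseph", "Jane", "Smith", "Smith",
                      "Doe", "Einstein", "Einstein", "Smith", "Smith"],
        "StudentID": [999, 3455, 1234, 3455, 323, 999,
                      56767, 3455, 2834, 323, 999],
    },
    index=[8, 1, 0, 2, 6, 7, 3, 10, 5, 9, 4],
)

# Assumption: an in-memory SQLite database stands in for a real SQL server.
conn = sqlite3.connect(":memory:")
data.to_sql("data", conn, index=False)

# Self-join: pair up two different classes of the same professor that
# contain the same student, then keep pairs sharing at least two students.
query = """
SELECT a.Professor,
       a.Subjects AS subject_1,
       b.Subjects AS subject_2,
       COUNT(DISTINCT a.StudentID) AS shared_students
FROM data AS a
JOIN data AS b
  ON a.Professor = b.Professor
 AND a.StudentID = b.StudentID
 AND a.Subjects < b.Subjects
GROUP BY a.Professor, a.Subjects, b.Subjects
HAVING COUNT(DISTINCT a.StudentID) >= 2
"""
result = pd.read_sql_query(query, conn)
conn.close()

print(result)  # one row: Smith, History, Physics, 2 shared students
```

The condition a.Subjects < b.Subjects keeps each pair of classes once and prevents a class from being joined with itself.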

2 Answers:

Answer 0: (score: 2)

students_and_subjects = df.groupby(
                               ['Professor', 'Subjects']
                           ).StudentID.nunique().ge(2) \
                          .groupby(level='Professor').sum().ge(2)

df[df.Professor.map(students_and_subjects)]
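To make the chained expression easier to follow, the same steps can be written out one at a time. This is a sketch that rebuilds the question's frame as df; the intermediate names students_per_class, big_enough, qualifying and mask are introduced here for illustration only.

```python
import pandas as pd

# Rebuild the frame from the question.
df = pd.DataFrame(
    {
        "Subjects": ["Chemistry", "Chemistry", "Chemistry", "History",
                     "History", "History", "Mathematics", "Mathematics",
                     "Physics", "Physics", "Physics"],
        "Professor": ["Jane", "Jane", "Joseph", "Jane", "Smith", "Smith",
                      "Doe", "Einstein", "Einstein", "Smith", "Smith"],
        "StudentID": [999, 3455, 1234, 3455, 323, 999,
                      56767, 3455, 2834, 323, 999],
    },
    index=[8, 1, 0, 2, 6, 7, 3, 10, 5, 9, 4],
)

# Step 1: number of distinct students in each (Professor, Subjects) class.
students_per_class = df.groupby(["Professor", "Subjects"]).StudentID.nunique()

# Step 2: flag the classes that have at least two distinct students.
big_enough = students_per_class.ge(2)

# Step 3: per professor, count the flagged classes and require two or more.
qualifying = big_enough.groupby(level="Professor").sum().ge(2)

# Step 4: map the per-professor flag back onto the rows and filter.
mask = df.Professor.map(qualifying)
print(df[mask])  # only Smith's four rows remain
```

Note that this counts classes with two or more students per professor. It does not check that the students are the same across those classes, although for this data the two conditions select the same professor.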


Answer 1: (score: 1)

Solution with filter and value_counts:

df1 = df.groupby('Professor').filter(lambda x: (len(x.Subjects) > 1) & 
                                               ((x.StudentID.value_counts() > 1).sum() > 1))
print (df1)
  Subjects Professor  StudentID
6  History     Smith        323
7  History     Smith        999
9  Physics     Smith        323
4  Physics     Smith        999

Or with duplicated:

df1 = df.groupby('Professor').filter(lambda x: (len(x.Subjects) > 1) & 
                                               (x.StudentID.duplicated().sum() > 1))
print (df1)
  Subjects Professor  StudentID
6  History     Smith        323
7  History     Smith        999
9  Physics     Smith        323
4  Physics     Smith        999
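Both filters above work at the professor level: they check for repeated StudentID values without looking at which classes the repeats come from. If the requirement is read literally, as two classes of the same professor sharing at least two students, a stricter check compares the student sets of every pair of classes. The following is only a sketch of that reading, not the method used in either answer; shared_pairs and min_shared are hypothetical names.

```python
from itertools import combinations

import pandas as pd

# Rebuild the frame from the question.
df = pd.DataFrame(
    {
        "Subjects": ["Chemistry", "Chemistry", "Chemistry", "History",
                     "History", "History", "Mathematics", "Mathematics",
                     "Physics", "Physics", "Physics"],
        "Professor": ["Jane", "Jane", "Joseph", "Jane", "Smith", "Smith",
                      "Doe", "Einstein", "Einstein", "Smith", "Smith"],
        "StudentID": [999, 3455, 1234, 3455, 323, 999,
                      56767, 3455, 2834, 323, 999],
    },
    index=[8, 1, 0, 2, 6, 7, 3, 10, 5, 9, 4],
)


def shared_pairs(frame, min_shared=2):
    """Hypothetical helper: list (professor, class, class, shared IDs)."""
    rows = []
    for professor, group in frame.groupby("Professor"):
        # Set of student IDs for each class taught by this professor.
        students = {
            subject: set(sub.StudentID)
            for subject, sub in group.groupby("Subjects")
        }
        # Compare every pair of classes and keep those sharing enough IDs.
        for (s1, ids1), (s2, ids2) in combinations(students.items(), 2):
            common = ids1 & ids2
            if len(common) >= min_shared:
                rows.append((professor, s1, s2, sorted(int(i) for i in common)))
    return rows


print(shared_pairs(df))
# [('Smith', 'History', 'Physics', [323, 999])]
```

Jane is excluded under this reading because her Chemistry and History classes share only student 3455.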

Edit by comment:

You can return custom output from a custom function and then remove the NaN rows with Series.dropna:

df.StudentID = df.StudentID.astype(str)

def f(x):
    if (len(x.Subjects) > 1) & (x.StudentID.duplicated().sum() > 1):
        return ', '.join((x.Subjects.unique().tolist() + x.StudentID.unique().tolist()))

df1 = df.groupby('Professor').apply(f).dropna()
df1 = df1.index.to_series() + ': ' + df1
print (df1)
Professor
Smith    Smith: History, Physics, 323, 999
dtype: object