Pandas DataFrame: group by, then filter based on menu or text options

Date: 2019-01-30 06:49:11

Tags: python pandas dataframe group-by

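My DataFrame looks like the one below. My goal is to group by student name with pandas.groupby and find out which activities each student did between "english" and "hindi".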

import pandas as pd

data = {'StudentId': ['AAdams','AAdams','AAdams','AAdams','AAdams','AAdams',
                      'BBrooks','BBrooks','BBrooks','BBrooks','BBrooks'],
        'activity': ['came school','english','lunch','hindi','sports','left school',
                     'came school','english','read','hindi','left school'],
        'month': [11,11,11,11,12,12,12,12,12,1,1]}

df = pd.DataFrame(data)
df

   StudentId     activity  month
0     AAdams  came school     11
1     AAdams      english     11
2     AAdams        lunch     11
3     AAdams        hindi     11
4     AAdams       sports     12
5     AAdams  left school     12
6    BBrooks  came school     12
7    BBrooks      english     12
8    BBrooks         read     12
9    BBrooks        hindi      1
10   BBrooks  left school      1

What I have tried so far:

df[df.activity.eq('english').groupby(df.StudentId).cumsum().astype(bool)].reset_index(drop=True)

or

(df.groupby('StudentId')
   .apply(lambda x: x.loc[(x.activity == 'english').idxmax():, :])
   .reset_index(drop=True))

That trims my DataFrame from "english" onward, after which I can continue with the following code:

df.groupby('StudentId').head(5)

The final DataFrame or output should contain only the activities between activity = english and activity = hindi:

  StudentId activity  month
1    AAdams  english     11
2    AAdams    lunch     11
3    AAdams    hindi     11
7   BBrooks  english     12
8   BBrooks     read     12
9   BBrooks    hindi      1

1 answer:

Answer 0 (score: 2)

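A solution for the case where, within each group, english comes first and hindi second.

Create two boolean masks with DataFrameGroupBy.cumsum: the first marks every row from english onward, and the second needs the rows in reverse order, obtained by indexing with [::-1], so that it marks every row from hindi backward. Then chain the masks with & and filter by boolean indexing: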

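# True from the first 'english' row onward within each student
m1 = df['activity'].eq('english').astype(int).groupby(df['StudentId']).cumsum().gt(0)
# rows reversed, so True from the 'hindi' row back to the start of each student
m2 = df['activity'].eq('hindi').astype(int).iloc[::-1].groupby(df['StudentId']).cumsum().gt(0)

# & aligns both masks on the index before combining them
df = df[m1 & m2]
print(df)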
  StudentId activity  month
1    AAdams  english     11
2    AAdams    lunch     11
3    AAdams    hindi     11
7   BBrooks  english     12
8   BBrooks     read     12
9   BBrooks    hindi      1
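
For reuse, the two-mask approach above can be wrapped in a function. The following is only a sketch: the helper name rows_between and its parameters are hypothetical and not part of pandas or of the answer above, and it assumes that the start value appears before the end value within each group.

```python
import pandas as pd

data = {
    'StudentId': ['AAdams'] * 6 + ['BBrooks'] * 5,
    'activity': ['came school', 'english', 'lunch', 'hindi', 'sports', 'left school',
                 'came school', 'english', 'read', 'hindi', 'left school'],
    'month': [11, 11, 11, 11, 12, 12, 12, 12, 12, 1, 1],
}
df = pd.DataFrame(data)


def rows_between(frame, start, end, key='StudentId', col='activity'):
    """Hypothetical helper: keep rows from `start` through `end` in each `key` group."""
    # Running count of `start` rows per group; > 0 from the first `start` onward.
    after_start = frame[col].eq(start).astype(int).groupby(frame[key]).cumsum().gt(0)

    # Same idea on the reversed frame, so the count runs from the end of each
    # group; the final [::-1] restores the original row order.
    rev = frame.iloc[::-1]
    before_end = rev[col].eq(end).astype(int).groupby(rev[key]).cumsum().gt(0).iloc[::-1]

    # A row is kept only if it lies at or after `start` and at or before `end`.
    return frame[after_start & before_end]


result = rows_between(df, 'english', 'hindi')
print(result)
```

With this logic, a student who has no english row or no hindi row contributes no rows to the result, because one of the two masks is False for that entire group.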