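My dataframe looks like the one below. I want to use pandas.groupby on the student name and find which activities each student did between "english" and "hindi".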
import pandas as pd

data = {'StudentId': ['AAdams', 'AAdams', 'AAdams', 'AAdams', 'AAdams', 'AAdams',
                      'BBrooks', 'BBrooks', 'BBrooks', 'BBrooks', 'BBrooks'],
        'activity': ['came school', 'english', 'lunch', 'hindi', 'sports', 'left school',
                     'came school', 'english', 'read', 'hindi', 'left school'],
        'month': [11, 11, 11, 11, 12, 12, 12, 12, 12, 1, 1]}
df = pd.DataFrame(data)
print(df)
    StudentId     activity  month
0      AAdams  came school     11
1      AAdams      english     11
2      AAdams        lunch     11
3      AAdams        hindi     11
4      AAdams       sports     12
5      AAdams  left school     12
6     BBrooks  came school     12
7     BBrooks      english     12
8     BBrooks         read     12
9     BBrooks        hindi      1
10    BBrooks  left school      1
What I have tried so far is
df[df.activity.eq('english').groupby(df.StudentId).cumsum().gt(0)].reset_index(drop=True)
or
df.groupby('StudentId').apply(lambda x: x.loc[(x.activity == 'english').idxmax():, :]).reset_index(drop=True)
to cut the dataframe down from the first "english" row onward, after which I can continue with the following code:
df.groupby('StudentId').head(5)
The final dataframe (the output) should contain only the activities between activity = english and activity = hindi:
   StudentId  activity  month
1     AAdams   english     11
2     AAdams     lunch     11
3     AAdams     hindi     11
7    BBrooks   english     12
8    BBrooks      read     12
9    BBrooks     hindi      1
Answer 0 (score: 2)
This solution assumes that, within each group, english comes first and hindi comes second.
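Create boolean masks with DataFrameGroupBy.cumsum. The first mask marks every row from english onward; for the second, reverse the row order by indexing with [::-1] so that the cumulative sum runs from the bottom up and marks every row up to hindi. Chain the two masks with & and filter with boolean indexing: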
m1 = df['activity'].eq('english').astype(int).groupby(df['StudentId']).cumsum().gt(0)
m2 = df['activity'].eq('hindi').astype(int).iloc[::-1].groupby(df['StudentId']).cumsum().gt(0)
df = df[m1 & m2]
print (df)
   StudentId  activity  month
1     AAdams   english     11
2     AAdams     lunch     11
3     AAdams     hindi     11
7    BBrooks   english     12
8    BBrooks      read     12
9    BBrooks     hindi      1
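
To make the two masks easier to inspect, the same idea can be wrapped in a small function. This is only a sketch under the assumptions stated above: the name between_activities and its start/end parameters are hypothetical and do not come from pandas or from the question, and the data is the sample frame from the question.

```python
import pandas as pd

data = {
    'StudentId': ['AAdams'] * 6 + ['BBrooks'] * 5,
    'activity': ['came school', 'english', 'lunch', 'hindi', 'sports', 'left school',
                 'came school', 'english', 'read', 'hindi', 'left school'],
    'month': [11, 11, 11, 11, 12, 12, 12, 12, 12, 1, 1],
}
df = pd.DataFrame(data)


def between_activities(frame, start='english', end='hindi'):
    """Hypothetical helper: keep rows from the first `start` to the last `end` per student."""
    key = frame['StudentId']
    # True from the first `start` row onward within each student.
    after_start = frame['activity'].eq(start).astype(int).groupby(key).cumsum().gt(0)
    # Reverse the rows so the cumulative sum runs bottom-up:
    # True for every row up to and including the last `end` row.
    before_end = frame['activity'].eq(end).astype(int).iloc[::-1].groupby(key).cumsum().gt(0)
    # `&` aligns on the index, so the reversed order of `before_end` does not matter.
    return frame[after_start & before_end]


result = between_activities(df)
print(result)
```

Note that if a student's hindi row came before the english row, both masks would be False for different stretches of rows and that student would simply drop out of the result, which may or may not be what you want.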