             A                    B       C    D   E
0   2002-01-12  2018-04-25 10:00:00    John   19  19
1   2002-01-12  2018-04-25 11:00:00    John    6  25
2   2002-01-13  2018-04-25 09:00:00    John    5  30
3   2002-01-13  2018-04-25 11:00:00    John  -25   5
4   2002-01-14  2018-04-25 11:00:00    John    1   6
5   2002-01-14  2018-04-25 12:00:00    John   44  50
6   2002-01-25  2018-04-25 11:00:00  George   18  18
7   2002-01-25  2018-04-25 12:00:00  George   12  30
8   2002-01-26  2018-04-25 11:00:00  George   -8  22
9   2002-01-26  2018-04-25 12:00:00  George  -10  12
10  2002-01-27  2018-04-25 10:00:00  George   13  25
11  2002-01-27  2018-04-25 11:00:00  George    1  26
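import pandas as pd

df['A'] = pd.to_datetime(df['A'])  # parse the date column
df['B'] = pd.to_datetime(df['B'])  # parse the timestamp column
df["E"] = df.groupby("C")["D"].cumsum()  # running total of D within each C group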
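I want to select one row for each C group, with the following conditions:
the first row where E >= 20 and B == 11:00:00, considering only rows from the second day in A onward within each C group;
otherwise, the first row of the C group. The output should be: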
            A                    B       C   D   E
0  2002-01-12  2018-04-25 10:00:00    John  19  19
8  2002-01-26  2018-04-25 11:00:00  George  -8  22
I tried:
def eleven(g):
    cond = g[g.B==time(11)].E.ge(20)
    if cond.any():
        return g[cond].iloc[0]
    else:
        return g.iloc[1]

r = df.groupby('C', as_index=False).apply(eleven)
Answer 0 (score: 1)
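I think you need to change the condition: chain the hour check with the comparison on E, and to start from the second A group (day) use factorize on A and keep only codes > 0: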
def eleven(g):
    # Parenthesize the factorize comparison: & binds tighter than >
    cond = (g.B.dt.hour == 11) & g.E.ge(20) & (pd.factorize(g.A)[0] > 0)
    if cond.any():
        return g[cond].iloc[0]
    else:
        return g.iloc[0]

r = df.groupby('C', as_index=False, sort=False).apply(eleven)
print(r)
            A                    B       C   D   E
0  2002-01-12  2018-04-25 10:00:00    John  19  19
1  2002-01-26  2018-04-25 11:00:00  George  -8  22
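
For anyone who wants to reproduce this end to end, here is a minimal self-contained sketch. It rebuilds the sample frame from the table in the question and applies the same three-part condition, but it swaps groupby.apply for a plain loop over the groups that collects index labels, so the original row labels and dtypes are kept. The names pick_label, labels and result are illustrative only and do not come from the original post.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "A": ["2002-01-12"] * 2 + ["2002-01-13"] * 2 + ["2002-01-14"] * 2
         + ["2002-01-25"] * 2 + ["2002-01-26"] * 2 + ["2002-01-27"] * 2,
    "B": ["2018-04-25 10:00:00", "2018-04-25 11:00:00", "2018-04-25 09:00:00",
          "2018-04-25 11:00:00", "2018-04-25 11:00:00", "2018-04-25 12:00:00",
          "2018-04-25 11:00:00", "2018-04-25 12:00:00", "2018-04-25 11:00:00",
          "2018-04-25 12:00:00", "2018-04-25 10:00:00", "2018-04-25 11:00:00"],
    "C": ["John"] * 6 + ["George"] * 6,
    "D": [19, 6, 5, -25, 1, 44, 18, 12, -8, -10, 13, 1],
})
df["A"] = pd.to_datetime(df["A"])
df["B"] = pd.to_datetime(df["B"])
df["E"] = df.groupby("C")["D"].cumsum()


def pick_label(g):
    """Return the index label of the row to keep for one C group (hypothetical helper)."""
    # factorize numbers the distinct dates in order of appearance:
    # 0 for the first day in the group, 1 for the second, and so on.
    day_no = pd.factorize(g["A"])[0]
    cond = (g["B"].dt.hour == 11) & (g["E"] >= 20) & (day_no > 0)
    # First matching row if there is one, otherwise the first row of the group.
    return g[cond].index[0] if cond.any() else g.index[0]


# sort=False keeps the groups in order of appearance (John, then George).
labels = [pick_label(g) for _, g in df.groupby("C", sort=False)]
result = df.loc[labels]
print(result)
```

With the sample data this keeps rows 0 and 8, which matches the output requested in the question. John's only 11:00 row with E >= 20 falls on his first day, so the fallback to the first row of the group applies.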