Question

            A                   B       C   D   E
0  2002-01-12 2018-04-25 10:00:00    John  19  19
1  2002-01-12 2018-04-25 11:00:00    John   6  25
2  2002-01-13 2018-04-25 09:00:00    John   5  30
3  2002-01-13 2018-04-25 11:00:00    John -25   5
4  2002-01-14 2018-04-25 11:00:00    John   1   6
5  2002-01-14 2018-04-25 12:00:00    John  44  50
6  2002-01-25 2018-04-25 11:00:00  George  18  18
7  2002-01-25 2018-04-25 12:00:00  George  12  30
8  2002-01-26 2018-04-25 11:00:00  George  -8  22
9  2002-01-26 2018-04-25 12:00:00  George -10  12
10 2002-01-27 2018-04-25 10:00:00  George  13  25
11 2002-01-27 2018-04-25 11:00:00  George   1  26

df['A'] = pd.to_datetime(df['A'])
df['B'] = pd.to_datetime(df['B'])
df["E"] = df.groupby("C")["D"].cumsum()
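
For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the frame shown above. The values are copied from the table; how the data was originally created is not shown in the question, so the construction itself is an assumption.

```python
import pandas as pd

# Values copied from the table in the question; the construction is illustrative.
df = pd.DataFrame({
    "A": ["2002-01-12"] * 2 + ["2002-01-13"] * 2 + ["2002-01-14"] * 2
         + ["2002-01-25"] * 2 + ["2002-01-26"] * 2 + ["2002-01-27"] * 2,
    "B": ["2018-04-25 10:00:00", "2018-04-25 11:00:00", "2018-04-25 09:00:00",
          "2018-04-25 11:00:00", "2018-04-25 11:00:00", "2018-04-25 12:00:00",
          "2018-04-25 11:00:00", "2018-04-25 12:00:00", "2018-04-25 11:00:00",
          "2018-04-25 12:00:00", "2018-04-25 10:00:00", "2018-04-25 11:00:00"],
    "C": ["John"] * 6 + ["George"] * 6,
    "D": [19, 6, 5, -25, 1, 44, 18, 12, -8, -10, 13, 1],
})

# Parse both columns as datetimes, then compute the running total of D per C group.
df["A"] = pd.to_datetime(df["A"])
df["B"] = pd.to_datetime(df["B"])
df["E"] = df.groupby("C")["D"].cumsum()
```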

I want to select one row for each C group, under the following conditions:

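Take the first row where E>=20 and B==11:00:00, considering only rows from the second day (by A) of each C group onward.
If no row satisfies that condition, take the first row of the C group.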

The output should be:

            A                   B       C   D   E
0  2002-01-12 2018-04-25 10:00:00    John  19  19
8  2002-01-26 2018-04-25 11:00:00  George  -8  22

I tried:

def eleven(g):
    cond = g[g.B==time(11)].E.ge(20)
    if cond.any():
        return g[cond].iloc[0]
    else:
        return g.iloc[1]

r = df.groupby('C', as_index=False).apply(eleven)

Answer 1

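I think you need to change the condition: chain the hour check with the comparison on E, and use factorize on A to number the days within each group, keeping only codes > 0 (that is, from the second day onward):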

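def eleven(g):
    # 11:00 row, cumulative sum >= 20, and not on the group's first day
    # (pd.factorize numbers the distinct dates in A as 0, 1, 2, ...)
    cond = (g.B.dt.hour == 11) & g.E.ge(20) & (pd.factorize(g.A)[0] > 0)
    if cond.any():
        return g[cond].iloc[0]
    else:
        return g.iloc[0]

r = df.groupby('C', as_index=False, sort=False).apply(eleven)
print(r)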
           A                   B       C   D   E
0 2002-01-12 2018-04-25 10:00:00    John  19  19
1 2002-01-26 2018-04-25 11:00:00  George  -8  22
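
As an alternative sketch (not part of the original answer), the same selection can be written without groupby.apply by building one boolean mask over the whole frame and then taking the first hit per group. The names day_no, hits and fallback are illustrative, and the frame is rebuilt from the question's table so the snippet runs on its own. Like the function above, it tests only the hour of B, so any time from 11:00 to 11:59 would count as a match.

```python
import pandas as pd

# Frame rebuilt from the question's table (values copied, construction assumed).
df = pd.DataFrame({
    "A": pd.to_datetime(
        ["2002-01-12"] * 2 + ["2002-01-13"] * 2 + ["2002-01-14"] * 2
        + ["2002-01-25"] * 2 + ["2002-01-26"] * 2 + ["2002-01-27"] * 2),
    "B": pd.to_datetime(["2018-04-25 %02d:00:00" % h
                         for h in [10, 11, 9, 11, 11, 12, 11, 12, 11, 12, 10, 11]]),
    "C": ["John"] * 6 + ["George"] * 6,
    "D": [19, 6, 5, -25, 1, 44, 18, 12, -8, -10, 13, 1],
})
df["E"] = df.groupby("C")["D"].cumsum()

# Day number within each C group: 0 for the group's first date, 1 for the next, ...
day_no = df.groupby("C")["A"].transform(lambda s: pd.factorize(s)[0])

# Row-wise condition: 11 o'clock, running total >= 20, not the group's first day.
mask = (df["B"].dt.hour == 11) & df["E"].ge(20) & (day_no > 0)

# First matching row per group, where one exists.
hits = df[mask].drop_duplicates("C")
# Groups without any match fall back to their first row.
fallback = df[~df["C"].isin(hits["C"])].drop_duplicates("C")

r = pd.concat([hits, fallback]).sort_index()
print(r)
```

Unlike the apply version, this keeps the original row labels, so the result can be compared directly with the expected output in the question.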

Condition based on the number of days per group
