Pandas DataFrame: group by a column, sort by DateTime, and truncate each group by a condition

Time: 2019-10-23 14:48:25

Tags: python pandas dataframe

I have a Pandas DataFrame that looks something like this:

import pandas as pd

df = pd.DataFrame([['a', '2018-09-30 00:03:00', 'that is a glove'],
                   ['b', '2018-09-30 00:04:00', 'this is a glove'],
                   ['b', '2018-09-30 00:09:00', 'she has ball'],
                   ['a', '2018-09-30 00:05:00', 'they have a ball'],
                   ['a', '2018-09-30 00:01:00', 'she has a shoe'],
                   ['c', '2018-09-30 00:04:00', 'I have a baseball'],
                   ['a', '2018-09-30 00:02:00', 'this is a hat'],
                   ['a', '2018-09-30 00:06:00', 'he has no helmet'],
                   ['b', '2018-09-30 00:11:00', 'he has no shoe'],
                   ['c', '2018-09-30 00:02:00', 'we have a hat'],
                   ['a', '2018-09-30 00:04:00', 'we have a baseball'],
                   ['c', '2018-09-30 00:06:00', 'they have no glove'],
                   ], 
                  columns=['id', 'time', 'equipment'])


   id                 time           equipment
0   a  2018-09-30 00:03:00     that is a glove
1   b  2018-09-30 00:04:00     this is a glove
2   b  2018-09-30 00:09:00        she has ball
3   a  2018-09-30 00:05:00    they have a ball
4   a  2018-09-30 00:01:00      she has a shoe
5   c  2018-09-30 00:04:00   I have a baseball
6   a  2018-09-30 00:02:00       this is a hat
7   a  2018-09-30 00:06:00    he has no helmet
8   b  2018-09-30 00:11:00      he has no shoe
9   c  2018-09-30 00:02:00       we have a hat
10  a  2018-09-30 00:04:00  we have a baseball
11  c  2018-09-30 00:06:00  they have no glove

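What I want to do is group by id, sort by time within each group, and then return every row of each group up to and including the first row that contains the word "ball". So far, I can do the grouping and sorting: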

df.groupby('id').apply(lambda x: x.sort_values(['time'], ascending=True)).reset_index(drop=True)


   id                 time           equipment
0   a  2018-09-30 00:01:00      she has a shoe
1   a  2018-09-30 00:02:00       this is a hat
2   a  2018-09-30 00:03:00     that is a glove
3   a  2018-09-30 00:04:00  we have a baseball
4   a  2018-09-30 00:05:00    they have a ball
5   a  2018-09-30 00:06:00    he has no helmet
6   b  2018-09-30 00:04:00     this is a glove
7   b  2018-09-30 00:09:00        she has ball
8   b  2018-09-30 00:11:00      he has no shoe
9   c  2018-09-30 00:02:00       we have a hat
10  c  2018-09-30 00:04:00   I have a baseball
11  c  2018-09-30 00:06:00  they have no glove

However, I want the output to look like this:

   id                 time           equipment
0   a  2018-09-30 00:01:00      she has a shoe
1   a  2018-09-30 00:02:00       this is a hat
2   a  2018-09-30 00:03:00     that is a glove
3   a  2018-09-30 00:04:00  we have a baseball
4   a  2018-09-30 00:05:00    they have a ball
6   b  2018-09-30 00:04:00     this is a glove
7   b  2018-09-30 00:09:00        she has ball

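Note that no rows are returned for group c, because it has no row containing the word "ball". Group c does contain the word "baseball", but that is not the match we are looking for. Likewise, note that group a does not stop at the "baseball" row, because we only stop at the "ball" row. What is the most efficient way to do this, in terms of both speed and memory?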
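The difference between "ball" and "baseball" comes down to matching a whole word rather than a substring. A minimal sketch, assuming a regular expression with `\b` word boundaries is an acceptable way to express that requirement (the small `phrases` Series is made up for illustration from the example data):

```python
import pandas as pd

# Illustrative phrases taken from the example data
phrases = pd.Series(['we have a baseball', 'they have a ball', 'she has ball'])

# A plain substring search also matches "baseball"
substring_hits = phrases.str.contains('ball', regex=False)

# \b marks a word boundary, so only the standalone word "ball" matches
word_hits = phrases.str.contains(r'\bball\b', regex=True)

print(substring_hits.tolist())  # [True, True, True]
print(word_hits.tolist())       # [False, True, True]
```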

2 answers:

Answer 0: (score: 0)

Continuing from what you have already done:

# sort the rows by time within each id group
new_df = df.groupby('id').apply(lambda x: x.sort_values(['time'], ascending=True)).reset_index(drop=True)

# flag rows that contain `ball` as a whole word (so `baseball` does not match)
new_df["mask"] = new_df["equipment"].str.contains(r"\bball\b",regex=True)

# per group: keep rows up to and including the first flagged row, or no rows if nothing is flagged
result = (new_df.groupby("id").apply(lambda x : x.iloc[:x.reset_index(drop=True)["mask"].
                                     idxmax()+1 if x["mask"].any() else 0])
          .reset_index(drop=True).drop("mask",axis=1))

print (result)

Output:

  id                 time           equipment
0  a  2018-09-30 00:01:00      she has a shoe
1  a  2018-09-30 00:02:00       this is a hat
2  a  2018-09-30 00:03:00     that is a glove
3  a  2018-09-30 00:04:00  we have a baseball
4  a  2018-09-30 00:05:00    they have a ball
5  b  2018-09-30 00:04:00     this is a glove
6  b  2018-09-30 00:09:00        she has ball
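The `.any()` guard in the slice matters because of how `idxmax` behaves on a boolean Series. A small sketch of that behavior on two made-up flag Series (the names `with_match` and `without_match` are hypothetical and not part of the answer):

```python
import pandas as pd

# Boolean flags for one sorted group, with a fresh 0..n-1 index
with_match = pd.Series([False, False, True, False, True])
without_match = pd.Series([False, False, False])

# idxmax returns the label of the first maximum, i.e. the first True
first_true = with_match.idxmax()
print(first_true)  # 2

# With no True at all, every value ties at False and idxmax returns the first label
print(without_match.idxmax())  # 0

# Hence the guard: slice up to first_true + 1 only when a match exists
rows_to_keep = first_true + 1 if with_match.any() else 0
print(rows_to_keep)  # 3
```

Without the guard, a group with no match would be sliced as `iloc[:1]` and would wrongly keep its first row.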

Answer 1: (score: 0)

Here is my approach:

# as the final expected output is sorted by id and time
# we start by doing so to the whole data
df = df.sort_values(['id','time'])

# mark the rows containing the word `ball`
has_ball = (df.equipment.str.contains(r'\bball\b') )

# cumulative number of rows with `ball` in the group
s = has_ball.groupby(df['id']).cumsum()

# keep only groups that have at least one row with `ball`
valid_groups = has_ball.groupby(df['id']).transform('max')

print(df[valid_groups &
         (s.eq(0) |              # rows before the first row containing `ball`
         (s.eq(1) & has_ball)    # the first row containing `ball`
         )
        ]  
     )

Output:

   id                time           equipment
4   a 2018-09-30 00:01:00      she has a shoe
6   a 2018-09-30 00:02:00       this is a hat
0   a 2018-09-30 00:03:00     that is a glove
10  a 2018-09-30 00:04:00  we have a baseball
3   a 2018-09-30 00:05:00    they have a ball
1   b 2018-09-30 00:04:00     this is a glove
2   b 2018-09-30 00:09:00        she has ball
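For reuse, the same cumulative-count idea can be wrapped in a function. This is only a sketch under a few assumptions: the function name `truncate_at_first_match`, its parameters, and the reduced `sample` frame are hypothetical, and the time column is assumed to hold ISO-formatted strings (or datetimes) so that sorting it gives chronological order.

```python
import pandas as pd

def truncate_at_first_match(df, pattern, group_col='id', time_col='time', text_col='equipment'):
    """Keep, per group, the rows up to and including the first match of `pattern`."""
    ordered = df.sort_values([group_col, time_col])
    hit = ordered[text_col].str.contains(pattern, regex=True)
    # Running count of matching rows within each group
    seen = hit.groupby(ordered[group_col]).cumsum()
    # Rows before the first match have seen == 0; the first match has seen == 1 and hit == True
    keep = seen.eq(0) | (seen.eq(1) & hit)
    # Groups without any match are dropped entirely
    has_any = hit.groupby(ordered[group_col]).transform('max')
    return ordered[keep & has_any]

# Reduced, made-up subset of the example data
sample = pd.DataFrame(
    [['a', '2018-09-30 00:05:00', 'they have a ball'],
     ['a', '2018-09-30 00:04:00', 'we have a baseball'],
     ['a', '2018-09-30 00:06:00', 'he has no helmet'],
     ['b', '2018-09-30 00:09:00', 'she has ball'],
     ['b', '2018-09-30 00:04:00', 'this is a glove'],
     ['b', '2018-09-30 00:11:00', 'he has no shoe'],
     ['c', '2018-09-30 00:04:00', 'I have a baseball']],
    columns=['id', 'time', 'equipment'])

result = truncate_at_first_match(sample, r'\bball\b')
print(result)
```

Because everything is expressed as one sort plus vectorized group operations, no Python-level `apply` over groups is needed, which is usually the cheaper route on large frames.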