I have a Pandas DataFrame that looks like this:
import pandas as pd
df = pd.DataFrame([['a', '2018-09-30 00:03:00', 'that is a glove'],
['b', '2018-09-30 00:04:00', 'this is a glove'],
['b', '2018-09-30 00:09:00', 'she has ball'],
['a', '2018-09-30 00:05:00', 'they have a ball'],
['a', '2018-09-30 00:01:00', 'she has a shoe'],
['c', '2018-09-30 00:04:00', 'I have a baseball'],
['a', '2018-09-30 00:02:00', 'this is a hat'],
['a', '2018-09-30 00:06:00', 'he has no helmet'],
['b', '2018-09-30 00:11:00', 'he has no shoe'],
['c', '2018-09-30 00:02:00', 'we have a hat'],
['a', '2018-09-30 00:04:00', 'we have a baseball'],
['c', '2018-09-30 00:06:00', 'they have no glove'],
],
columns=['id', 'time', 'equipment'])
id time equipment
0 a 2018-09-30 00:03:00 that is a glove
1 b 2018-09-30 00:04:00 this is a glove
2 b 2018-09-30 00:09:00 she has ball
3 a 2018-09-30 00:05:00 they have a ball
4 a 2018-09-30 00:01:00 she has a shoe
5 c 2018-09-30 00:04:00 I have a baseball
6 a 2018-09-30 00:02:00 this is a hat
7 a 2018-09-30 00:06:00 he has no helmet
8 b 2018-09-30 00:11:00 he has no shoe
9 c 2018-09-30 00:02:00 we have a hat
10 a 2018-09-30 00:04:00 we have a baseball
11 c 2018-09-30 00:06:00 they have no glove
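What I want to do is group by id, sort each group by time, and then return every row in the group up to and including the first row that contains the word "ball". So far, I can do the grouping and sorting: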
df.groupby('id').apply(lambda x: x.sort_values(['time'], ascending=True)).reset_index(drop=True)
id time equipment
0 a 2018-09-30 00:01:00 she has a shoe
1 a 2018-09-30 00:02:00 this is a hat
2 a 2018-09-30 00:03:00 that is a glove
3 a 2018-09-30 00:04:00 we have a baseball
4 a 2018-09-30 00:05:00 they have a ball
5 a 2018-09-30 00:06:00 he has no helmet
6 b 2018-09-30 00:04:00 this is a glove
7 b 2018-09-30 00:09:00 she has ball
8 b 2018-09-30 00:11:00 he has no shoe
9 c 2018-09-30 00:02:00 we have a hat
10 c 2018-09-30 00:04:00 I have a baseball
11 c 2018-09-30 00:06:00 they have no glove
However, I would like the output to look like this:
id time equipment
0 a 2018-09-30 00:01:00 she has a shoe
1 a 2018-09-30 00:02:00 this is a hat
2 a 2018-09-30 00:03:00 that is a glove
3 a 2018-09-30 00:04:00 we have a baseball
4 a 2018-09-30 00:05:00 they have a ball
6 b 2018-09-30 00:04:00 this is a glove
7 b 2018-09-30 00:09:00 she has ball
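Note that no rows are returned for group c, because it has no row containing the word "ball". Group c does contain the word "baseball", but that is not the match we are looking for. Likewise, note that group a does not stop at the "baseball" row; it stops at the "ball" row. What is the most efficient way to do this, in terms of both speed and memory?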
Answer 0: (score: 0)
Continuing from what you have done:
# Sort each group by time, as in the question.
new_df = df.groupby('id').apply(lambda x: x.sort_values(['time'], ascending=True)).reset_index(drop=True)
# Flag rows that contain "ball" as a whole word (\b keeps "baseball" from matching).
new_df["mask"] = new_df.groupby("id").apply(lambda x: x["equipment"].str.contains(r"\bball\b", regex=True)).reset_index(drop=True)
# Per group: slice up to and including the first flagged row, or take no rows if nothing matched.
result = (new_df.groupby("id").apply(lambda x: x.iloc[:x.reset_index(drop=True)["mask"].idxmax() + 1
                                                       if x["equipment"].str.contains(r"\bball\b", regex=True).any() else 0])
          .reset_index(drop=True).drop("mask", axis=1))
print(result)
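Output (the group d row at the end does not appear in the question's sample data; presumably it comes from an extra test row):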
id time equipment
0 a 2018-09-30 00:01:00 she has a shoe
1 a 2018-09-30 00:02:00 this is a hat
2 a 2018-09-30 00:03:00 that is a glove
3 a 2018-09-30 00:04:00 we have a baseball
4 a 2018-09-30 00:05:00 they have a ball
5 b 2018-09-30 00:04:00 this is a glove
6 b 2018-09-30 00:09:00 she has ball
7 d 2018-09-30 00:06:00 I have a ball
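The per-group slice in this answer relies on two details that are easy to miss: the \b word boundaries in the pattern, and the any() guard around idxmax. The following is a minimal sketch of both on two hand-made groups; the variable names (with_ball, without_ball, and so on) are made up for illustration and are not part of the answer.

```python
import pandas as pd

# One group, already sorted by time (a stand-in for part of group a).
with_ball = pd.Series(["we have a baseball", "they have a ball", "he has no helmet"])
# A group with no standalone "ball" (a stand-in for part of group c).
without_ball = pd.Series(["we have a hat", "I have a baseball"])

# \b marks a word boundary: "ball" matches, "baseball" does not.
pattern = r"\bball\b"

mask_hit = with_ball.str.contains(pattern, regex=True)
mask_miss = without_ball.str.contains(pattern, regex=True)

# idxmax returns the label of the first maximum, i.e. the first True.
# Labels equal positions here because both Series use the default 0..n-1 index,
# which is why the answer calls reset_index(drop=True) inside the lambda.
first_hit = mask_hit.idxmax()
# On an all-False mask idxmax still returns the first label,
# so by itself it cannot signal "no match".
first_miss = mask_miss.idxmax()

# Guarding with any() yields an empty slice for a group without a match.
kept_hit = with_ball.iloc[: first_hit + 1 if mask_hit.any() else 0]
kept_miss = without_ball.iloc[: first_miss + 1 if mask_miss.any() else 0]

print(kept_hit.tolist())
print(kept_miss.tolist())
```

Without the guard, the second group would keep its first row even though nothing in it matched.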
Answer 1: (score: 0)
Here is my approach:
# as the final expected output is sorted by id and time
# we start by doing so to the whole data
df = df.sort_values(['id','time'])
# mark the rows containing the word `ball`
has_ball = (df.equipment.str.contains(r'\bball\b') )
# cumulative number of rows with `ball` in the group
s = has_ball.groupby(df['id']).cumsum()
# there must be row with `ball`
valid_groups = has_ball.groupby(df['id']).transform('max')
print(df[valid_groups &
(s.eq(0) | # not containing `ball` before the first
(s.eq(1) & has_ball) # first row containing `ball`
)
]
)
Output:
id time equipment
4 a 2018-09-30 00:01:00 she has a shoe
6 a 2018-09-30 00:02:00 this is a hat
0 a 2018-09-30 00:03:00 that is a glove
10 a 2018-09-30 00:04:00 we have a baseball
3 a 2018-09-30 00:05:00 they have a ball
1 b 2018-09-30 00:04:00 this is a glove
2 b 2018-09-30 00:09:00 she has ball
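For reuse, the same steps can be wrapped in a small function. This is only a sketch of the approach above: the function name rows_until_first_match and its parameter names are hypothetical, and na=False is an added assumption so that missing text counts as "no match".

```python
import pandas as pd


def rows_until_first_match(df, pattern, group_col="id", time_col="time", text_col="equipment"):
    """Keep each group's rows, in time order, up to and including its first match.

    Groups without any matching row are dropped entirely.
    """
    # Timestamps in 'YYYY-MM-DD HH:MM:SS' form sort correctly even as strings.
    ordered = df.sort_values([group_col, time_col])
    # na=False treats missing text as "no match" (an assumption, see above).
    hit = ordered[text_col].str.contains(pattern, regex=True, na=False)
    # Running count of matches within each group.
    seen = hit.astype(int).groupby(ordered[group_col]).cumsum()
    # True on every row of a group that has at least one match.
    group_has_hit = hit.groupby(ordered[group_col]).transform("max")
    # Rows before the first match (seen == 0) plus the first match itself.
    keep = group_has_hit & (seen.eq(0) | (seen.eq(1) & hit))
    return ordered[keep]


sample = pd.DataFrame(
    [
        ["a", "2018-09-30 00:05:00", "they have a ball"],
        ["a", "2018-09-30 00:04:00", "we have a baseball"],
        ["a", "2018-09-30 00:06:00", "he has no helmet"],
        ["b", "2018-09-30 00:09:00", "she has ball"],
        ["b", "2018-09-30 00:04:00", "this is a glove"],
        ["b", "2018-09-30 00:11:00", "he has no shoe"],
        ["c", "2018-09-30 00:04:00", "I have a baseball"],
    ],
    columns=["id", "time", "equipment"],
)

result = rows_until_first_match(sample, r"\bball\b")
print(result)
```

Everything here is a column-wise operation with no Python-level lambda per group, which is usually the cheaper route in pandas, though actual speed and memory use are worth measuring on your own data.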