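How do I select the rows that come before the first occurrence of a value in a column?
I have a dataset of user activity with timestamps, recorded as follows: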
import pandas as pd
df = pd.DataFrame([{'user_id':1, 'date':'2017-09-01', 'activity':'Open'},
{'user_id':1, 'date':'2017-09-02', 'activity':'Open'},
{'user_id':1, 'date':'2017-09-03', 'activity':'Open'},
{'user_id':1, 'date':'2017-09-04', 'activity':'Click'},
{'user_id':1, 'date':'2017-09-05', 'activity':'Purchase'},
{'user_id':1, 'date':'2017-09-06', 'activity':'Open'},
{'user_id':1, 'date':'2017-09-07', 'activity':'Open'},
{'user_id':2, 'date':'2017-09-04', 'activity':'Open'},
{'user_id':2, 'date':'2017-09-06', 'activity':'Purchase'}])
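Is there a way to select, for each user, all rows in the DataFrame that occur before that user's first purchase? For this example, the expected output would be: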
df = pd.DataFrame([{'user_id':1, 'date':'2017-09-01', 'activity':'Open'},
{'user_id':1, 'date':'2017-09-02', 'activity':'Open'},
{'user_id':1, 'date':'2017-09-03', 'activity':'Open'},
{'user_id':1, 'date':'2017-09-04', 'activity':'Click'},
{'user_id':2, 'date':'2017-09-04', 'activity':'Open'}])
Answer 0 (score: 3)
Use groupby to find, for each user, all rows above the row where that user first makes a purchase. Then index with the resulting mask.
df
activity date user_id
0 Open 2017-09-01 1
1 Open 2017-09-02 1
2 Open 2017-09-03 1
3 Click 2017-09-04 1
4 Purchase 2017-09-05 1
5 Open 2017-09-06 1
6 Open 2017-09-07 1
7 Open 2017-09-04 2
8 Purchase 2017-09-06 2
# group_keys=False keeps the original row index, so the mask aligns with df
m = df.groupby('user_id', group_keys=False).activity\
.apply(lambda x: (x == 'Purchase').cumsum()) == 0
df[m]
activity date user_id
0 Open 2017-09-01 1
1 Open 2017-09-02 1
2 Open 2017-09-03 1
3 Click 2017-09-04 1
7 Open 2017-09-04 2
If your actual data is not sorted the way it is here, you can use df.sort_values to make sure it is:
df = df.sort_values(['user_id', 'date'])
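To illustrate why the sort matters, here is a minimal sketch on deliberately shuffled rows. The data is made up for illustration, and the pd.to_datetime conversion is an extra step not present in the answer; ISO-formatted strings like these already sort correctly, but parsing is safer for other date formats.

```python
import pandas as pd

# Hypothetical, deliberately shuffled data (illustration only).
df = pd.DataFrame([
    {'user_id': 2, 'date': '2017-09-06', 'activity': 'Purchase'},
    {'user_id': 1, 'date': '2017-09-05', 'activity': 'Purchase'},
    {'user_id': 1, 'date': '2017-09-01', 'activity': 'Open'},
    {'user_id': 2, 'date': '2017-09-04', 'activity': 'Open'},
    {'user_id': 1, 'date': '2017-09-06', 'activity': 'Open'},
])

# Parse the strings so the sort is chronological, not lexicographic.
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['user_id', 'date'])

# Count purchases seen so far per user; 0 means "before the first purchase".
m = df['activity'].eq('Purchase').groupby(df['user_id']).cumsum().eq(0)
result = df[m]
print(result)
```

Without the sort, the cumulative count would follow row order rather than time order, so rows dated before a purchase but stored after it would be dropped.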
Answer 1 (score: 3)
You can avoid the explicit apply:
In [2862]: df[df['activity'].eq('Purchase').groupby(df['user_id']).cumsum().eq(0)]
Out[2862]:
activity date user_id
0 Open 2017-09-01 1
1 Open 2017-09-02 1
2 Open 2017-09-03 1
3 Click 2017-09-04 1
7 Open 2017-09-04 2
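The same one-liner can be wrapped in a small reusable function. This is only a sketch: rows_before_first is a hypothetical helper name, not part of pandas, and it assumes the rows are already in order within each group. The sample data adds a user who never purchases, to show that all of that user's rows are kept.

```python
import pandas as pd

def rows_before_first(df, group_col, value_col, value):
    """Return rows preceding the first `value` in `value_col` per group.

    Hypothetical helper; assumes rows are already ordered within each group.
    """
    hits = df[value_col].eq(value)               # True on matching rows
    seen = hits.groupby(df[group_col]).cumsum()  # matches seen so far, per group
    return df[seen.eq(0)]                        # keep rows with no match yet

# Made-up data: user 3 never purchases.
df = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 3],
    'activity': ['Open', 'Purchase', 'Open', 'Open', 'Purchase', 'Open'],
})
before = rows_before_first(df, 'user_id', 'activity', 'Purchase')
print(before)
```

Because the cumulative count stays at 0 for a group with no match, such groups pass through unfiltered; whether that is desirable depends on the use case.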
Answer 2 (score: 1)
Build a mask using groupby with DataFrameGroupBy.cumsum, convert it to bool, invert the condition, and filter by boolean indexing:
#if necessary
#df = df.sort_values(['user_id', 'date'])
df = df[~df['activity'].eq('Purchase').groupby(df['user_id']).cumsum().astype(bool)]
print (df)
user_id date activity
0 1 2017-09-01 Open
1 1 2017-09-02 Open
2 1 2017-09-03 Open
3 1 2017-09-04 Click
7 2 2017-09-04 Open
Details:
print (~df['activity'].eq('Purchase').groupby(df['user_id']).cumsum().astype(bool))
0 True
1 True
2 True
3 True
4 False
5 False
6 False
7 True
8 False
Name: activity, dtype: bool
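The mask can also be built one step at a time, which makes each intermediate value easy to inspect. This is a sketch on a smaller made-up frame; the variable names is_purchase, running, and mask are chosen here for illustration.

```python
import pandas as pd

# Made-up data, already ordered within each user.
df = pd.DataFrame({
    'user_id': [1, 1, 1, 1, 2, 2],
    'activity': ['Open', 'Click', 'Purchase', 'Open', 'Open', 'Purchase'],
})

# Step 1: flag the purchase rows.
is_purchase = df['activity'].eq('Purchase')

# Step 2: running count of purchases within each user.
running = is_purchase.groupby(df['user_id']).cumsum()

# Step 3: a nonzero count becomes True; inverting keeps only rows before the first purchase.
mask = ~running.astype(bool)

print(pd.DataFrame({'is_purchase': is_purchase, 'running': running, 'mask': mask}))
```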