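I have a DataFrame of clickstream data with roughly 350k rows and 12 columns. Below is a simplified excerpt of what the data looks like. For each device, I want to return all rows that occur after the time of the purchase.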
type_ deviceid campaign_ time
Click device_1 Campaign_1 11/16/16 14:07
Purchase device_1 Campaign_6 11/18/16 16:26
Click device_1 Campaign_5 11/19/16 14:17
Click device_1 Campaign_1 11/19/16 14:30
Click device_2 Campaign_4 11/6/16 7:00
Purchase device_2 Campaign_2 11/9/16 21:56
Click device_2 Campaign_2 11/10/16 5:17
Click device_2 Campaign_3 11/12/16 19:19
I tried using .loc to extract the result I need, but to no avail. Can anyone point me in the right direction or let me know what I need to do?
Answer 0 (score: 2)
First, define a function that filters the rows within each group, for example:
def after_purchase(rows):
    # boolean mask indicating rows which are purchases
    is_purchase = rows.type_ == 'Purchase'
    # select timestamps from all purchases
    purchase_times = rows.loc[is_purchase, 'time']
    # grab the first (earliest) purchase timestamp
    first_purchase_time = purchase_times.min()
    # return all rows which occurred after the first purchase
    return rows.loc[rows.time > first_purchase_time]
Then group the DataFrame by device ID and apply the function to each group.
df.groupby('deviceid').apply(after_purchase)
            type_  deviceid   campaign_                time
deviceid
device_1 2  Click  device_1  Campaign_5 2016-11-19 14:17:00
         3  Click  device_1  Campaign_1 2016-11-19 14:30:00
device_2 6  Click  device_2  Campaign_2 2016-11-10 05:17:00
         7  Click  device_2  Campaign_3 2016-11-12 19:19:00
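The output above shows proper timestamps, so the time column was presumably converted to datetime before filtering; comparing the raw strings would not order the rows chronologically. Here is a minimal end-to-end sketch that rebuilds the sample from the question, parses the timestamps, and produces the same rows without a per-group Python function. It is an alternative to the apply approach, not a replacement for it. The month/day/year format string and the intermediate names purchase_time, first_purchase and after are assumptions made for this illustration.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "type_": ["Click", "Purchase", "Click", "Click",
              "Click", "Purchase", "Click", "Click"],
    "deviceid": ["device_1"] * 4 + ["device_2"] * 4,
    "campaign_": ["Campaign_1", "Campaign_6", "Campaign_5", "Campaign_1",
                  "Campaign_4", "Campaign_2", "Campaign_2", "Campaign_3"],
    "time": ["11/16/16 14:07", "11/18/16 16:26", "11/19/16 14:17", "11/19/16 14:30",
             "11/6/16 7:00", "11/9/16 21:56", "11/10/16 5:17", "11/12/16 19:19"],
})

# Assumption: the column holds text in month/day/two-digit-year format.
df["time"] = pd.to_datetime(df["time"], format="%m/%d/%y %H:%M")

# Keep the timestamp only on purchase rows (NaT everywhere else) ...
purchase_time = df["time"].where(df["type_"] == "Purchase")
# ... then broadcast each device's earliest purchase time to all of its rows.
first_purchase = purchase_time.groupby(df["deviceid"]).transform("min")

# Rows strictly later than the device's first purchase. Devices with no
# purchase get NaT, the comparison is False, and they are dropped.
after = df[df["time"] > first_purchase]
print(after)
```

Because transform returns a Series aligned with the original index, the result keeps the flat index of df rather than the two-level index that groupby().apply() produces. On a frame of a few hundred thousand rows this form may also be faster, though that is worth measuring on the real data.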