I have a pandas DataFrame that looks like this:
userID timestamp other_data
1 2017-06-19 17:14:00.000 foo
1 2017-06-19 19:16:00.000 bar
1 2017-06-19 23:26:00.000 ter
1 2017-06-20 01:16:00.000 lol
2 2017-06-20 12:00:00.000 ter
2 2017-06-20 13:15:00.000 foo
2 2017-06-20 17:15:00.000 bar
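I want to add two columns, time_since_previous_point and time_until_next_point, but of course only between points belonging to the same user. I don't care much about the unit/format for now (as long as I can easily switch between them):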
userID  timestamp                time_since_previous  time_until_next  other_data
1       2017-06-19 17:14:00.000                       02:02:00.000     foo
1       2017-06-19 19:16:00.000  02:02:00.000         04:10:00.000     bar
1       2017-06-19 23:26:00.000  04:10:00.000         01:50:00.000     ter
1       2017-06-20 01:16:00.000  01:50:00.000                          lol
2       2017-06-20 12:00:00.000                       01:15:00.000     ter
2       2017-06-20 13:15:00.000  01:15:00.000         04:00:00.000     foo
2       2017-06-20 17:15:00.000  04:00:00.000                          bar
How can I do this? (Empty cells can be empty, NaN, or None, whichever you prefer, given that I will next compute descriptive statistics on time_since_previous and time_until_next.)
Note that I have shown userID as a single column here, but in reality the only way I can identify a user is through a combination of columns (userID + country).
Answer 0: (score: 1)
I think what you are missing is the pandas shift function, together with this answer: Pandas: Shift down values by one row within a group.
Combining the two, you can do the following:
from io import StringIO
import pandas as pd
csv = """userID,timestamp,other_data
1,2017-06-19 17:14:00.000,foo
1,2017-06-19 19:16:00.000,bar
1,2017-06-19 23:26:00.000,ter
1,2017-06-20 01:16:00.000,lol
2,2017-06-20 12:00:00.000,ter
2,2017-06-20 13:15:00.000,foo
2,2017-06-20 17:15:00.000,bar
"""
df = pd.read_csv(StringIO(csv))
This gives:
userID timestamp other_data
0 1 2017-06-19 17:14:00.000 foo
1 1 2017-06-19 19:16:00.000 bar
2 1 2017-06-19 23:26:00.000 ter
3 1 2017-06-20 01:16:00.000 lol
4 2 2017-06-20 12:00:00.000 ter
5 2 2017-06-20 13:15:00.000 foo
6 2 2017-06-20 17:15:00.000 bar
First, you need to convert the timestamp column to datetime:
df['timestamp'] = pd.to_datetime(df.timestamp)
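Then combine the groupby and shift methods, so that each row is compared only with the previous or next row of the same user: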
df['time_since_previous'] = df['timestamp'] - df.groupby('userID')['timestamp'].shift(1)
df['time_until_next'] = df.groupby('userID')['timestamp'].shift(-1) - df['timestamp']
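The question mentions that a user is really identified by a combination of columns (userID + country). Here is a minimal sketch of that case. The country column is not in the sample data, so its values below are made up for illustration. The rows are sorted first, because shift relies on row order within each group:

```python
import pandas as pd

# Hypothetical data: the 'country' column and its values are assumptions.
df = pd.DataFrame({
    "userID": [1, 1, 1, 1, 1],
    "country": ["FR", "US", "FR", "US", "FR"],
    "timestamp": pd.to_datetime([
        "2017-06-19 19:16:00",
        "2017-06-20 12:00:00",
        "2017-06-19 17:14:00",
        "2017-06-20 13:15:00",
        "2017-06-19 23:26:00",
    ]),
})

keys = ["userID", "country"]

# shift() follows row order, so sort by time within each group first.
df = df.sort_values(keys + ["timestamp"]).reset_index(drop=True)

# Group on the composite key instead of userID alone.
grouped = df.groupby(keys)["timestamp"]
prev_ts = grouped.shift(1)
next_ts = grouped.shift(-1)

df["time_since_previous"] = df["timestamp"] - prev_ts
df["time_until_next"] = next_ts - df["timestamp"]
```

Passing a list of column names to groupby treats each distinct (userID, country) pair as its own group, so the first and last point of every pair end up as NaT.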
In the end, this gives you what you want:
userID timestamp other_data time_since_previous time_until_next
0 1 2017-06-19 17:14:00 foo NaT 02:02:00
1 1 2017-06-19 19:16:00 bar 02:02:00 04:10:00
2 1 2017-06-19 23:26:00 ter 04:10:00 01:50:00
3 1 2017-06-20 01:16:00 lol 01:50:00 NaT
4 2 2017-06-20 12:00:00 ter NaT 01:15:00
5 2 2017-06-20 13:15:00 foo 01:15:00 04:00:00
6 2 2017-06-20 17:15:00 bar 04:00:00 NaT
The only thing left is to deal with the NaT values.
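For the descriptive statistics mentioned in the question, the NaT values may not need any special treatment, since pandas reductions skip missing values by default. Here is a small sketch on a shortened copy of the sample data. It converts the timedeltas to plain numbers (NaT becomes NaN) before summarising; the variable names are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "userID": [1, 1, 1, 2, 2],
    "timestamp": pd.to_datetime([
        "2017-06-19 17:14:00",
        "2017-06-19 19:16:00",
        "2017-06-19 23:26:00",
        "2017-06-20 12:00:00",
        "2017-06-20 13:15:00",
    ]),
})
ts = df.groupby("userID")["timestamp"]
df["time_since_previous"] = df["timestamp"] - ts.shift(1)

# Timedelta -> float seconds; NaT turns into NaN.
seconds = df["time_since_previous"].dt.total_seconds()

# Switching units: dividing by a Timedelta gives floats in that unit.
hours = df["time_since_previous"] / pd.Timedelta(hours=1)

# describe() ignores NaN, so the first point of each user is simply not counted.
stats = seconds.describe()

# Alternatively, drop the rows that have no previous point.
valid = df.dropna(subset=["time_since_previous"])
```

If a placeholder value is preferred over a missing one, fillna(pd.Timedelta(0)) would also work, but it would skew statistics such as the mean.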