Pandas: check whether a timestamp value is within x minutes of the previous timestamp

Time: 2018-03-07 09:51:58

Tags: python-3.x pandas timestamp pandas-groupby

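My dataframe looks like the one below, with a user ID, a timestamp, and a song name. The timestamp is the time at which the user started playing the song. A session is defined as a run of songs in which each song starts within 20 minutes of the previous song's start time. I need to build a list of the 10 longest sessions, with the following information for each session: the userid, the timestamps of the first and last songs in the session, and the list of songs played in the session, in play order. Can you help?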

         user       timestamp          song
0        user_000001  05-05-09 12:08   The Start of Something
1        user_000001  04-05-09 14:54   My Sharona
2        user_000001  04-05-09 14:52   Caught by the river
3        user_000001  04-05-09 14:42   Swim
19       user_000001  03-05-09 15:56   Cover me
20       user_000001  03-05-09 15:50   Oh Holy Night
1048550  user_000050   25-01-07 8:51   I Hung My Head
1048551  user_000050   25-01-07 8:48   Slider
1048552  user_000050  24-01-07 22:57   Joy
1048553  user_000050  24-01-07 22:53   Crazy Eights
1048554  user_000050  24-01-07 22:48   Steady State
1048555  user_000050  24-01-07 22:42   Maple Leaves (7" Version)

2 answers:

Answer 0: (score: 1)

Without changing the order of the data, we can do the following:

import pandas as pd
from io import StringIO

data = StringIO('''id,user,timestamp,song
0,user_000001,05-05-09 12:08,The Start of Something
1,user_000001,04-05-09 14:54,My Sharona
2,user_000001,04-05-09 14:52,Caught by the river
3,user_000001,04-05-09 14:42,Swim
19,user_000001,03-05-09 15:56,Cover me
20,user_000001,03-05-09 15:50,Oh Holy Night
1048550,user_000050, 25-01-07 8:51,I Hung My Head
1048551,user_000050, 25-01-07 8:48,Slider
1048552,user_000050,24-01-07 22:57,Joy
1048553,user_000050,24-01-07 22:53,Crazy Eights
1048554,user_000050,24-01-07 22:48,Steady State
1048555,user_000050,24-01-07 22:42,Maple Leaves (7" Version)''')

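def time_elapsed(grp, session_length):
    # Work on a copy so the frame passed in is not modified in place.
    grp = grp.copy()
    # Minutes between this row and the row below it (the previous, older play).
    grp['MinsElapsed'] = (grp['timestamp'] - grp['timestamp'].shift(-1)) / pd.Timedelta(minutes=1)
    # Flag gaps longer than session_length, then count them from the oldest row upwards.
    grp['Session'] = (grp['MinsElapsed'] > session_length)[::-1].astype(int).cumsum()[::-1]
    return grp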


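df = pd.read_csv(data)
# The date strings are ambiguous (dd-mm-yy vs mm-dd-yy): in the output below,
# 04-05-09 was read as 2009-04-05 but 25-01-07 as 2007-01-25.
# Pass an explicit format to pd.to_datetime if the data is known to be day-first.
df['timestamp'] = pd.to_datetime(df['timestamp'])

# group_keys=False keeps the original flat index on the result.
df = df.groupby('user', group_keys=False).apply(time_elapsed, session_length=20)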

print(df)

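We group by user and compute the difference, in minutes, between each row and the row below it (.shift(-1)). We then check whether this column is greater than the session length, convert the result to integers, and apply a cumulative sum. Because the times are in descending order, we have to reverse the column before taking the cumulative sum and then reverse it back for this to work.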

This gives us:

         id         user           timestamp                       song  MinsElapsed  Session 
0         0  user_000001 2009-05-05 12:08:00     The Start of Something      43034.0        2 
1         1  user_000001 2009-04-05 14:54:00                 My Sharona          2.0        1 
2         2  user_000001 2009-04-05 14:52:00        Caught by the river         10.0        1 
3         3  user_000001 2009-04-05 14:42:00                       Swim      44566.0        1 
4        19  user_000001 2009-03-05 15:56:00                   Cover me          6.0        0 
5        20  user_000001 2009-03-05 15:50:00              Oh Holy Night          NaN        0 
6   1048550  user_000050 2007-01-25 08:51:00             I Hung My Head          3.0        1 
7   1048551  user_000050 2007-01-25 08:48:00                     Slider        591.0        1 
8   1048552  user_000050 2007-01-24 22:57:00                        Joy          4.0        0 
9   1048553  user_000050 2007-01-24 22:53:00               Crazy Eights          5.0        0 
10  1048554  user_000050 2007-01-24 22:48:00               Steady State          6.0        0 
11  1048555  user_000050 2007-01-24 22:42:00  Maple Leaves (7" Version)          NaN        0 
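
The reverse, cumulative sum, reverse step is easier to follow in isolation. The sketch below applies it to a standalone Series holding the MinsElapsed values shown above for user_000050; the variable names are illustrative and not part of the answer's code.

```python
import pandas as pd

# Gaps in minutes to the next-older play, newest row first.
# The oldest row has no older neighbour, so its gap is NaN.
gaps = pd.Series([3.0, 591.0, 4.0, 5.0, 6.0, float("nan")])

# True where the gap to the older row exceeds the 20-minute threshold,
# i.e. where this row is the first song of a new session.
# NaN compares as False, so the oldest row is not flagged.
starts_session = gaps > 20

# Reverse so the oldest row comes first, count session starts cumulatively,
# then reverse back to the original newest-first order.
session = starts_session[::-1].astype(int).cumsum()[::-1]

print(session.tolist())  # [1, 1, 0, 0, 0, 0]
```

Without the first reversal the cumulative sum would run from the newest row downwards, and the row that opens a session would be numbered differently from the rows that follow it in time.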

Edit

To get the times of the first and last songs in each session, along with the session's length, we can do the following:

session_length = df.groupby(['user', 'Session'])['timestamp'] \
                   .agg(['min', 'max']) \
                   .reset_index()

session_length['Length (mins)'] = (session_length['max'] - session_length['min']) / pd.Timedelta(minutes=1)

This gives us:

          user  Session                 min                 max  Length (mins)
0  user_000001        0 2009-03-05 15:50:00 2009-03-05 15:56:00            6.0
1  user_000001        1 2009-04-05 14:42:00 2009-04-05 14:54:00           12.0
2  user_000001        2 2009-05-05 12:08:00 2009-05-05 12:08:00            0.0
3  user_000050        0 2007-01-24 22:42:00 2007-01-24 22:57:00           15.0
4  user_000050        1 2007-01-25 08:48:00 2007-01-25 08:51:00            3.0
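
The question also asks for the ten longest sessions together with each session's songs in play order, which the summary above does not include. One possible way to get there is sketched below. It assumes that "longest" means the time between the first and last song start, and it uses a small stand-in frame built from the user_000050 rows of the labelled output; the column names first_song, last_song, songs and length_mins are made up for this example.

```python
import pandas as pd

# Stand-in for the labelled frame (user_000050 rows from the output above).
df = pd.DataFrame({
    "user": ["user_000050"] * 6,
    "timestamp": pd.to_datetime([
        "2007-01-25 08:51", "2007-01-25 08:48", "2007-01-24 22:57",
        "2007-01-24 22:53", "2007-01-24 22:48", "2007-01-24 22:42",
    ]),
    "song": ["I Hung My Head", "Slider", "Joy", "Crazy Eights",
             "Steady State", 'Maple Leaves (7" Version)'],
    "Session": [1, 1, 0, 0, 0, 0],
})

# Sort oldest-first so each session's song list comes out in play order.
ordered = df.sort_values(["user", "timestamp"])

# One row per (user, session): first and last start time plus the song list.
summary = (
    ordered.groupby(["user", "Session"])
    .agg(first_song=("timestamp", "min"),
         last_song=("timestamp", "max"),
         songs=("song", list))
    .reset_index()
)
summary["length_mins"] = (
    (summary["last_song"] - summary["first_song"]) / pd.Timedelta(minutes=1)
)

# Ten longest sessions, longest first (fewer rows if fewer sessions exist).
top10 = summary.nlargest(10, "length_mins")
print(top10)
```

If "longest" should instead mean the number of songs played, ranking on the length of the songs list would be the corresponding change.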

Answer 1: (score: 0)

This works, but I have re-sorted the data by user and by ascending timestamp.
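
A minimal sketch of the ascending-order idea follows. It is not necessarily this answerer's own code. The sample rows are modelled on the question's data, with the dates assumed to be day-first, and the names gap, new_session and session are illustrative.

```python
import pandas as pd

# Hypothetical sample modelled on the question's rows (dates read as day-first).
df = pd.DataFrame({
    "user": ["user_000001"] * 3 + ["user_000050"] * 3,
    "timestamp": pd.to_datetime([
        "2009-05-04 14:54", "2009-05-04 14:52", "2009-05-04 14:42",
        "2007-01-25 08:51", "2007-01-25 08:48", "2007-01-24 22:57",
    ]),
    "song": ["My Sharona", "Caught by the river", "Swim",
             "I Hung My Head", "Slider", "Joy"],
})

# Re-sort by user, then by ascending timestamp.
df = df.sort_values(["user", "timestamp"]).reset_index(drop=True)

# Gap to the previous play by the same user (NaT on each user's first row).
gap = df.groupby("user")["timestamp"].diff()

# A session starts on a user's first row or after a gap of more than 20 minutes.
new_session = gap.isna() | (gap > pd.Timedelta(minutes=20))

# A running count of session starts gives an id that is unique across users.
df["session"] = new_session.cumsum()

print(df["session"].tolist())  # [1, 1, 1, 2, 3, 3]
```

With the rows in ascending order, no reversal is needed before the cumulative sum, which is the main difference from the first answer.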
