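My data frame looks like the one below, with a user ID, a timestamp, and a song name. The timestamp is the time at which the user started playing the song. A session is a run of songs in which each song starts within 20 minutes of the start of the previous song. I need to create a list of the top 10 longest sessions, with the following information for each session: userid, the timestamps of the first and last songs in the session, and the list of songs played in the session in play order. Can you help?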
user timestamp song
0 user_000001 05-05-09 12:08 The Start of Something
1 user_000001 04-05-09 14:54 My Sharona
2 user_000001 04-05-09 14:52 Caught by the river
3 user_000001 04-05-09 14:42 Swim
19 user_000001 03-05-09 15:56 Cover me
20 user_000001 03-05-09 15:50 Oh Holy Night
1048550 user_000050 25-01-07 8:51 I Hung My Head
1048551 user_000050 25-01-07 8:48 Slider
1048552 user_000050 24-01-07 22:57 Joy
1048553 user_000050 24-01-07 22:53 Crazy Eights
1048554 user_000050 24-01-07 22:48 Steady State
1048555 user_000050 24-01-07 22:42 Maple Leaves (7" Version)
Answer 0 (score: 1)
Without changing the order of the data, we can do the following:
import pandas as pd
from io import StringIO
data = StringIO('''id,user,timestamp,song
0,user_000001,05-05-09 12:08,The Start of Something
1,user_000001,04-05-09 14:54,My Sharona
2,user_000001,04-05-09 14:52,Caught by the river
3,user_000001,04-05-09 14:42,Swim
19,user_000001,03-05-09 15:56,Cover me
20,user_000001,03-05-09 15:50,Oh Holy Night
1048550,user_000050, 25-01-07 8:51,I Hung My Head
1048551,user_000050, 25-01-07 8:48,Slider
1048552,user_000050,24-01-07 22:57,Joy
1048553,user_000050,24-01-07 22:53,Crazy Eights
1048554,user_000050,24-01-07 22:48,Steady State
1048555,user_000050,24-01-07 22:42,Maple Leaves (7" Version)''')
def time_elapsed(df, session_length):
    # Minutes between each song and the row below it (.shift(-1)), within the same user.
    next_row = df.groupby('user')['timestamp'].shift(-1)
    df['MinsElapsed'] = (df['timestamp'] - next_row) / pd.Timedelta(minutes=1)
    # A gap above the threshold marks a session boundary. Rows are in descending
    # time order, so reverse, take the per-user cumulative sum, and reverse back.
    new_session = (df['MinsElapsed'] > session_length).astype(int)
    df['Session'] = new_session[::-1].groupby(df['user']).cumsum()[::-1]
    return df

df = pd.read_csv(data)
# Timestamps are day-first (dd-mm-yy); strip stray spaces and parse with an explicit format.
df['timestamp'] = pd.to_datetime(df['timestamp'].str.strip(), format='%d-%m-%y %H:%M')
df = time_elapsed(df, session_length=20)
print(df)
We group by user and compute the difference, in minutes, between each row and the row below it (.shift(-1)). We then check whether that gap exceeds the session length, convert the result to an integer, and take a cumulative sum. Because the timestamps are in descending order, we have to reverse the column before taking the cumulative sum and reverse it back afterwards. The timestamps are day-first (dd-mm-yy), so the format is passed to pd.to_datetime explicitly; otherwise 04-05-09 can be read as 5 April instead of 4 May.
This gives us:
         id         user           timestamp                       song  MinsElapsed  Session
0         0  user_000001 2009-05-05 12:08:00     The Start of Something       1274.0        2
1         1  user_000001 2009-05-04 14:54:00                 My Sharona          2.0        1
2         2  user_000001 2009-05-04 14:52:00        Caught by the river         10.0        1
3         3  user_000001 2009-05-04 14:42:00                       Swim       1366.0        1
4        19  user_000001 2009-05-03 15:56:00                   Cover me          6.0        0
5        20  user_000001 2009-05-03 15:50:00              Oh Holy Night          NaN        0
6   1048550  user_000050 2007-01-25 08:51:00             I Hung My Head          3.0        1
7   1048551  user_000050 2007-01-25 08:48:00                     Slider        591.0        1
8   1048552  user_000050 2007-01-24 22:57:00                        Joy          4.0        0
9   1048553  user_000050 2007-01-24 22:53:00               Crazy Eights          5.0        0
10  1048554  user_000050 2007-01-24 22:48:00               Steady State          6.0        0
11  1048555  user_000050 2007-01-24 22:42:00  Maple Leaves (7" Version)          NaN        0
Edit
To get the first and last song times in each session, along with the length of the session, we can do the following:
session_length = (
    df.groupby(['user', 'Session'])['timestamp']
    .agg(['min', 'max'])
    .reset_index()
)
session_length['Length (mins)'] = (session_length['max'] - session_length['min']) / pd.Timedelta(minutes=1)
This gives us:
          user  Session                 min                 max  Length (mins)
0  user_000001        0 2009-05-03 15:50:00 2009-05-03 15:56:00            6.0
1  user_000001        1 2009-05-04 14:42:00 2009-05-04 14:54:00           12.0
2  user_000001        2 2009-05-05 12:08:00 2009-05-05 12:08:00            0.0
3  user_000050        0 2007-01-24 22:42:00 2007-01-24 22:57:00           15.0
4  user_000050        1 2007-01-25 08:48:00 2007-01-25 08:51:00            3.0
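The question also asks for the ten longest sessions together with their songs in play order, which the summary above does not include. Here is a minimal sketch of that last step, assuming a frame shaped like df above (columns user, Session, timestamp, song). The helper name top_sessions, the output column names first, last and songs, and the choice to rank by duration rather than by number of songs are my assumptions, not part of the original answer.

```python
import pandas as pd

def top_sessions(df, n=10):
    """Return the n longest sessions with first/last timestamp and songs in play order."""
    # Sort ascending so the collected song list follows play order.
    ordered = df.sort_values(['user', 'timestamp'], kind='stable')
    sessions = (
        ordered.groupby(['user', 'Session'])
        .agg(first=('timestamp', 'min'),
             last=('timestamp', 'max'),
             songs=('song', list))
        .reset_index()
    )
    # Session length is the time from the first song's start to the last song's start.
    sessions['Length (mins)'] = (sessions['last'] - sessions['first']) / pd.Timedelta(minutes=1)
    return sessions.nlargest(n, 'Length (mins)').reset_index(drop=True)

# Small sample in the same shape as df above (user_000050 rows only).
sample = pd.DataFrame({
    'user': ['user_000050'] * 6,
    'timestamp': pd.to_datetime([
        '2007-01-25 08:51', '2007-01-25 08:48', '2007-01-24 22:57',
        '2007-01-24 22:53', '2007-01-24 22:48', '2007-01-24 22:42']),
    'song': ['I Hung My Head', 'Slider', 'Joy', 'Crazy Eights',
             'Steady State', 'Maple Leaves (7" Version)'],
    'Session': [1, 1, 0, 0, 0, 0],
})
result = top_sessions(sample, n=10)
print(result)
```

On the full data set you would call top_sessions(df) once the Session column has been added. If "longest" should mean the most songs rather than the longest duration, rank on the length of the songs list instead.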
Answer 1 (score: 0)
This works, but I have reordered the data by user and by ascending timestamp:
t=# do $$
declare
r record;
s text;
begin
for r in (select relnamespace::regnamespace nspname,relname from pg_class where relname like 't%' and relkind = 'r') loop
execute format('select count(*) from %I.%I',r.nspname,r.relname) into s;
raise info '%.%: %', r.nspname,r.relname, s;
end loop;
end;
$$
;
INFO: postgres.tb: 1
INFO: postgres.tt: 0
INFO: public.tt: 0
INFO: postgres.t3: 1
INFO: postgres.testtable: 1
INFO: a.tt: 0
INFO: b.tt: 0
INFO: postgres.tT: 0
INFO: postgres.ta: 1
INFO: postgres.t5: 1
INFO: postgres.tb1: 1
INFO: postgres.tb2: 1
INFO: s1.t: 1
INFO: s2.t: 1
INFO: postgres.test: 1
INFO: public.test: 6
INFO: postgres.t: 9904
DO
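For the reordering this answer describes (sort by user, then by ascending timestamp), a pandas version might look like the sketch below. It is my own illustration, not this answer's code. The function name label_sessions, the column name session, and the gap parameter (set to the 20 minutes from the question) are assumptions.

```python
import pandas as pd

def label_sessions(df, gap=pd.Timedelta(minutes=20)):
    """Number sessions per user after sorting by user and ascending timestamp."""
    out = df.sort_values(['user', 'timestamp']).reset_index(drop=True)
    # Time since the same user's previous song; NaT for each user's first song.
    since_prev = out.groupby('user')['timestamp'].diff()
    # A gap above the threshold opens a new session; count the openings per user.
    out['session'] = (since_prev > gap).astype(int).groupby(out['user']).cumsum()
    return out

plays = pd.DataFrame({
    'user': ['user_000050'] * 4,
    'timestamp': pd.to_datetime(['2007-01-25 08:51', '2007-01-25 08:48',
                                 '2007-01-24 22:57', '2007-01-24 22:53']),
    'song': ['I Hung My Head', 'Slider', 'Joy', 'Crazy Eights'],
})
labelled = label_sessions(plays)
print(labelled)
```

With ascending order the cumulative sum runs forward, so there is no need to reverse the column before and after.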