I have a click log that looks like this:
userID time URL
1 2011-03-1 12:30:01 abc.com
2 2011-03-1 12:30:04 xyz.com
1 2011-03-1 12:30:46 abc.com/new
2 2011-03-1 12:31:02 xyz.com/fun
2 2011-03-1 12:36:08 xyz.com/funner
1 2011-03-1 12:45:46 abc.com/newer
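I want to convert this into clickpath data organized by session, where a session is any series of clicks that starts after a 10-minute gap since that user's last click, because I want to run clickpath analysis. This is the expected result: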
userID sessionStart clicktime Seconds fromPage toPage
1 2011-03-1 12:30:01 2011-03-1 12:30:01 NULL NULL abc.com
1 2011-03-1 12:30:01 2011-03-1 12:30:46 45 abc.com abc.com/new
1 2011-03-1 12:30:01 NULL NULL abc.com/new NULL
1 2011-03-1 12:45:46 2011-03-1 12:45:46 NULL NULL abc.com/newer
1 2011-03-1 12:45:46 NULL NULL abc.com/newer NULL
2 2011-03-1 12:30:04 2011-03-1 12:30:04 NULL NULL xyz.com
2 2011-03-1 12:30:04 2011-03-1 12:31:02 58 xyz.com xyz.com/fun
2 2011-03-1 12:30:04 2011-03-1 12:36:08 306 xyz.com/fun xyz.com/funner
2 2011-03-1 12:30:04 NULL NULL xyz.com/funner NULL
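Note that user 1 has two distinct sessions, because the gap between that user's second and third clicks is more than 10 minutes.
I thought I had found a solution using Hive's windowing features, which are available from version 0.11, but I am on version 0.10, so now I am stuck.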
Answer 0 (score: 0)
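I think you can use Hive's TRANSFORM clause with a custom reducer script. You have to make sure that all rows with the same user_id are processed by the same reducer, using DISTRIBUTE BY, and that those rows reach the script in ascending date order, using SORT BY.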
ADD FILE hdfs:///path/to/your/scripts/reducer_sessionizer.py;
CREATE TABLE clickStream AS
SELECT
  TRANSFORM (a.user_id, a.time, a.url)
  USING 'python reducer_sessionizer.py' AS (user_id, toPage, sessionStart, fromPage)
FROM (SELECT user_id, time, url FROM rawData DISTRIBUTE BY user_id SORT BY user_id, time) a;
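To see what the query is relying on, here is a toy model in Python of the two guarantees. It is only a sketch: `simulate_distribute_sort` and `SAMPLE_ROWS` are made-up names, Python's `hash` stands in for whatever partitioner Hive really uses, and the sort key `(user_id, time)` is an assumption chosen so that a reducer handling several users still sees each user's rows together.

```python
from collections import defaultdict

# Sample rows from the question, in log order: (user_id, time, url).
SAMPLE_ROWS = [
    (1, "2011-03-1 12:30:01", "abc.com"),
    (2, "2011-03-1 12:30:04", "xyz.com"),
    (1, "2011-03-1 12:30:46", "abc.com/new"),
    (2, "2011-03-1 12:31:02", "xyz.com/fun"),
    (2, "2011-03-1 12:36:08", "xyz.com/funner"),
    (1, "2011-03-1 12:45:46", "abc.com/newer"),
]


def simulate_distribute_sort(rows, num_reducers=2):
    """Return one list of rows per (simulated) reducer."""
    buckets = defaultdict(list)
    for row in rows:
        # DISTRIBUTE BY user_id: equal keys always land on the same reducer.
        buckets[hash(row[0]) % num_reducers].append(row)
    # SORT BY: the ordering holds inside each reducer, not globally.
    # Comparing time strings is only safe while they share one consistent
    # format; parse them into datetimes if they do not.
    return [
        sorted(buckets[r], key=lambda row: (row[0], row[1]))
        for r in range(num_reducers)
    ]


for reducer_id, part in enumerate(simulate_distribute_sort(SAMPLE_ROWS)):
    for row in part:
        print(reducer_id, row)
```

Each user ends up in exactly one reducer's list, in time order, which is the precondition the reducer script depends on.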
Your script, for example in Python, reads the dataset line by line and reacts whenever the key (the user) changes:
#!/usr/bin/env python
import sys
import uuid
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=10)   # a longer gap starts a new session
TIME_FORMAT = '%Y-%m-%d %H:%M:%S'

prev_user = prev_time = prev_url = sess_start = None
for line in sys.stdin:
    # Each input line is: userId <TAB> time <TAB> url
    user_id, time_str, url = line.rstrip('\n').split('\t')
    click_time = datetime.strptime(time_str, TIME_FORMAT)
    # Start a new session when the user (the key) changes, or when more than
    # SESSION_GAP has passed since that user's previous click.
    if user_id != prev_user or click_time - prev_time > SESSION_GAP:
        sid = uuid.uuid4().hex   # session id; not emitted below, add it if needed
        prev_url = 'NULL'
        sess_start = time_str
    # Emit: userId, toPage, session start time, fromPage
    sys.stdout.write('\t'.join([user_id, url, sess_start, prev_url]) + '\n')
    # Keep the current row for comparison on the next iteration.
    prev_user, prev_time, prev_url = user_id, click_time, url

(I'm not really a developer, so treat this as a sketch; the date parsing assumes timestamps like 2011-03-1 12:30:01, so adjust TIME_FORMAT to your data.)
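The reducer sketch above emits only the user, the target page, the session start and the source page. The expected output in the question also has a Seconds column and a closing row per session with a NULL toPage. Below is one way that could be added, as a sketch only: `clickpath_rows` and `SAMPLE_LINES` are made-up names, None stands in for NULL, and it assumes a session breaks when the gap is strictly greater than 10 minutes and that the input is already grouped by user and sorted by time.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"
SESSION_GAP = timedelta(minutes=10)

# The question's sample, already grouped by user and sorted by time.
SAMPLE_LINES = [
    "1\t2011-03-1 12:30:01\tabc.com",
    "1\t2011-03-1 12:30:46\tabc.com/new",
    "1\t2011-03-1 12:45:46\tabc.com/newer",
    "2\t2011-03-1 12:30:04\txyz.com",
    "2\t2011-03-1 12:31:02\txyz.com/fun",
    "2\t2011-03-1 12:36:08\txyz.com/funner",
]


def clickpath_rows(lines, gap=SESSION_GAP):
    """Yield (userID, sessionStart, clicktime, Seconds, fromPage, toPage)."""
    prev_user = prev_time = prev_url = sess_start = None
    for line in lines:
        user, time_str, url = line.rstrip("\n").split("\t")
        when = datetime.strptime(time_str, TIME_FORMAT)
        if user != prev_user or when - prev_time > gap:
            if prev_user is not None:
                # Close the previous session with an exit row.
                yield (prev_user, sess_start, None, None, prev_url, None)
            sess_start = time_str
            # Entry row: no source page and no elapsed time yet.
            yield (user, sess_start, time_str, None, None, url)
        else:
            seconds = int((when - prev_time).total_seconds())
            yield (user, sess_start, time_str, seconds, prev_url, url)
        prev_user, prev_time, prev_url = user, when, url
    if prev_user is not None:
        # Exit row for the very last session in the stream.
        yield (prev_user, sess_start, None, None, prev_url, None)


for row in clickpath_rows(SAMPLE_LINES):
    print("\t".join("NULL" if v is None else str(v) for v in row))
```

To use it as the transform script, iterate over `sys.stdin` instead of `SAMPLE_LINES` and list six columns in the query's AS clause.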