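Apologies if my title does not accurately describe the task I am trying to perform.
For a university project I was given the access log of a website. I have dropped the columns I do not need and condensed it to: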
╔══════════╦══════════════════════╦═════════════════╦═════════════╦════════════════╗
║ accessid ║ date_time_in_seconds ║ yg_requester_id ║ referent_id ║ referent_docid ║
╠══════════╬══════════════════════╬═════════════════╬═════════════╬════════════════╣
║ 2449 ║ 2009011621830 ║ 32276 ║ 12648 ║ 1 ║
║ 2776 ║ 2009011622726 ║ 76360 ║ 11070 ║ 1 ║
║ 2804 ║ 2009011622783 ║ 32276 ║ 13845 ║ 1 ║
║ 2894 ║ 2009011623025 ║ 32276 ║ 7222 ║ 1 ║
║ 2895 ║ 2009011623037 ║ 32276 ║ 1530 ║ 1 ║
║ 3000 ║ 2009011623406 ║ 32276 ║ 3728 ║ 1 ║
║ 3019 ║ 2009011623497 ║ 520060 ║ 10356 ║ 1 ║
║ 3245 ║ 2009011625780 ║ 300841 ║ 4607 ║ 1 ║
║ 3274 ║ 2009011628309 ║ 532664 ║ 14377 ║ 1 ║
║ 3275 ║ 2009011628420 ║ 532664 ║ 9097 ║ 1 ║
╚══════════╩══════════════════════╩═════════════════╩═════════════╩════════════════╝
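Originally the timestamp was stored as separate columns, one per unit (year, month, day, hour, minute, second). To make calculations easier, I merged them into date_time_in_seconds, which has the format: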
[0000][00][00][00000]
[YEAR][MONTH][DAY][Number of Seconds since 00:00]
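As a quick check of this layout, here is a minimal Python sketch that unpacks such a value into a regular datetime. The function name decode_packed is hypothetical, and the sketch assumes the last five digits are always the zero-padded seconds since midnight, as the format above suggests.

```python
from datetime import datetime, timedelta

def decode_packed(value: int) -> datetime:
    """Decode a [YEAR][MONTH][DAY][seconds since 00:00] integer into a datetime."""
    # The last five digits are the seconds since midnight (0..86399).
    date_part, seconds = divmod(value, 100_000)
    # What remains is YYYYMMDD.
    year, month_day = divmod(date_part, 10_000)
    month, day = divmod(month_day, 100)
    return datetime(year, month, day) + timedelta(seconds=seconds)

first = decode_packed(2009011621830)
second = decode_packed(2009011622783)
print(first)           # 2009-01-16 06:03:50
print(second - first)  # 0:15:53
```

Note that subtracting two packed values directly only gives a number of seconds when both fall on the same day; decoding first avoids that limitation.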
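accessid is the table entry ID, yg_requester_id is the unique ID of a site visitor, referent_id is the ID of the article they read, and referent_docid indicates the type of article, which is not needed for this task.
Essentially, I want to find the time difference since the same yg_requester_id last accessed a different referent_id. For example, look at this subset of rows from the table above: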
╔══════════╦══════════════════════╦═════════════════╦═════════════╦════════════════╗
║ accessid ║ date_time_in_seconds ║ yg_requester_id ║ referent_id ║ referent_docid ║
╠══════════╬══════════════════════╬═════════════════╬═════════════╬════════════════╣
║ 2449 ║ 2009011621830 ║ 32276 ║ 12648 ║ 1 ║
║ 2776 ║ 2009011622726 ║ 76360 ║ 11070 ║ 1 ║
║ 2804 ║ 2009011622783 ║ 32276 ║ 13845 ║ 1 ║
╚══════════╩══════════════════════╩═════════════════╩═════════════╩════════════════╝
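yg_requester_id 32276 accessed article 12648 at 06:03:50 (21830 seconds after midnight) on 16 January 2009, and then accessed article 13845 at 06:19:43 (22783 seconds after midnight) on the same day. It is therefore safe to assume that the user spent about 15 minutes 53 seconds (953 seconds) reading the first article (id 12648).
What I want to find is the time difference between articles accessed by the same user. Consecutive articles read by a user may not have consecutive accessid values (although accessid always increases). I would also like to cap the read time at about one hour, since the task is to filter out records based on a read time threshold of a variable number of minutes (e.g. 15).
Thanks in advance. Please let me know if more information is needed.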
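Answer 0 (score: 2)
I would use ROW_NUMBER to partition the result set by yg_requester_id, ordered by accessid or by the datetime (assuming you change the date_time_in_seconds column to a regular datetime column, as suggested in the comments). I would then join the result set to itself on the requester and the previous record, and take the difference.
Let me try to write the query without proper data to test against: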
-- Position 1 is each requester's most recent access (ORDER BY ... DESC),
-- so X2 (Position - 1) is the access that came right after X1.
SELECT X1.yg_requester_id,
       DATEDIFF(SECOND, X1.NewDateTimeField, X2.NewDateTimeField) AS TimeDifferenceInSeconds,
       X2.referent_id AS NewArticle,
       X1.referent_id AS FormerArticle
FROM
(
    SELECT ROW_NUMBER() OVER(PARTITION BY yg_requester_id ORDER BY NewDateTimeField DESC) AS Position, NewDateTimeField, yg_requester_id, referent_id
    FROM YourTable
) X1
INNER JOIN
(
    SELECT ROW_NUMBER() OVER(PARTITION BY yg_requester_id ORDER BY NewDateTimeField DESC) AS Position, NewDateTimeField, yg_requester_id, referent_id
    FROM YourTable
) X2 ON X2.yg_requester_id = X1.yg_requester_id AND X2.Position = X1.Position - 1
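To sanity-check this partition-and-pair logic against the sample rows from the question without a database, the same steps can be sketched in plain Python. This is only an illustration under assumptions: the names decode_packed, consecutive_gaps and ROWS are hypothetical, the one-hour cap is interpreted as clamping longer gaps to 3600 seconds, and the 15-minute filter is read as "keep gaps of at least 15 minutes", which the question does not state explicitly.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# (accessid, date_time_in_seconds, yg_requester_id, referent_id) from the question
ROWS = [
    (2449, 2009011621830, 32276, 12648), (2776, 2009011622726, 76360, 11070),
    (2804, 2009011622783, 32276, 13845), (2894, 2009011623025, 32276, 7222),
    (2895, 2009011623037, 32276, 1530), (3000, 2009011623406, 32276, 3728),
    (3019, 2009011623497, 520060, 10356), (3245, 2009011625780, 300841, 4607),
    (3274, 2009011628309, 532664, 14377), (3275, 2009011628420, 532664, 9097),
]

def decode_packed(value):
    """Turn [YEAR][MONTH][DAY][seconds since 00:00] into a datetime."""
    date_part, seconds = divmod(value, 100_000)
    year, month_day = divmod(date_part, 10_000)
    month, day = divmod(month_day, 100)
    return datetime(year, month, day) + timedelta(seconds=seconds)

def consecutive_gaps(rows, cap_seconds=3600):
    """Return (requester, former_article, new_article, seconds) per consecutive pair."""
    # "PARTITION BY yg_requester_id": group the accesses per requester.
    by_requester = defaultdict(list)
    for accessid, packed, requester, referent in rows:
        by_requester[requester].append((decode_packed(packed), accessid, referent))
    gaps = []
    for requester, visits in by_requester.items():
        visits.sort()  # "ORDER BY" time, with accessid as tie-breaker
        # Pair every access with the one right after it (the self-join on Position).
        for (t_prev, _, former), (t_new, _, new) in zip(visits, visits[1:]):
            seconds = int((t_new - t_prev).total_seconds())
            gaps.append((requester, former, new, min(seconds, cap_seconds)))  # assumed cap
    return gaps

gaps = consecutive_gaps(ROWS)
# Assumed reading of the filter: keep reads of at least 15 minutes.
long_reads = [g for g in gaps if g[3] >= 15 * 60]
print(long_reads)  # [(32276, 12648, 13845, 953)]
```

Requesters with a single access produce no pair, which matches the INNER JOIN dropping rows that have no neighbouring Position.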
Answer 1 (score: 0)
This query should retrieve the requester, the referent, and the time (in seconds) the requester spent on that referent:
-- abc: every pair (a, b) where the same requester accessed a different article (a) after b.
-- xyz: for each earlier access b, the smallest later accessid, i.e. the next different article.
-- Note: subtracting date_time_in_seconds values gives seconds only when both accesses are on the same day.
select abc.A_requestor as requestor_id, abc.B_refer as referent_id, abc.A_datetime - abc.B_datetime as time_difference
from
(select a.accessid as A_accessid, b.accessid as B_accessid,
        a.yg_requester_id as A_requestor, a.date_time_in_seconds as A_datetime, a.referent_id as A_refer,
        b.yg_requester_id as B_requestor, b.date_time_in_seconds as B_datetime, b.referent_id as B_refer
 from weblog a
 inner join weblog b
    on a.yg_requester_id = b.yg_requester_id
   and a.date_time_in_seconds > b.date_time_in_seconds
   and a.referent_id != b.referent_id) abc
inner join
(select cte.B_accessid, min(cte.A_accessid) as C_accessid
 from
 (select a.accessid as A_accessid, b.accessid as B_accessid,
         a.yg_requester_id as A_requestor, a.date_time_in_seconds as A_datetime, a.referent_id as A_refer,
         b.yg_requester_id as B_requestor, b.date_time_in_seconds as B_datetime, b.referent_id as B_refer
  from weblog a
  inner join weblog b
     on a.yg_requester_id = b.yg_requester_id
    and a.date_time_in_seconds > b.date_time_in_seconds
    and a.referent_id != b.referent_id) cte
 group by cte.B_accessid) xyz
on xyz.B_accessid = abc.B_accessid and xyz.C_accessid = abc.A_accessid
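The same "next access to a different article" idea can also be written with a correlated subquery, which may be easier to read. The sketch below is only a way to try it on the sample rows: it assumes SQLite through Python's sqlite3 module (not necessarily the database used in the question), keeps the weblog table name from the query above, and uses the hypothetical names ROWS, QUERY and time_differences. Like the query above, it subtracts the packed values directly, so it is only valid for accesses on the same day.

```python
import sqlite3

# Sample rows from the question: accessid, date_time_in_seconds,
# yg_requester_id, referent_id, referent_docid
ROWS = [
    (2449, 2009011621830, 32276, 12648, 1), (2776, 2009011622726, 76360, 11070, 1),
    (2804, 2009011622783, 32276, 13845, 1), (2894, 2009011623025, 32276, 7222, 1),
    (2895, 2009011623037, 32276, 1530, 1), (3000, 2009011623406, 32276, 3728, 1),
    (3019, 2009011623497, 520060, 10356, 1), (3245, 2009011625780, 300841, 4607, 1),
    (3274, 2009011628309, 532664, 14377, 1), (3275, 2009011628420, 532664, 9097, 1),
]

# For each access, look up the earliest later access by the same requester
# to a different article; rows without such an access are dropped.
QUERY = """
SELECT requester_id, referent_id, next_time - this_time AS time_difference
FROM (
    SELECT a.accessid,
           a.yg_requester_id AS requester_id,
           a.referent_id,
           a.date_time_in_seconds AS this_time,
           (SELECT MIN(b.date_time_in_seconds)
              FROM weblog b
             WHERE b.yg_requester_id = a.yg_requester_id
               AND b.date_time_in_seconds > a.date_time_in_seconds
               AND b.referent_id <> a.referent_id) AS next_time
      FROM weblog a
) t
WHERE next_time IS NOT NULL
ORDER BY accessid
"""

def time_differences(rows):
    """Load the rows into an in-memory SQLite table and run QUERY."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE weblog (accessid INTEGER PRIMARY KEY, "
        "date_time_in_seconds INTEGER, yg_requester_id INTEGER, "
        "referent_id INTEGER, referent_docid INTEGER)"
    )
    conn.executemany("INSERT INTO weblog VALUES (?, ?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result

for row in time_differences(ROWS):
    print(row)
```

On the sample rows, the first line printed is (32276, 12648, 953), which agrees with the 06:03:50 to 06:19:43 example in the question.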