SQL query to calculate part of the visit duration from a log table

Asked: 2009-10-23 13:28:57

Tags: sql sql-server sql-server-2005 duration

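I have a table that records the user ID, course, session ID, and request date every time a web page is loaded. I want to calculate the duration per user ID for a given courseid. This is problematic because the timespans overlap.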

The data provided here should result in a duration of 10 minutes for every user in course 1. I can't seem to get it right.

CREATE TABLE PageLogSample (
    id INT NOT NULL PRIMARY KEY IDENTITY
,   userid INT
,   courseid INT
,   sessionid INT
,   requestdate DATETIME
);

TRUNCATE TABLE PageLogSample;

INSERT INTO PageLogSample (userid, courseid, sessionid, requestdate)
-- [0, 10] = 10 minutes
          SELECT 1, 1, 1, '00:00:00'
UNION ALL SELECT 1, 1, 1, '00:10:00'
-- [0, 12] - [3, 5] = 10 minutes
-- or ... [0, 3] + [5, 12] = 10 minutes
UNION ALL SELECT 2, 1, 2, '00:00:00'
UNION ALL SELECT 2, 2, 2, '00:03:00'
UNION ALL SELECT 2, 2, 2, '00:05:00'
UNION ALL SELECT 2, 1, 2, '00:12:00'
-- [0, 12] - [3, 5] = 10 minutes
-- or ... [0, 3] + [5, 12] = 10 minutes
UNION ALL SELECT 3, 1, 3, '00:00:00'
UNION ALL SELECT 3, 2, 3, '00:03:00'
UNION ALL SELECT 3, 2, 3, '00:05:00'
UNION ALL SELECT 3, 1, 3, '00:12:00'
UNION ALL SELECT 3, 2, 3, '00:15:00'
-- [1, 13] - [3, 5] = 10 minutes
-- or ... [1, 3] + [5, 13] = 10 minutes
UNION ALL SELECT 4, 2, 4, '00:00:00'
UNION ALL SELECT 4, 1, 4, '00:01:00'
UNION ALL SELECT 4, 2, 4, '00:03:00'
UNION ALL SELECT 4, 2, 4, '00:05:00'
UNION ALL SELECT 4, 1, 4, '00:13:00'
UNION ALL SELECT 4, 2, 4, '00:15:00'
-- [0, 5] + [10, 15] = 10 minutes
UNION ALL SELECT 5, 1, 5, '00:00:00'
UNION ALL SELECT 5, 1, 5, '00:05:00'
UNION ALL SELECT 5, 1, 6, '00:10:00'
UNION ALL SELECT 5, 1, 6, '00:15:00'
-- [0, 10] = 10 minutes (ignoring everything in between)
UNION ALL SELECT 6, 1, 7, '00:00:00'
UNION ALL SELECT 6, 1, 7, '00:03:00'
UNION ALL SELECT 6, 1, 7, '00:05:00'
UNION ALL SELECT 6, 1, 7, '00:07:00'
UNION ALL SELECT 6, 1, 7, '00:10:00'
-- [0, 11] - [5, 6] = 10 minutes
-- or ... [0, 3] + [7, 11] = 7 minutes (good)
-- or ... [0, 5] + [7, 11] = 9 minutes (better)
UNION ALL SELECT 7, 1, 8, '00:00:00'
UNION ALL SELECT 7, 1, 8, '00:03:00'
UNION ALL SELECT 7, 2, 8, '00:05:00'
UNION ALL SELECT 7, 2, 8, '00:06:00'
UNION ALL SELECT 7, 1, 8, '00:07:00'
UNION ALL SELECT 7, 1, 8, '00:11:00'
-- [0, 1] + [2, 4] + [5, 7] + [8, 13] = 10
UNION ALL SELECT 8, 1, 9, '00:00:00'
UNION ALL SELECT 8, 2, 9, '00:01:00'
UNION ALL SELECT 8, 1, 9, '00:02:00'
UNION ALL SELECT 8, 1, 9, '00:03:00'
UNION ALL SELECT 8, 2, 9, '00:04:00'
UNION ALL SELECT 8, 1, 9, '00:05:00'
UNION ALL SELECT 8, 1, 9, '00:06:00'
UNION ALL SELECT 8, 2, 9, '00:07:00'
UNION ALL SELECT 8, 1, 9, '00:08:00'
UNION ALL SELECT 8, 1, 9, '00:13:00'
;

First, the naive approach. It gives wrong results wherever the time spent in different courses overlaps within a session.

DECLARE @courseid INT;
SET @courseid = 1;

SELECT subquery.userid
, COUNT(DISTINCT subquery.sessionid) AS sessioncount
, SUM(subquery.duration) AS duration
, CASE SUM(subquery.duration) 
    WHEN 10 THEN 'ok' 
    ELSE 'ERROR' 
END
FROM (
    SELECT userid
    , sessionid
    , DATEDIFF(MINUTE, MIN(requestdate), MAX(requestdate)) AS duration
    FROM PageLogSample
    WHERE courseid = @courseid
    GROUP BY userid
    , sessionid
) subquery
GROUP BY subquery.userid
ORDER BY subquery.userid;

-- userid  sessioncount  duration   
-- 1       1             10       ok
-- 2       1             12       ERROR
-- 3       1             12       ERROR
-- 4       1             12       ERROR
-- 5       2             10       ok

Second attempt: avoid the overlaps. This only partly works.

DECLARE @courseid INT;
SET @courseid = 1;

WITH cte (userid, courseid, sessionid, start, finish, duration)
AS (
    SELECT userid
    , courseid
    , sessionid
    , MIN(requestdate)
    , MAX(requestdate)
    , DATEDIFF(MINUTE, MIN(requestdate), MAX(requestdate))
    FROM PageLogSample
    GROUP BY userid
    , courseid
    , sessionid
)
SELECT naive.userid
, naive.sessioncount
, naive.duration AS naiveduration
, correction.duration AS correctionduration
, naive.duration - ISNULL(correction.duration, 0) AS duration
, CASE naive.duration - ISNULL(correction.duration, 0)
    WHEN 10 THEN 'ok' 
    ELSE 'ERROR' 
END
FROM (
    SELECT cte.userid
    , COUNT(DISTINCT cte.sessionid) AS sessioncount
    , SUM(cte.duration) AS duration
    FROM cte
    WHERE cte.courseid = @courseid
    GROUP BY cte.userid
) naive
LEFT JOIN (
    SELECT errors.userid
    , SUM(errors.duration) AS duration
    FROM cte errors
    WHERE errors.courseid <> @courseid
    AND EXISTS (
        SELECT *
        FROM cte
        WHERE cte.userid = errors.userid -- compare only against the same user's spans
        AND cte.start <= errors.start
        AND cte.finish >= errors.finish
        AND cte.courseid = @courseid
    )
    GROUP BY errors.userid
) correction
ON naive.userid = correction.userid
;

-- userid  sessioncount  naiveduration  correctionduration  duration
-- 1       1             10             NULL                10        ok
-- 2       1             12             2                   10        ok
-- 3       1             12             NULL                12        ERROR
-- 4       1             12             NULL                12        ERROR
-- 5       2             10             NULL                10        ok

Update: Ed Harper's comment really made me rethink my approach.

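So here is a third attempt. I first work out which rows represent entering the course and which represent leaving it. Then I take the sum of all end times and subtract the sum of all begin times. I think it is more correct, but it is not perfect.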

DECLARE @courseid INT;
SET @courseid = 1;

WITH numberedcte (rn, id, userid, courseid, sessionid, requestdate)
AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY sessionid, userid ORDER BY id)
    , id
    , userid
    , courseid
    , sessionid
    , requestdate
    FROM PageLogSample
)
, typedcte (rowtype, id, userid, courseid, sessionid, requestdate, nextrequestdate)
AS (
    SELECT CASE
        WHEN previousrequest.courseid = nextrequest.courseid
            THEN 'between'
        WHEN previousrequest.courseid IS NULL
            OR nextrequest.courseid = numberedcte.courseid
            THEN 'begin'
        WHEN nextrequest.courseid IS NULL
            OR previousrequest.courseid = numberedcte.courseid
            THEN 'end'
        ELSE 'error?'
    END AS rowtype
    , numberedcte.id
    , numberedcte.userid
    , numberedcte.courseid
    , numberedcte.sessionid
    , numberedcte.requestdate
    , nextrequest.requestdate
    FROM numberedcte
    LEFT JOIN numberedcte previousrequest
        ON previousrequest.userid = numberedcte.userid
        AND previousrequest.sessionid = numberedcte.sessionid
        AND previousrequest.rn = numberedcte.rn - 1
    LEFT JOIN numberedcte nextrequest
        ON nextrequest.userid = numberedcte.userid
        AND nextrequest.sessionid = numberedcte.sessionid
        AND nextrequest.rn = numberedcte.rn + 1
    WHERE numberedcte.courseid = @courseid
    AND (
        nextrequest.courseid = @courseid
        OR previousrequest.courseid = @courseid
    )
)
, beginsum (userid, value)
AS (
    SELECT userid, SUM(DATEPART(MINUTE, requestdate))
    FROM typedcte
    WHERE rowtype = 'begin'
    GROUP BY userid
)
, endsum (userid, value)
AS (
    SELECT userid, SUM(DATEPART(MINUTE, ISNULL(nextrequestdate, requestdate)))
    FROM typedcte
    WHERE rowtype = 'end'
    GROUP BY userid
)
SELECT beginsum.userid
, endsum.value - beginsum.value AS duration
FROM beginsum
INNER JOIN endsum
    ON beginsum.userid = endsum.userid
;

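The only problem here is that, of the original sample data, I only get output for users 1 and 5. The added user 6 also gives correct output. The added user 7 now gives me a satisfactory output. User 8 is almost perfect; I am missing one minute, from the first row to the second.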

-- userid  duration
-- 1       10
-- 5       10
-- 6       10
-- 7       9
-- 8       9
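
One caveat about the third attempt, offered as a side note rather than as part of the original post: DATEPART(MINUTE, ...) returns only the minute-of-hour component (0 to 59), so the begin and end sums stop being meaningful once a session crosses an hour boundary. A minimal sketch of an alternative, assuming it is acceptable to count minutes from SQL Server's base date (the integer 0 converts to 1900-01-01 00:00), is DATEDIFF(MINUTE, 0, requestdate). The variable @t and its value are illustrative only.

```sql
-- Illustrative value only; '01:05:00' is not part of the sample data.
DECLARE @t DATETIME;
SET @t = '01:05:00';   -- stored as 1900-01-01 01:05:00

SELECT DATEPART(MINUTE, @t)    AS minute_of_hour      -- 5: the hour is dropped
,      DATEDIFF(MINUTE, 0, @t) AS minutes_since_base; -- 65: minutes since 1900-01-01 00:00
```

With real timestamps these values are large, so summing them directly may overflow INT; casting to BIGINT, or summing per-pair DATEDIFF values instead, avoids that.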

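I feel like I am only inches away from getting it completely right. The only durations still missing come from page requests that do not occur in a group. Can anyone help me find a way to pick up those lonely page views?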

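Update: Here is a fourth attempt. I assign a value to each request and sum the values. It does not give me the output I was hoping for, but it looks like it may be good enough.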

DECLARE @courseid INT;
SET @courseid = 1;

WITH numberedcte (rn, userid, courseid, sessionid, requestdate)
AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY sessionid, userid ORDER BY id)
    , userid
    , courseid
    , sessionid
    , requestdate
    FROM PageLogSample
)
, valuecte (value, userid, courseid, sessionid)
AS (
    SELECT CASE
        --alone
        WHEN ( previousrequest.courseid IS NULL
            OR previousrequest.courseid <> numberedcte.courseid
            )
            AND nextrequest.courseid <> numberedcte.courseid
            THEN DATEDIFF(MINUTE, numberedcte.requestdate, nextrequest.requestdate)
        --between
        WHEN previousrequest.courseid = nextrequest.courseid
            THEN 0
        --begin
        WHEN previousrequest.courseid IS NULL
            OR nextrequest.courseid = numberedcte.courseid
            THEN -1 * DATEPART(MINUTE, numberedcte.requestdate)
        --ignored (end with no next request)
        WHEN nextrequest.courseid IS NULL
            AND previousrequest.courseid <> numberedcte.courseid
            THEN 0
        --end
        WHEN nextrequest.courseid IS NULL
            OR previousrequest.courseid = numberedcte.courseid
            THEN DATEPART(MINUTE, ISNULL(nextrequest.requestdate, numberedcte.requestdate))
        --impossible?
        ELSE 0
    END
    , numberedcte.userid
    , numberedcte.courseid
    , numberedcte.sessionid
    FROM numberedcte
    LEFT JOIN numberedcte previousrequest
        ON previousrequest.userid = numberedcte.userid
        AND previousrequest.sessionid = numberedcte.sessionid
        AND previousrequest.rn = numberedcte.rn - 1
    LEFT JOIN numberedcte nextrequest
        ON nextrequest.userid = numberedcte.userid
        AND nextrequest.sessionid = numberedcte.sessionid
        AND nextrequest.rn = numberedcte.rn + 1
    WHERE numberedcte.courseid = @courseid
)
SELECT userid
, courseid
, COUNT(DISTINCT sessionid) AS sessioncount
, SUM(value) AS duration
FROM valuecte
GROUP BY userid
, courseid
ORDER BY userid
;

As you can see, the results are not exactly what I expected.

-- userid  courseid  sessioncount  duration
-- 1       1         1             10
-- 2       1         1              3
-- 3       1         1              6
-- 4       1         1              4
-- 5       1         2             10
-- 6       1         1             10
-- 7       1         1              9
-- 8       1         1             10

On a local copy of my real database, performance is very bad. So if anyone has an idea for writing this more efficiently... shoot.

Update: Performance is up. I added an index and it does the job now.
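
The post does not say which index was added. Purely as an assumption, an index shaped like the self-joins in the queries above (equality on userid and sessionid, ordering by id) might look like the following. The index name is hypothetical.

```sql
-- Hypothetical index; the original post does not say which index was actually created.
-- Key columns follow the join and ROW_NUMBER() ordering; INCLUDE covers the remaining columns read.
CREATE NONCLUSTERED INDEX IX_PageLogSample_user_session
ON PageLogSample (userid, sessionid, id)
INCLUDE (courseid, requestdate);
```

Whether this helps depends on the real data distribution, so comparing the execution plans before and after is the safer test.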

5 answers:

Answer 0 (score: 0)

Sorry, but I think you have a data problem. Looking at the sample data provided, user 2 spends 12 minutes in course 1 and 2 minutes in course 2.

Are you sure you provided the correct data?

Answer 1 (score: 0)

This is as close as I could get. It fails for userid 4.

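As I said in my comment, requestdate is sometimes the start and sometimes the end of time spent in a course, and I cannot see a simple general rule for deriving which role it plays on a given row.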

DECLARE @courseid INT;
SET @courseid = 1;

WITH orderCTE
AS
(
        SELECT *

               ,ROW_NUMBER() OVER (PARTITION BY sessionid
                                   ORDER BY id
                                  ) AS rn
        FROM PageLogSample
        --order by rn
)
,startendCTE
AS
(
        SELECT  CASE WHEN start1.rn = 1
                     THEN start1.courseid
                     ELSE end1.courseid
                 END courseid
                ,start1.sessionid
                ,start1.userid
                ,DATEDIFF(mi,start1.requestdate,end1.requestdate) duration
        FROM orderCTE AS start1
        JOIN orderCTE AS end1
        ON end1.rn = start1.rn + 1
        AND end1.sessionid = start1.sessionid
)
SELECT courseid
       ,COUNT(1) sessionCount
       ,userid
       ,SUM(duration) totalDuration
FROM startendCTE
WHERE courseid = @courseid
GROUP BY courseid
         ,userid;

Answer 2 (score: 0)

This is pretty messy, but it appears to work for CourseID 1. I have not tried it with other courses, so you may want to test that! :D

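The basic premise is that I take the duration between the first and last request for the target CourseID, and then subtract the duration of any sessions that are not for the specified CourseID but whose request times fall between the minimum and maximum request times of the target CourseID. I hope that makes sense.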

The query could definitely be cleaned up, possibly with a CTE or something. Fun problem, BTW! :)

DECLARE @courseid INT;
SET @courseid = 1;

SELECT 
    TargetCourse.UserID, 
    COUNT(Distinct(TargetCourse.SessionID)) as SessionCount,
    SUM(TargetCourse.Duration - Coalesce(OtherCourses.Duration,0)) as Duration
FROM
(
    SELECT 
        TargetCourse.UserID, TargetCourse.SessionID, 
        MIN(TargetCourse.RequestDate) FirstRequest, MAX(TargetCourse.RequestDate) LastRequest, 
        DATEDIFF(MINUTE, MIN(TargetCourse.RequestDate), MAX(TargetCourse.RequestDate)) AS duration
    FROM 
        PageLogSample TargetCourse
    WHERE
        TargetCourse.CourseID = @courseid
    GROUP BY
        TargetCourse.UserID, TargetCourse.SessionID     
) as TargetCourse
LEFT OUTER JOIN
(
    SELECT 
        OtherCourses.UserID, OtherCourses.SessionID, 
        MIN(OtherCourses.RequestDate) AS FirstRequest, MAX(OtherCourses.RequestDate) AS LastRequest, 
        DATEDIFF(MINUTE, MIN(OtherCourses.RequestDate), MAX(OtherCourses.RequestDate)) AS duration
    FROM 
        PageLogSample OtherCourses
    WHERE
        OtherCourses.CourseID <> @courseid AND
        OtherCourses.RequestDate between
            (Select MIN(RequestDate) From PageLogSample T Where T.UserID = OtherCourses.UserID and T.CourseID = @courseid) AND
            (Select MAX(RequestDate) From PageLogSample T Where T.UserID = OtherCourses.UserID and T.CourseID = @courseid)
    GROUP BY
        OtherCourses.UserID, OtherCourses.SessionID 
) as OtherCourses ON
OtherCourses.UserID = TargetCourse.UserID AND
OtherCourses.FirstRequest BETWEEN TargetCourse.FirstRequest and TargetCourse.LastRequest
Group By TargetCourse.UserID

Answer 3 (score: 0)

More sample data, along with logical assumptions about how much time each user spent in each course.

INSERT INTO PageLogSample (userid, courseid, sessionid, requestdate)
-- [0, 10] = 10 minutes
          SELECT 1, 1, 1, '00:00:00'
UNION ALL SELECT 1, 1, 1, '00:10:00'
-- [0, 3] = 3 minutes
-- there is no way to know how long the user was on that last page
UNION ALL SELECT 2, 1, 2, '00:00:00'
UNION ALL SELECT 2, 2, 2, '00:03:00'
UNION ALL SELECT 2, 2, 2, '00:05:00'
UNION ALL SELECT 2, 1, 2, '00:12:00'
-- [0, 3] + [12, 15] = 6 minutes
-- the [5, 12] part was spent on a page of course 2
UNION ALL SELECT 3, 1, 3, '00:00:00'
UNION ALL SELECT 3, 2, 3, '00:03:00'
UNION ALL SELECT 3, 2, 3, '00:05:00'
UNION ALL SELECT 3, 1, 3, '00:12:00'
UNION ALL SELECT 3, 2, 3, '00:15:00'
-- [1, 3] + [13, 15] = 4 minutes
UNION ALL SELECT 4, 2, 4, '00:00:00'
UNION ALL SELECT 4, 1, 4, '00:01:00'
UNION ALL SELECT 4, 2, 4, '00:03:00'
UNION ALL SELECT 4, 2, 4, '00:05:00'
UNION ALL SELECT 4, 1, 4, '00:13:00'
UNION ALL SELECT 4, 2, 4, '00:15:00'
-- [0, 5] + [10, 15] = 10 minutes
UNION ALL SELECT 5, 1, 5, '00:00:00'
UNION ALL SELECT 5, 1, 5, '00:05:00'
UNION ALL SELECT 5, 1, 6, '00:10:00'
UNION ALL SELECT 5, 1, 6, '00:15:00'
-- [0, 10] = 10 minutes (ignoring everything in between)
UNION ALL SELECT 6, 1, 7, '00:00:00'
UNION ALL SELECT 6, 1, 7, '00:03:00'
UNION ALL SELECT 6, 1, 7, '00:05:00'
UNION ALL SELECT 6, 1, 7, '00:07:00'
UNION ALL SELECT 6, 1, 7, '00:10:00'
-- [0, 5] + [7, 11] = 9 minutes
UNION ALL SELECT 7, 1, 8, '00:00:00'
UNION ALL SELECT 7, 1, 8, '00:03:00'
UNION ALL SELECT 7, 2, 8, '00:05:00'
UNION ALL SELECT 7, 2, 8, '00:06:00'
UNION ALL SELECT 7, 1, 8, '00:07:00'
UNION ALL SELECT 7, 1, 8, '00:11:00'
-- [0, 1] + [2, 4] + [5, 7] + [8, 13] = 10
UNION ALL SELECT 8, 1, 9, '00:00:00'
UNION ALL SELECT 8, 2, 9, '00:01:00'
UNION ALL SELECT 8, 1, 9, '00:02:00'
UNION ALL SELECT 8, 1, 9, '00:03:00'
UNION ALL SELECT 8, 2, 9, '00:04:00'
UNION ALL SELECT 8, 1, 9, '00:05:00'
UNION ALL SELECT 8, 1, 9, '00:06:00'
UNION ALL SELECT 8, 2, 9, '00:07:00'
UNION ALL SELECT 8, 1, 9, '00:08:00'
UNION ALL SELECT 8, 1, 9, '00:13:00'
-- there is nothing we can say about either of these requests
-- 0 minutes
UNION ALL SELECT 9, 1, 10, '00:10:00'
UNION ALL SELECT 9, 1, 11, '00:20:00'
;

Now we query the data like this:

DECLARE @courseid INT;
SET @courseid = 1;

WITH numberedcte (rn, userid, courseid, sessionid, requestdate)
AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY sessionid, userid ORDER BY id)
    , userid
    , courseid
    , sessionid
    , requestdate
    FROM PageLogSample
)
, valuecte (value, userid, courseid, sessionid)
AS (
    SELECT CASE
        --alone in session
        WHEN previousrequest.courseid IS NULL
            AND nextrequest.courseid  IS NULL
            THEN 0
        --alone
        WHEN ( previousrequest.courseid IS NULL
            OR previousrequest.courseid <> numberedcte.courseid
            )
            AND nextrequest.courseid <> numberedcte.courseid
            THEN DATEDIFF(MINUTE, numberedcte.requestdate, nextrequest.requestdate)
        --between
        WHEN previousrequest.courseid = nextrequest.courseid
            THEN 0
        --begin
        WHEN previousrequest.courseid IS NULL
            OR nextrequest.courseid = numberedcte.courseid
            THEN -1 * DATEPART(MINUTE, numberedcte.requestdate)
        --ignored (end with no next request)
        WHEN nextrequest.courseid IS NULL
            AND previousrequest.courseid <> numberedcte.courseid
            THEN 0
        --end
        WHEN nextrequest.courseid IS NULL
            OR previousrequest.courseid = numberedcte.courseid
            THEN DATEPART(MINUTE, ISNULL(nextrequest.requestdate, numberedcte.requestdate))
        --impossible?
        ELSE 0
    END
    , numberedcte.userid
    , numberedcte.courseid
    , numberedcte.sessionid
    FROM numberedcte
    LEFT JOIN numberedcte previousrequest
        ON previousrequest.userid = numberedcte.userid
        AND previousrequest.sessionid = numberedcte.sessionid
        AND previousrequest.rn = numberedcte.rn - 1
    LEFT JOIN numberedcte nextrequest
        ON nextrequest.userid = numberedcte.userid
        AND nextrequest.sessionid = numberedcte.sessionid
        AND nextrequest.rn = numberedcte.rn + 1
    WHERE numberedcte.courseid = @courseid
)
SELECT userid
, courseid
, COUNT(DISTINCT sessionid) AS sessioncount
, SUM(value) AS duration
FROM valuecte
GROUP BY userid
, courseid
ORDER BY userid
;

This is the result I get, and I am satisfied with it. Note how the session count stays correct for user 9.

userid  courseid  sessioncount  duration
1       1         1             10
2       1         1              3
3       1         1              6
4       1         1              4
5       1         2             10
6       1         1             10
7       1         1              9
8       1         1             10
9       1         2              0
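
The rule implied by the comments in the sample data (the time between a request and the next request in the same session counts toward the course of the earlier request, and the last request of a session contributes nothing) can also be written more directly. The following is a sketch under that assumption, not this answer's own query; the names numbered, cur and nxt are introduced here for illustration, and ordering by requestdate before id is likewise an assumption.

```sql
DECLARE @courseid INT;
SET @courseid = 1;

WITH numbered AS (
    -- Number the requests of each session in chronological order (id breaks ties).
    SELECT ROW_NUMBER() OVER (PARTITION BY userid, sessionid
                              ORDER BY requestdate, id) AS rn
    , userid
    , courseid
    , sessionid
    , requestdate
    FROM PageLogSample
)
SELECT cur.userid
, cur.courseid
, COUNT(DISTINCT cur.sessionid) AS sessioncount
  -- The last request of a session has no successor: DATEDIFF yields NULL,
  -- SUM skips it, and ISNULL turns an all-NULL sum into 0.
, ISNULL(SUM(DATEDIFF(MINUTE, cur.requestdate, nxt.requestdate)), 0) AS duration
FROM numbered cur
LEFT JOIN numbered nxt
    ON nxt.userid = cur.userid
    AND nxt.sessionid = cur.sessionid
    AND nxt.rn = cur.rn + 1
WHERE cur.courseid = @courseid
GROUP BY cur.userid
, cur.courseid
ORDER BY cur.userid;
```

On the sample data above this should give the same durations and session counts as the table, including 2 sessions and 0 minutes for user 9. Because it uses per-pair DATEDIFF values rather than DATEPART(MINUTE, ...), it is also not limited to requests that fall within a single hour.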

Answer 4 (score: -1)

"The data is correct, but it is hard to get the relevant meaning out of it."

I am tempted to respond that this is a contradiction in terms. Data whose meaning you do not know is not data.

As for your original question:

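What you need is a DBMS that offers decent support for INTERVAL types. No SQL system plays in that league. Apart from a few tutorial systems, my own DBMS (which I will not push any further in this context, hence no link) is the only one I know of that offers the kind of support this problem needs.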

If you are interested, search for "interval types", "packed normal form", and "temporal data", and you will eventually come across it.