User sessions based on user activity

Time: 2016-03-24 18:22:45

Tags: sql sql-server

Preface: sorry about the title; it doesn't really describe the problem. Feel free to edit it.

Data showing user activity:

actionid userid subjectid dtm
1        1      1         2016-01-04 08:00:00.000
2        1      1         2016-01-04 08:10:00.000
3        1      1         2016-01-04 08:20:00.000
4        1      1         2016-01-04 10:10:00.000
5        1      1         2016-01-04 10:25:00.000
11       1      2         2016-01-05 08:00:00.000
12       1      2         2016-01-25 08:10:00.000
13       1      2         2016-01-05 08:20:00.000
14       1      2         2016-01-05 10:20:00.000
15       1      2         2016-01-05 10:25:00.000
16       2      1         2016-01-05 12:00:00.000
17       2      1         2016-01-05 12:00:00.000
18       2      3         2016-01-05 12:04:00.000
19       2      3         2016-01-05 12:05:00.000
20       2      1         2016-01-05 12:07:30.000

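A "session" is defined as a set of user actions performed in one sitting. A session starts when a user begins working on a particular subject and ends when either (a) the user switches to a different subject or (b) 60 minutes have passed.

The goal is to identify the sessions and assign session IDs to them, all in SQL. Standard SQL is preferred, but since this data lives on Microsoft SQL Server, T-SQL features (such as CROSS APPLY, LEAD, LAG, etc.) are also acceptable.

The desired result would be: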

actionid userid subjectid dtm                     sessionid
1        1      1         2016-01-04 08:00:00.000 1
2        1      1         2016-01-04 08:10:00.000 1
3        1      1         2016-01-04 08:20:00.000 1
4        1      1         2016-01-04 10:10:00.000 2
5        1      1         2016-01-04 10:25:00.000 2
11       1      2         2016-01-05 08:00:00.000 3
12       1      2         2016-01-25 08:10:00.000 3
13       1      2         2016-01-05 08:20:00.000 3
14       1      2         2016-01-05 10:20:00.000 4
15       1      2         2016-01-05 10:25:00.000 4
16       2      1         2016-01-05 12:00:00.000 5
17       2      1         2016-01-05 12:00:00.000 5
18       2      3         2016-01-05 12:04:00.000 6
19       2      3         2016-01-05 12:05:00.000 6
20       2      1         2016-01-05 12:07:30.000 7

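For clarity, I'll explain the logic again using this example:

Action #4 does not belong to session #1 because it is more than 1 hour after the first action in that session. Session #2 ends when the user switches to subject 2. Actions #16 and #17 have the same timestamp, which is fine. Action #20 starts a new session even though it is within an hour of action #17, because the user worked on a different subject in between (actions #18 and #19).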
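The rule described above can be written down as a small reference implementation, which is handy for checking the output of any SQL solution. The following is only a sketch in Python, not part of the original question. It assumes that rows are processed per user in time order (ties broken by actionid), that the 60-minute limit is measured from the first action of the session, and that "more than 60 minutes" means strictly greater. The names `assign_sessions`, `ts` and `sample` are made up for illustration, and the sample contains only actions 1-5 and 16-20, so its session numbers run 1-5 rather than matching the table above.

```python
from datetime import datetime, timedelta

SESSION_LIMIT = timedelta(minutes=60)  # rule (b); strict ">" is an assumption


def assign_sessions(rows):
    """rows: iterable of (actionid, userid, subjectid, dtm) tuples.

    Returns a dict {actionid: sessionid}, numbering sessions from 1.
    """
    # Assumption: order by user, then time, then actionid to break ties.
    ordered = sorted(rows, key=lambda r: (r[1], r[3], r[0]))
    sessions = {}
    session_id = 0
    current = None  # (userid, subjectid, start time) of the open session
    for actionid, userid, subjectid, dtm in ordered:
        starts_new = (
            current is None
            or userid != current[0]               # different user
            or subjectid != current[1]            # rule (a): subject switch
            or dtm - current[2] > SESSION_LIMIT   # rule (b): over 60 minutes
        )
        if starts_new:
            session_id += 1
            current = (userid, subjectid, dtm)
        sessions[actionid] = session_id
    return sessions


def ts(text):
    """Parse a timestamp such as '2016-01-04 08:00:00'."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


# Subset of the question's data: actions 1-5 and 16-20.
sample = [
    (1, 1, 1, ts("2016-01-04 08:00:00")),
    (2, 1, 1, ts("2016-01-04 08:10:00")),
    (3, 1, 1, ts("2016-01-04 08:20:00")),
    (4, 1, 1, ts("2016-01-04 10:10:00")),
    (5, 1, 1, ts("2016-01-04 10:25:00")),
    (16, 2, 1, ts("2016-01-05 12:00:00")),
    (17, 2, 1, ts("2016-01-05 12:00:00")),
    (18, 2, 3, ts("2016-01-05 12:04:00")),
    (19, 2, 3, ts("2016-01-05 12:05:00")),
    (20, 2, 1, ts("2016-01-05 12:07:30")),
]

result = assign_sessions(sample)
```

On this subset, actions 1-3 share one session, 4-5 form the next, and user 2's actions split into three sessions (16-17, 18-19, 20), which mirrors the grouping in the desired result.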

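1 answer:

Answer 0: (score: 1)

If you are on SQL Server 2012 or later, you will definitely want to use LAG(). Below is a comparison of the old way and the LAG() window function.

The old way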

-- Pre-2012 approach: number the rows, then self-join each row to the row before it.
SELECT
    this.actionid,
    this.userid,
    this.subjectid,
    this.dtm,
    -- 1 when the subject differs from the previous row, or when there is no previous row
    UserSubjectBoundryMarker = CASE WHEN prev.subjectid <> this.subjectid OR prev.subjectid IS NULL THEN 1 ELSE 0 END
FROM
(
    SELECT
        *,
        ID = ROW_NUMBER() OVER (ORDER BY userid, actionid)
    FROM
        mytable
) AS this
LEFT OUTER JOIN
(
    SELECT
        *,
        ID = ROW_NUMBER() OVER (ORDER BY userid, actionid)
    FROM
        mytable
) AS prev
    ON prev.ID = this.ID - 1
ORDER BY
    this.userid, this.dtm

Using LAG()

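SELECT
    *,
    -- LAG(subjectid, 1, 0) returns the previous row's subjectid, or 0 for the first row
    -- (this relies on 0 never being a real subjectid)
    UserSubjectBoundryMarker = CASE WHEN LAG(subjectid, 1, 0) OVER (ORDER BY userid, actionid) <> this.subjectid THEN 1 ELSE 0 END
FROM
    mytable this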
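To make the LAG() comparison concrete, here is a small Python sketch of what `LAG(subjectid, 1, 0)` computes over rows that are already ordered by userid and actionid. The helper `lag` is hypothetical and only mimics the SQL function for a plain list; the input is the subjectid column of actions 16-20 from the question.

```python
def lag(values, offset=1, default=None):
    """Mimic SQL LAG over an already ordered list of values."""
    return [
        values[i - offset] if i >= offset else default
        for i in range(len(values))
    ]


# subjectid for actions 16..20, ordered by userid, actionid
subjects = [1, 1, 3, 3, 1]

# Previous row's subject, with 0 standing in for "no previous row"
previous = lag(subjects, 1, 0)

# 1 where the subject differs from the previous row, else 0
markers = [1 if prev != cur else 0 for prev, cur in zip(previous, subjects)]
```

Here `previous` is `[0, 1, 1, 3, 3]` and `markers` is `[1, 0, 1, 0, 1]`: a boundary is flagged at actions 16, 18 and 20.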

Solution

1. Mark the boundaries where the subject changes with a marker.
2. Assign a group ID to each change using max(change=1) + max(actionid).
3. Within the result of step 2, mark the boundaries where the time limit is exceeded.
4. Assign a group ID to those changes using max(change=1) + max(result of #2).
5. Apply a dense rank to the outer groups from #4, in sequential order.

SELECT
    E.actionid,
    E.userid,
    E.subjectid,
    E.dtm,
    -- Step 5: number the resulting groups sequentially
    sessionid = DENSE_RANK() OVER (ORDER BY userid, subjectid, SubjectChangeGroupID, SubjectHourChangeBoundry, SubjectHourChangeGroupID)
FROM
(
    -- Step 4: group ID for the time-limit boundaries
    SELECT
        *,
        SubjectHourChangeGroupID =
            MAX(SubjectHourChangeBoundry) OVER (PARTITION BY userid, subjectid, SubjectChangeGroupID, SubjectHourChangeBoundry)
            +
            MAX(SubjectChangeGroupID) OVER (PARTITION BY userid, subjectid, SubjectChangeGroupID, SubjectHourChangeBoundry)
    FROM
    (
        -- Step 3: flag rows more than 60 minutes after the first action of their subject group.
        -- DATEDIFF(HOUR, ...) counts hour boundaries crossed, not elapsed time, so compare in minutes.
        SELECT
            *,
            SubjectHourChangeBoundry = CASE WHEN DATEDIFF(MINUTE, MIN(C.dtm) OVER (PARTITION BY SubjectChangeGroupID), C.dtm) > 60 THEN 1 ELSE 0 END
        FROM
        (
            -- Step 2: group ID per userid and subjectid
            SELECT
                *,
                SubjectChangeGroupID =
                    MAX(UserSubjectBoundryMarker) OVER (PARTITION BY userid, subjectid)
                    +
                    MAX(actionid) OVER (PARTITION BY userid, subjectid)
            FROM
            (
                -- Step 1: 1 where the subject differs from the previous row
                SELECT *, UserSubjectBoundryMarker = CASE WHEN LAG(A.subjectid, 1, 0) OVER (ORDER BY A.userid, A.actionid) <> A.subjectid THEN 1 ELSE 0 END
                FROM
                    mytable A
            ) AS B
        ) AS C
    ) AS D
) AS E
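
A common alternative for step 2 is to turn the boundary markers into group IDs with a running total, so that each consecutive run of rows gets its own ID. This is not what the query above does: it partitions by userid and subjectid, which puts every row for the same user and subject into one group even when another subject comes in between, as with actions 16, 17 and 20. The sketch below shows the running-total idea in Python on the markers for actions 16-20. In T-SQL the same idea would likely be a `SUM(...) OVER (ORDER BY ... ROWS UNBOUNDED PRECEDING)` over the marker column, which is an assumption about how one might adapt the query rather than something the answer states.

```python
from itertools import accumulate

actionids = [16, 17, 18, 19, 20]
markers = [1, 0, 1, 0, 1]  # 1 where the subject differs from the previous row

# A running total of the markers gives each consecutive run its own ID.
group_ids = list(accumulate(markers))

# Map each action to the run it belongs to.
groups = dict(zip(actionids, group_ids))
```

With these inputs, actions 16 and 17 land in run 1, actions 18 and 19 in run 2, and action 20 in run 3, which matches the way the desired result separates them.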