Preface: apologies for the title, it doesn't really describe the problem. Feel free to edit it.
Data showing user activity:
actionid userid subjectid dtm
1 1 1 2016-01-04 08:00:00.000
2 1 1 2016-01-04 08:10:00.000
3 1 1 2016-01-04 08:20:00.000
4 1 1 2016-01-04 10:10:00.000
5 1 1 2016-01-04 10:25:00.000
11 1 2 2016-01-05 08:00:00.000
12 1 2 2016-01-25 08:10:00.000
13 1 2 2016-01-05 08:20:00.000
14 1 2 2016-01-05 10:20:00.000
15 1 2 2016-01-05 10:25:00.000
16 2 1 2016-01-05 12:00:00.000
17 2 1 2016-01-05 12:00:00.000
18 2 3 2016-01-05 12:04:00.000
19 2 3 2016-01-05 12:05:00.000
20 2 1 2016-01-05 12:07:30.000
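To experiment with the rows above, they can be loaded into a table. This is only a sketch: the table name `mytable` is borrowed from the answer below, and the column types (INT, DATETIME) are assumptions, since the question does not give the schema.

```sql
-- Hypothetical schema: column types are assumed, not taken from the question.
CREATE TABLE mytable (
    actionid  INT      NOT NULL PRIMARY KEY,
    userid    INT      NOT NULL,
    subjectid INT      NOT NULL,
    dtm       DATETIME NOT NULL
);

-- Values copied verbatim from the listing above.
INSERT INTO mytable (actionid, userid, subjectid, dtm) VALUES
    ( 1, 1, 1, '2016-01-04T08:00:00.000'),
    ( 2, 1, 1, '2016-01-04T08:10:00.000'),
    ( 3, 1, 1, '2016-01-04T08:20:00.000'),
    ( 4, 1, 1, '2016-01-04T10:10:00.000'),
    ( 5, 1, 1, '2016-01-04T10:25:00.000'),
    (11, 1, 2, '2016-01-05T08:00:00.000'),
    (12, 1, 2, '2016-01-25T08:10:00.000'),
    (13, 1, 2, '2016-01-05T08:20:00.000'),
    (14, 1, 2, '2016-01-05T10:20:00.000'),
    (15, 1, 2, '2016-01-05T10:25:00.000'),
    (16, 2, 1, '2016-01-05T12:00:00.000'),
    (17, 2, 1, '2016-01-05T12:00:00.000'),
    (18, 2, 3, '2016-01-05T12:04:00.000'),
    (19, 2, 3, '2016-01-05T12:05:00.000'),
    (20, 2, 1, '2016-01-05T12:07:30.000');
```

The `YYYY-MM-DDThh:mm:ss.mmm` literal form is used because SQL Server parses it the same way regardless of the session's language and date-format settings.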
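A "session" is defined as a set of user actions completed in one sitting. A session starts when a user begins working on a particular subject, and ends when (a) the user switches to a different subject or (b) 60 minutes have passed.
The goal is to identify sessions and assign session IDs to them, all in SQL. Standard SQL is preferred, but since this data lives on Microsoft SQL Server, T-SQL features (CROSS APPLY, LEAD, LAG, etc.) are acceptable too.
The desired result would be: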
actionid userid subjectid dtm sessionid
1 1 1 2016-01-04 08:00:00.000 1
2 1 1 2016-01-04 08:10:00.000 1
3 1 1 2016-01-04 08:20:00.000 1
4 1 1 2016-01-04 10:10:00.000 2
5 1 1 2016-01-04 10:25:00.000 2
11 1 2 2016-01-05 08:00:00.000 3
12 1 2 2016-01-25 08:10:00.000 3
13 1 2 2016-01-05 08:20:00.000 3
14 1 2 2016-01-05 10:20:00.000 4
15 1 2 2016-01-05 10:25:00.000 4
16 2 1 2016-01-05 12:00:00.000 5
17 2 1 2016-01-05 12:00:00.000 5
18 2 3 2016-01-05 12:04:00.000 6
19 2 3 2016-01-05 12:05:00.000 6
20 2 1 2016-01-05 12:07:30.000 7
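For clarity, I'll explain the logic again using this example:
Action #4 does not belong to session #1, because it comes more than 1 hour after the first action in that session. Session #2 ends when the user switches to subject 2. Actions #16 and #17 have the same time, which is fine. Action #20 starts a new session even though it is within an hour of action #17, because the user switched subjects between #18 and #20.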
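The two time checks described above can be reproduced with `DATEDIFF`. This is just an illustrative calculation on timestamps taken from the sample data; the column aliases are made up for the example.

```sql
SELECT
    -- Action #1 (first action of session #1) to action #4:
    -- 130 minutes, so the 60-minute limit is exceeded and a new session starts.
    DATEDIFF(MINUTE, '2016-01-04T08:00:00', '2016-01-04T10:10:00') AS minutes_1_to_4,
    -- Action #17 to action #20: 7 minute boundaries crossed, well within the limit.
    -- Action #20 still starts a new session, because the subject changed in between.
    DATEDIFF(MINUTE, '2016-01-05T12:00:00', '2016-01-05T12:07:30') AS minutes_17_to_20;
```

Note that `DATEDIFF` counts the boundaries crossed for the given unit rather than rounding elapsed time, which is why the second value is 7 and not 7.5.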
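Answer 0 (score: 1)
If you are on SQL Server 2012 or later, you will definitely want to use LAG(). Below is the old approach, followed by the same thing done with the LAG() window function for comparison.
Old way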
SELECT
    this.actionid,
    this.userid,
    this.subjectid,
    this.dtm,
    -- 1 when the subject differs from the previous row, or when there is no previous row
    UserSubjectBoundryMarker = CASE WHEN prev.subjectid <> this.subjectid OR prev.subjectid IS NULL THEN 1 ELSE 0 END
FROM
    (
        -- number every row in (userid, actionid) order
        SELECT
            *,
            ID = ROW_NUMBER() OVER (ORDER BY userid, actionid)
        FROM
            mytable
    ) AS this
    LEFT OUTER JOIN
    (
        SELECT
            *,
            ID = ROW_NUMBER() OVER (ORDER BY userid, actionid)
        FROM
            mytable
    ) AS prev
        -- pair each row with the row numbered immediately before it
        ON prev.ID = this.ID - 1
ORDER BY
    this.userid, this.dtm
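Using LAG()
SELECT
    *,
    -- LAG(subjectid, 1, 0) returns 0 when there is no previous row,
    -- so the first row is marked as a boundary (assuming no subjectid is 0)
    UserSubjectBoundryMarker = CASE WHEN LAG(subjectid, 1, 0) OVER (ORDER BY userid, actionid) <> this.subjectid THEN 1 ELSE 0 END
FROM
    mytable this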
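One thing to watch: the window above is ordered by `userid, actionid` but not partitioned by user, so a user's first row is compared with the previous user's last row. A possible variant, offered as a sketch rather than as the answer's own method, partitions by `userid` so that each user's first action is always marked. It assumes `subjectid` is never NULL.

```sql
SELECT
    t.*,
    UserSubjectBoundryMarker =
        CASE
            -- Same subject as this user's previous action: not a boundary.
            WHEN LAG(t.subjectid) OVER (PARTITION BY t.userid ORDER BY t.actionid) = t.subjectid
            THEN 0
            -- Subject changed, or LAG() returned NULL because this is the user's first action.
            ELSE 1
        END
FROM mytable AS t
ORDER BY t.userid, t.actionid;
```

Leaving out the default argument of `LAG()` avoids depending on a sentinel value such as 0 that might collide with a real `subjectid`.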
Solution
1. Mark dynamic subject group boundaries with a marker.
2. Apply a dynamic group id to changes using max(change=1) + max(actionid)
3. Mark dynamic hour group boundaries to the result of 2 with a marker.
4. Apply a dynamic group id to the changes using max(change=1) + max(result of #2).
5. Apply a dense rank to your outer group from #4 in a sequential order.
SELECT
    e.actionid,
    e.userid,
    e.subjectid,
    e.dtm,
    -- step 5: number the resulting groups sequentially
    sessionid = DENSE_RANK() OVER (ORDER BY userid, subjectid, SubjectChangeGroupID, SubjectHourChangeBoundry, SubjectHourChangeGroupID)
FROM
(
    SELECT
        *,
        -- step 4: group id for the time split inside each subject group
        SubjectHourChangeGroupID =
            MAX(SubjectHourChangeBoundry) OVER (PARTITION BY userid, subjectid, SubjectChangeGroupID, SubjectHourChangeBoundry)
            +
            MAX(SubjectChangeGroupID) OVER (PARTITION BY userid, subjectid, SubjectChangeGroupID, SubjectHourChangeBoundry)
    FROM
    (
        SELECT
            *,
            -- step 3: mark rows more than 60 minutes after the first row of their subject group
            SubjectHourChangeBoundry = CASE WHEN DATEDIFF(MINUTE, MIN(C.dtm) OVER (PARTITION BY SubjectChangeGroupID), C.dtm) > 60 THEN 1 ELSE 0 END
        FROM
        (
            SELECT
                *,
                -- step 2: one group id per (userid, subjectid) pair
                SubjectChangeGroupID =
                    MAX(UserSubjectBoundryMarker) OVER (PARTITION BY userid, subjectid)
                    +
                    MAX(actionid) OVER (PARTITION BY userid, subjectid)
            FROM
            (
                -- step 1: mark rows whose subject differs from the previous row
                SELECT *, UserSubjectBoundryMarker = CASE WHEN LAG(a.subjectid, 1, 0) OVER (ORDER BY a.userid, a.actionid) <> a.subjectid THEN 1 ELSE 0 END
                FROM
                    mytable A
            ) AS B
        ) AS C
    ) AS D
) AS E
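The question anchors the 60-minute limit to the first action of each session, and treats a return to an earlier subject as a new session. The query above partitions by `(userid, subjectid)`, so it does not separate those cases. One alternative, shown here as a sketch and not as the answer's own method, is a recursive CTE that walks each user's actions in order and carries the current session's start time along. The names `ordered`, `walk`, `rn`, `session_start` and `is_new` are made up for the example, and ordering by `dtm, actionid` is an assumption about how ties should be broken.

```sql
WITH ordered AS (
    -- Number each user's actions chronologically.
    SELECT actionid, userid, subjectid, dtm,
           rn = ROW_NUMBER() OVER (PARTITION BY userid ORDER BY dtm, actionid)
    FROM mytable
),
walk AS (
    -- Anchor: each user's first action opens a session.
    SELECT actionid, userid, subjectid, dtm, rn,
           session_start = dtm,
           is_new = 1
    FROM ordered
    WHERE rn = 1

    UNION ALL

    -- Step: compare the next action with the previous one and with the session start.
    SELECT o.actionid, o.userid, o.subjectid, o.dtm, o.rn,
           session_start = CASE
                               WHEN o.subjectid <> w.subjectid
                                 OR DATEDIFF(MINUTE, w.session_start, o.dtm) > 60
                               THEN o.dtm             -- a new session starts here
                               ELSE w.session_start   -- still in the same session
                           END,
           is_new = CASE
                        WHEN o.subjectid <> w.subjectid
                          OR DATEDIFF(MINUTE, w.session_start, o.dtm) > 60
                        THEN 1
                        ELSE 0
                    END
    FROM walk AS w
    JOIN ordered AS o
        ON o.userid = w.userid
       AND o.rn = w.rn + 1
)
SELECT actionid, userid, subjectid, dtm,
       -- A running count of session starts gives sequential session ids.
       sessionid = SUM(is_new) OVER (ORDER BY userid, rn ROWS UNBOUNDED PRECEDING)
FROM walk
ORDER BY userid, rn
OPTION (MAXRECURSION 0);  -- recursion depth equals a user's action count; the default limit is 100
```

For actions #1 to #5 of the sample data this assigns sessions 1, 1, 1, 2, 2. The recursion processes one row per step, so on large tables it is likely to be slower than a pure window-function approach.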