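Generally, I need to associate (group) records that were created within similar time periods. If it helps, think of the example below as clickstream data with no session ID, where I need to build those sessions.
I have the following dataset: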
UserId INT,
EventId INT,
DateCreated DATETIME,
BlockId INT
Assume the following data:
{123, 111, '2009-12-01 9:15am', NULL}
{123, 222, '2009-12-01 9:20am', NULL}
{123, 333, '2009-12-01 9:25am', NULL}
{123, 444, '2009-12-03 2:30pm', NULL}
{123, 555, '2009-12-03 2:32pm', NULL}
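What I need to do is partition these events, per user, into temporal buckets. A business rule says that any gap > 30 minutes should start a new bucket. In the example above, events 111-333 form one block, i.e. no more than 30 minutes separates them. Likewise, events 444-555 form a second block.
My current solution uses cursors and is very slow (and therefore unsustainable for the amount of data I need to process). I can post the code, but it's pretty simple.
Any ideas?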
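Answer 0: (score: 1)
Based on the comment thread,
A. A bucket is defined by its first record, and the first record of each bucket is any row whose DateCreated is more than 30 minutes after the latest earlier DateCreated (the immediately preceding record).
B. The remaining rows in a bucket are all rows with a DateCreated on or after that first row, whose DateCreated is less than 30 minutes after the previous row's, and for which no disqualifying (i.e. new-bucket-defining) row exists since the specified bucket-defining row.
In English:
Select the DateCreated of those records whose DateCreated is more than 30 minutes after the previous DateCreated, together with an aggregate function of your choice over all other records in the table whose DateCreated is after that bucket-defining date, is less than 30 minutes after the immediately preceding DateCreated, and for which no record with a gap greater than 30 minutes exists between the bucket-defining DateCreated and this one.
In SQL: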
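-- "Table" is a placeholder for your table name; add a UserId match to each subquery to bucket per user
Select Z.BucketDefinitionDate , Count(*) RowsInBucket
From (Select Distinct DateCreated BucketDefinitionDate
From Table Ti
-- IsNull lets the very first row (no earlier row) qualify as a bucket start
Where DateCreated > DateAdd(minute, 30,
IsNull((Select Max(DateCreated) From Table
Where DateCreated < Ti.DateCreated), '19000101'))) Z
Join Table B
-- >= so that the bucket-defining row is counted in its own bucket
On B.DateCreated >= Z.BucketDefinitionDate
And Not Exists
-- no later bucket-defining row Tx between the bucket start and B
(Select * From Table Tx
Where Tx.DateCreated > Z.BucketDefinitionDate
And Tx.DateCreated <= B.DateCreated
And Tx.DateCreated > DateAdd(minute, 30,
(Select Max(DateCreated) From Table
Where DateCreated < Tx.DateCreated)))
Group By Z.BucketDefinitionDate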
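For comparison, rules A and B can be sketched procedurally. This is only an illustration in Python, not the code from this answer: the function name assign_blocks and the tuple layout are assumptions made for the example, and block numbers restart at 1 for each user.

```python
from datetime import datetime, timedelta
from itertools import groupby

GAP = timedelta(minutes=30)  # business rule from the question


def assign_blocks(events):
    """events: iterable of (user_id, event_id, date_created) tuples.

    Returns a list of (user_id, event_id, date_created, block_id),
    ordered by user and then by date_created.
    """
    out = []
    ordered = sorted(events, key=lambda e: (e[0], e[2]))
    for user_id, rows in groupby(ordered, key=lambda e: e[0]):
        block_id = 0
        prev = None
        for _, event_id, created in rows:
            # Rule A: the first row, or a gap of more than 30 minutes,
            # opens a new block. Rule B: anything else joins the current one.
            if prev is None or created - prev > GAP:
                block_id += 1
            out.append((user_id, event_id, created, block_id))
            prev = created
    return out


# Sample rows from the question
events = [
    (123, 111, datetime(2009, 12, 1, 9, 15)),
    (123, 222, datetime(2009, 12, 1, 9, 20)),
    (123, 333, datetime(2009, 12, 1, 9, 25)),
    (123, 444, datetime(2009, 12, 3, 14, 30)),
    (123, 555, datetime(2009, 12, 3, 14, 32)),
]

for row in assign_blocks(events):
    print(row)
```

On the question's sample rows, events 111-333 land in block 1 and events 444-555 in block 2.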
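Answer 1: (score: 1)
Hopefully this gets you heading in the right direction. If you're in an SP, then using table variables for the StartTimes and EndTimes should make the query easier to read and understand. This will give you the start and end times for your batches, and then you just join back to your table.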
;WITH StartTimes AS
(
SELECT DISTINCT
T1.UserID,
T1.DateCreated AS StartTime
FROM
My_Table T1
-- Look for an earlier event by the same user within the preceding 30 minutes
LEFT OUTER JOIN My_Table T2 ON
T2.UserID = T1.UserID AND
T2.DateCreated >= DATEADD(mi, -30, T1.DateCreated) AND
T2.DateCreated < T1.DateCreated
WHERE
T2.UserID IS NULL -- none found, so T1 starts a batch
)
SELECT
ST.UserID,
ST.StartTime,
ET.EndTime
FROM
(
SELECT DISTINCT
T3.UserID,
T3.DateCreated AS EndTime
FROM
My_Table T3
-- Look for a later event by the same user within the following 30 minutes
LEFT OUTER JOIN My_Table T4 ON
T4.UserID = T3.UserID AND
T4.DateCreated <= DATEADD(mi, 30, T3.DateCreated) AND
T4.DateCreated > T3.DateCreated
WHERE
T4.UserID IS NULL -- none found, so T3 ends a batch
) AS ET
INNER JOIN StartTimes ST ON
ST.UserID = ET.UserID AND
ST.StartTime <= ET.EndTime
-- Keep only the latest start time at or before each end time
LEFT OUTER JOIN StartTimes ST2 ON
ST2.UserID = ET.UserID AND
ST2.StartTime <= ET.EndTime AND
ST2.StartTime > ST.StartTime
WHERE
ST2.StartTime IS NULL
Answer 2: (score: 0)
Something you could try is
DECLARE @TABLE TABLE(
ID INT,
EventID INT,
DateCreated DATETIME
)
INSERT INTO @TABLE SELECT 123, 111, '2009-12-01 9:15am'
INSERT INTO @TABLE SELECT 123, 222, '2009-12-01 9:20am'
INSERT INTO @TABLE SELECT 123, 333, '2009-12-01 9:25am'
INSERT INTO @TABLE SELECT 123, 444, '2009-12-03 2:30pm'
INSERT INTO @TABLE SELECT 123, 555, '2009-12-03 2:32pm'
SELECT ID,
DATEADD(dd, DATEDIFF(dd,0,DateCreated), 0) DayVal,
DATEPART(hh, DateCreated) HourPart,
FLOOR(DATEPART(mi, DateCreated) / 30.) MinBucket
FROM @TABLE
Now you can group by DayVal, HourPart and MinBucket.
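Note that grouping this way produces fixed half-hour windows, which is not quite the same as the gap rule in the question. A small Python sketch of the same key makes the difference visible; half_hour_key is a name made up for this illustration, and the 9:35 timestamp is a hypothetical event that is not in the question's data.

```python
from datetime import datetime


def half_hour_key(ts):
    # Mirrors DayVal, HourPart and MinBucket from the query above.
    return (ts.date(), ts.hour, ts.minute // 30)


a = datetime(2009, 12, 1, 9, 25)  # event 333 from the question
b = datetime(2009, 12, 1, 9, 35)  # hypothetical event, only 10 minutes later

# Both events are within 30 minutes of each other, yet they fall on
# opposite sides of the 9:30 boundary and get different keys.
print(half_hour_key(a) == half_hour_key(b))  # False
```

So two events only minutes apart can be split across windows. Whether that matters depends on how strictly the 30-minute gap rule has to be applied.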
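Answer 3: (score: 0)
I think I have something for you. It isn't a cool single query like the one Tom H posted, but it seems to work. It uses a table variable as a working table.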
declare @table table(
id int identity(1,1),
userId int,
eventId int,
dateCreated datetime,
bucket int
)
insert into @table select 123, 111, '2009-12-01 9:15am', 0
-- etc... insert more rows in dateCreated order - note that the 'bucket' field is set to 0
declare @next_bucket int
-- start at 2: the first block is filled in as bucket 1 by the second update below
set @next_bucket = 2
update @table
set bucket = @next_bucket, @next_bucket = @next_bucket + 1
from @table as [current]
where datecreated > dateadd(mi, 30, (select datecreated from @table as previous where [current].id = previous.id + 1))
update @table
set bucket =
coalesce(( select max(bucket)
from @table as previous
where previous.id < [current].id
and bucket <> 0
), 1)
from @table as [current]
where bucket = 0
-- return the results
select * from @table
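The two-pass idea (mark the rows that open a bucket, then derive every row's bucket number from those marks) can also be expressed as a flag plus a running total. The following is a minimal Python sketch of that idea, not a translation of the T-SQL above. The name bucket_numbers is an assumption for the example, and the input is assumed to be one user's timestamps already sorted ascending.

```python
from datetime import datetime, timedelta
from itertools import accumulate


def bucket_numbers(timestamps, gap=timedelta(minutes=30)):
    """timestamps: one user's event times, already sorted ascending."""
    # Pass 1: flag rows that open a bucket (the first row, or a row that
    # follows the previous one by more than `gap`).
    flags = [
        1 if i == 0 or ts - timestamps[i - 1] > gap else 0
        for i, ts in enumerate(timestamps)
    ]
    # Pass 2: a running total of the flags gives each row its bucket number.
    return list(accumulate(flags))


# Timestamps from the question's sample rows
times = [
    datetime(2009, 12, 1, 9, 15),
    datetime(2009, 12, 1, 9, 20),
    datetime(2009, 12, 1, 9, 25),
    datetime(2009, 12, 3, 14, 30),
    datetime(2009, 12, 3, 14, 32),
]
print(bucket_numbers(times))  # [1, 1, 1, 2, 2]
```

The same flag-and-running-total shape maps onto window functions in databases that support them, which avoids both the cursor and the working table.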