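I have log data that I am trying to backfill as much as possible to help improve analytics.
The log data contains a SessionId (the SessionId created by the browser), the name of the logged-in user (if they are logged in), and a LogTime.
I am trying to take all related rows, meaning rows of the same session that are within 24 hours of each other, and get the first date of that group, the last date of that group, and fill every NULL or empty name in the group with the first name that is neither NULL nor empty.
For example, if I have the following data: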
--Id  SessionId  Name      LogTime
--1   1          ''        2018-01-01 00:00
--2   1          LargeOne  2018-01-01 12:00
--3   2          Two       2018-01-01 13:00
--4   3          NULL      2018-01-02 00:00
--5   3          ''        2018-01-03 00:00
--6   1          One       2018-01-03 00:00
--7   2          ''        2018-01-03 00:00
--8   2          LargeTwo  2018-01-04 00:00
--9   1          ''        2018-01-04 00:00
I would like to process the data so it looks like this:
--Id SessionId Name LogTime StartTime EndTime
--1 1 LargeOne 2018-01-01 00:00 2018-01-01 00:00 2018-01-01 12:00
--2 1 LargeOne 2018-01-01 12:00 2018-01-01 00:00 2018-01-01 12:00
--3 2 Two 2018-01-01 13:00 2018-01-01 13:00 2018-01-01 13:00
--4 3 NULL 2018-01-02 00:00 2018-01-02 00:00 2018-01-03 00:00
--5 3 NULL 2018-01-03 00:00 2018-01-02 00:00 2018-01-03 00:00
--6 1 One 2018-01-03 00:00 2018-01-03 00:00 2018-01-04 00:00
--7 2 LargeTwo 2018-01-03 00:00 2018-01-03 00:00 2018-01-04 00:00
--8 2 LargeTwo 2018-01-04 00:00 2018-01-03 00:00 2018-01-04 00:00
--9 1 One 2018-01-04 00:00 2018-01-03 00:00 2018-01-04 00:00
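Ids 1 and 2 are in the same session and within range of each other (24 hours), so they form one set. Note that Id 1 has no name but Id 2 does, and because they are in the same set, the name is backfilled. Ids 6 and 9 are also in session 1 but are not within 24 hours of the first set, so they form a new set. Even though rows from other sessions appear between Ids 6 and 9, they are still the same session and within range of each other, so they belong to that same new set.
I think that covers the problem; now for the solutions I have tried. To find and backfill the name I tried using: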
SELECT Id,SessionId,
FIRST_VALUE(Name) OVER (PARTITION BY SessionId ORDER BY CASE WHEN Name IS NULL or Name='' then 0 ELSE 1 END DESC,Id) Name,
LogTime
FROM #RawData
ORDER BY Id
This produces:
--Id SessionId Name LogTime
--1 1 LargeOne 2018-01-01 00:00
--2 1 LargeOne 2018-01-01 12:00
--3 2 Two 2018-01-01 13:00
--4 3 NULL 2018-01-02 00:00
--5 3 NULL 2018-01-03 00:00
--6 1 LargeOne 2018-01-03 00:00
--7 2 Two 2018-01-03 00:00
--8 2 Two 2018-01-04 00:00
--9 1 LargeOne 2018-01-04 00:00
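This almost works, but it does not take the date range into account.
So I did a lot of searching on how to get groups based on SessionId and a date range, and I came up with this: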
;WITH ProcessTable1 AS
(
SELECT Id,SessionId,Name,LogTime,
PreviousLogTimeInRange = CASE WHEN LAG(LogTime, 1) OVER (partition by SessionId ORDER BY LogTime) between DATEADD(day, -1, LogTime) and LogTime
THEN 0 ELSE 1 END,
NextLogTimeInRange = CASE WHEN Lead(LogTime,1) OVER (partition by SessionId ORDER BY LogTime) between LogTime and DATEADD(day, 1, LogTime)
THEN 0 ELSE 1 END
FROM #RawData
),
ProcessTable2 AS
(
SELECT Id, Name, SessionId, LogTime, PreviousLogTimeInRange,
NextLogTime = case when NextLogTimeInRange = 0 then LEAD(LogTime, 1) OVER (partition by SessionId ORDER BY LogTime) else LogTime end
FROM ProcessTable1 WHERE 1 IN (PreviousLogTimeInRange, NextLogTimeInRange)
)
SELECT Id,SessionId,
FIRST_VALUE(Name) OVER (PARTITION BY SessionId ORDER BY CASE WHEN Name IS NULL or Name = '' then 0 ELSE 1 END DESC, Id) Name,
LogTime, NextLogTime
FROM ProcessTable2
--WHERE PreviousLogTimeInRange = 1
ORDER BY id;
This produces:
--Id SessionId Name LogTime NextLogTime
--1 1 LargeOne 2018-01-01 00:00 2018-01-01 12:00
--2 1 LargeOne 2018-01-01 12:00 2018-01-01 12:00
--3 2 Two 2018-01-01 13:00 2018-01-01 13:00
--4 3 NULL 2018-01-02 00:00 2018-01-03 00:00
--5 3 NULL 2018-01-03 00:00 2018-01-03 00:00
--6 1 LargeOne 2018-01-03 00:00 2018-01-04 00:00
--7 2 Two 2018-01-03 00:00 2018-01-04 00:00
--8 2 Two 2018-01-04 00:00 2018-01-04 00:00
--9 1 LargeOne 2018-01-04 00:00 2018-01-04 00:00
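So close, but I still need the StartTime, and to be honest I am not 100% sure this will always do what I want.
The last query is partly based on what I found in SQL Query to group items by time, but only if near each other?
If anyone is willing to lend a hand here, I would be forever grateful!
- EDIT -
If anyone wants to take a crack at it, I have created some sample data.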
IF OBJECT_ID('tempdb..#RawData') IS NOT NULL DROP TABLE #RawData
GO
Create Table #RawData
(
Id INT IDENTITY,
SessionId INT NOT NULL,
Name NVARCHAR(50) NULL,
LogTime DATETIME NOT NULL
)
INSERT INTO #RawData(SessionId,Name,LogTime)
VALUES
(1, '', '2018-01-01 00:00'),
(1, 'LargeOne', '2018-01-01 12:00'),
(2, 'Two', '2018-01-01 13:00'),
(3, NULL, '2018-01-02 00:00'),
(3, '', '2018-01-03 00:00'),
(1, 'One', '2018-01-03 00:00'),
(2, '', '2018-01-03 00:00'),
(2, 'LargeTwo', '2018-01-04 00:00'),
(1, '', '2018-01-04 00:00')
SELECT * FROM #RawData
Answer 0: (score: 0)
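You basically want LAG(. . . IGNORE NULLS), but SQL Server does not support that (IGNORE NULLS only became available in SQL Server 2022).
Instead, you can use a cumulative max/min on id. Here is an idea: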
-- Forward fill: grp is the id of the most recent row (by id) with a non-empty name.
select t.*,
       max(name) over (partition by sessionid, grp) as filled_name
from (select t.*,
             max(case when name <> '' then id end) over (partition by sessionid order by id) as grp
      from #RawData t
     ) t;
This fills values "forward" but not backward. Filling backward has some complications of its own, but you can do it with similar logic:
-- Prefer the next non-empty name (grpafter); fall back to the previous one (grpbefore).
select t.*,
       (case when max(name) over (partition by sessionid, grpafter) <> ''
             then max(name) over (partition by sessionid, grpafter)
             else max(name) over (partition by sessionid, grpbefore)
        end) as filled_name
from (select t.*,
             min(case when name <> '' then id end) over (partition by sessionid order by id desc) as grpafter,
             max(case when name <> '' then id end) over (partition by sessionid order by id asc) as grpbefore
      from #RawData t
     ) t;
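Neither query above takes the 24-hour range into account. As a rough sketch of how the grouping could be added, the query below flags a row as the start of a new group when the previous row of the same session is more than 24 hours older, then turns a running sum of those flags into a group number. This is only one reading of the requirement: it assumes groups are chained (each row only has to be within 24 hours of the row before it, so a group can span more than 24 hours in total). The names Flagged, Grouped, IsGroupStart and GroupNo are made up for illustration.

```sql
WITH Flagged AS
(
    -- IsGroupStart = 1 when there is no previous row in the session,
    -- or the previous row is more than 24 hours older than this one.
    SELECT Id, SessionId, Name, LogTime,
           IsGroupStart = CASE
               WHEN LAG(LogTime) OVER (PARTITION BY SessionId ORDER BY LogTime, Id)
                    >= DATEADD(HOUR, -24, LogTime)
               THEN 0 ELSE 1 END
    FROM #RawData
),
Grouped AS
(
    -- A running sum of the start flags numbers the groups within each session.
    SELECT Id, SessionId, Name, LogTime,
           GroupNo = SUM(IsGroupStart) OVER (PARTITION BY SessionId
                                             ORDER BY LogTime, Id
                                             ROWS UNBOUNDED PRECEDING)
    FROM Flagged
)
SELECT Id, SessionId,
       -- Rows with a real name sort first, so FIRST_VALUE returns the earliest
       -- non-empty name in the group (or NULL when the group has none).
       Name = FIRST_VALUE(NULLIF(Name, '')) OVER (
                  PARTITION BY SessionId, GroupNo
                  ORDER BY CASE WHEN NULLIF(Name, '') IS NULL THEN 1 ELSE 0 END,
                           LogTime, Id),
       LogTime,
       StartTime = MIN(LogTime) OVER (PARTITION BY SessionId, GroupNo),
       EndTime   = MAX(LogTime) OVER (PARTITION BY SessionId, GroupNo)
FROM Grouped
ORDER BY Id;
```

Run against the sample #RawData rows from the question, this should group Ids 1 and 2, Ids 6 and 9, Ids 7 and 8, and Ids 4 and 5, with Id 3 on its own. If a group must never span more than 24 hours from its first row, the start flag would need a different definition.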
Answer 1: (score: 0)
IF OBJECT_ID('tempdb..#RawData') IS NOT NULL DROP TABLE #RawData
GO
Create Table #RawData
(
Id INT IDENTITY,
SessionId INT NOT NULL,
Name NVARCHAR(50) NULL,
LogTime DATETIME NOT NULL
)
INSERT INTO #RawData(SessionId,Name,LogTime)
VALUES
(1, '', '2018-01-01 00:00'),
(1, 'LargeOne', '2018-01-01 12:00'),
(2, 'Two', '2018-01-01 13:00'),
(3, NULL, '2018-01-02 00:00'),
(3, '', '2018-01-03 00:00'),
(1, 'One', '2018-01-03 00:00'),
(2, '', '2018-01-03 00:00'),
(2, 'LargeTwo', '2018-01-04 00:00'),
(1, '', '2018-01-04 00:00')
go
with my_sql as (
SELECT t1.SessionId,
t1.Name,
t1.LogTime ,
(
SELECT min( t2.LogTime )
from #RawData t2
where t1.SessionId = t2.SessionId
and cast( t1.LogTime as date ) >= cast( t2.LogTime as date )
and cast( t1.LogTime as date ) <= dateadd(day, 1, t2.LogTime)
) as StartTime
FROM #RawData t1
)
--select * from my_sql
SELECT ms.SessionId,
( select top 1 t.name
from my_sql t
where ms.SessionId = t.SessionId
and cast(ms.StartTime as date ) = cast(t.StartTime as date)
and t.name <> ''
order by t.LogTime -- make TOP 1 deterministic: earliest non-empty name in the group
) as name,
ms.LogTime,
ms.StartTime,
( select max(t.LogTime)
from my_sql t
where ms.SessionId = t.SessionId
and cast(ms.StartTime as date ) = cast(t.StartTime as date)
) as endTime
FROM my_sql ms