Question

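I have the following SQL statement, which works perfectly. I would like to see how it could be refactored so that it does not need RANK / PARTITION ... if that is possible.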

SELECT LogEntryId, FileId, CreatedOn, EventTypeId
FROM (SELECT a.LogEntryId, a.FileId, a.CreatedOn,  a.EventTypeId, 
        RANK() OVER (PARTITION BY ClientName ORDER BY a.CreatedOn DESC) AS MostRecentEventRank
    FROM LogEntries a
    WHERE (a.EventTypeId = 2 or a.EventTypeId = 4)) SubQuery
WHERE MostRecentEventRank = 1

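What is it trying to do?

Grab all the records in the table, grouped by client name and then ordered by most recently created.
Filter these to event types #2 (connect) or #4 (disconnect) only.
Now, for each client name, retrieve the most recent record.

In effect, this grabs the most recent event (either a connect or a disconnect) for each unique user in the table.

I do like RANK / PARTITION, but I am hoping to see whether it can be done without it.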

Answer 1

Yet another variation: select the clients, then use CROSS APPLY (... TOP (1) ... ORDER BY ...) to get the relevant entry for each one.

SELECT c.ClientName,r.LogEntryId, r.FileId, r.CreatedOn,  r.EventTypeId
FROM (
 SELECT DISTINCT ClientName
 FROM LogEntries
 WHERE EventTypeId IN (2,4)) as c
CROSS APPLY (
   SELECT TOP (1) a.LogEntryId, a.FileId, a.CreatedOn,  a.EventTypeId
   FROM LogEntries as a
   WHERE a.ClientName = c.ClientName
   AND a.EventTypeId IN (2,4)
   ORDER BY a.CreatedOn DESC) as r;

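Update

Talking about the performance of a T-SQL query without knowing the schema is nonsense. This query is perfectly optimal on a schema that is properly designed for its needs. Since the access is by ClientName and CreatedOn, even a simplistic schema needs to take this into consideration: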

CREATE TABLE LogEntries (
   LogEntryId int identity(1,1),
   FileID int,
   CreatedOn datetime,
   EventTypeID int,
   ClientName varchar(30)
);

create clustered index cdxLogEntries on LogEntries (
    ClientName, CreatedOn DESC);
go

Then let's load the table with some 2.4M rows:

declare @i int;
set @i = 0;

while @i < 1000
begin
    insert into LogEntries (FileId, CreatedOn, EventTypeId, ClientName)
    select cast(rand()*100 as int),
        dateadd(minute, -rand()*10000, getdate()),
        cast(rand() * 5 as int),
        'Client' + cast(@i as varchar(10))
        from master..spt_values;
    set @i = @i+1;
end

What time and IO do we get for the query on a warm cache, with set statistics io on; set statistics time on; enabled?

(410 row(s) affected)
Table 'LogEntries'. Scan count 411, logical reads 14354, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 1219 ms,  elapsed time = 1932 ms.

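1.9 seconds to get the data out of 2.4M entries on my laptop (which is 4 years old and has 1 GB of RAM). And there is still plenty of room for improvement in the schema design. Separating ClientName into a normalized table, with a trusted foreign key from LogEntries into it, would reduce the time significantly. A proper filtered index on EventTypeId IN (2,4) would also contribute. We have not even started to explore the possibilities of parallelism.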

This is SQL: performance is obtained on the drawing board of your schema, not in the text editor of your query.
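
As a rough illustration of the filtered-index suggestion, the sketch below adds a nonclustered index restricted to the two event types of interest. The index name ixLogEntriesConnectDisconnect and the choice of included columns are assumptions made for this example; they were not part of the measurements above. Filtered indexes require SQL Server 2008 or later.

```sql
-- Sketch only: index name and INCLUDE list are illustrative assumptions.
-- The index holds only connect (2) and disconnect (4) events, ordered so
-- that the newest entry per client comes first within each ClientName.
CREATE NONCLUSTERED INDEX ixLogEntriesConnectDisconnect
    ON LogEntries (ClientName, CreatedOn DESC)
    INCLUDE (LogEntryId, FileId, EventTypeId)
    WHERE EventTypeId IN (2, 4);
```

Whether the optimizer actually picks this index, and how much it helps, depends on the data and on the existing clustered index, so it is worth comparing the statistics io output before and after creating it.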

Answer 2

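Single table scan, no windowing function, a single GROUP BY, no problems with duplicate dates, and performance equal to the windowing functions, or even better on really large queries. (Update: I don't know how it performs compared to the TOP 1 WITH TIES / CROSS APPLY methods. Since it uses a scan, it might be slower in some cases.)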

SELECT
   LogEntryID = Convert(int, Substring(Packed, 9, 4)),
   FileID = Convert(int, Substring(Packed, 13, 4)),
   CreatedOn = Convert(datetime, Substring(Packed, 1, 8)),
   EventTypeID = Convert(int, Substring(Packed, 17, 4))
FROM
   (
      SELECT
         Packed = Max(
            Convert(binary(8), CreatedOn)
            + Convert(binary(4), LogEntryID)
            + Convert(binary(4), FileID)
            + Convert(binary(4), EventTypeID)
         )
      FROM LogEntries
      WHERE EventTypeID IN (2,4)
      GROUP BY ClientName
   ) X
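
To make the byte offsets in the query easier to follow, here is a minimal sketch that packs and unpacks a single row by hand. The variable names and sample values are hypothetical and chosen only for illustration; the DECLARE ... = syntax assumes SQL Server 2008 or later.

```sql
-- Hypothetical sample values, used only to illustrate the layout.
DECLARE @CreatedOn datetime = '2024-01-15T10:30:00';
DECLARE @LogEntryID int = 42, @FileID int = 7, @EventTypeID int = 4;

-- Pack: 8 bytes of datetime followed by three 4-byte ints (20 bytes total).
DECLARE @Packed binary(20) =
      Convert(binary(8), @CreatedOn)
    + Convert(binary(4), @LogEntryID)
    + Convert(binary(4), @FileID)
    + Convert(binary(4), @EventTypeID);

-- Unpack with the same 1-based offsets the query above uses.
SELECT
   CreatedOn   = Convert(datetime, Substring(@Packed, 1, 8)),
   LogEntryID  = Convert(int, Substring(@Packed, 9, 4)),
   FileID      = Convert(int, Substring(@Packed, 13, 4)),
   EventTypeID = Convert(int, Substring(@Packed, 17, 4));
```

Because CreatedOn occupies the leading bytes, Max() over the packed value picks the row with the latest CreatedOn, and the remaining columns simply travel along with it.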

If anyone wants to see this in action, here is a creation script:

USE tempdb
CREATE TABLE LogEntries (
   LogEntryID int not null identity(1,1),
   FileID int,
   CreatedOn datetime,
   EventTypeID int,
   ClientName varchar(30)
)

INSERT LogEntries VALUES (1, GetDate()-20, 2, 'bob')
INSERT LogEntries VALUES (1, GetDate()-19, 3, 'bob')
INSERT LogEntries VALUES (1, GetDate()-18, 4, 'bob')
INSERT LogEntries VALUES (1, GetDate()-17, 3, 'bob')
INSERT LogEntries VALUES (1, GetDate()-19.5, 2, 'anna')
INSERT LogEntries VALUES (1, GetDate()-18.5, 3, 'anna')
INSERT LogEntries VALUES (1, GetDate()-17.5, 4, 'anna')
INSERT LogEntries VALUES (1, GetDate()-16.5, 3, 'anna')

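Please note that this method takes advantage of the internal byte representation of the given data types having the same ordering as the values of the type. Packed data types such as float or decimal will NOT work: they would first need to be converted to something suitable, such as int, bigint, or character.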
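
One further caveat, offered as a small check rather than as part of the original method: for int, the byte order matches the value order only for non-negative values, because negative numbers are stored in two's complement and compare as larger binary values. In the query above the int columns only break ties on CreatedOn, so this matters only if those columns can be negative.

```sql
-- Binary comparison is byte-wise, so -1 (0xFFFFFFFF) sorts after 1 (0x00000001).
SELECT
   One         = Convert(binary(4), 1),    -- 0x00000001
   MinusOne    = Convert(binary(4), -1),   -- 0xFFFFFFFF
   BinaryOrder = CASE
                    WHEN Convert(binary(4), -1) > Convert(binary(4), 1)
                       THEN 'binary(-1) sorts after binary(1)'
                    ELSE 'binary(-1) sorts before binary(1)'
                 END;
```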

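Also, the new date and time data types in SQL 2008 have different representations that will not pack correctly for use with this method. I have not examined the Time data type yet, but for the Date data type: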

DECLARE @d date
SET @d ='99990101'
SELECT Convert(binary(3), @d) -- 0x6EB837

The actual value is 0x37B86E, so the bytes are stored in reverse order (the "zero" date is 0001-01-01).

Answer 3

You can use an exclusive left join:

select     cur.*
from       LogEntries cur
left join  LogEntries next
on         next.ClientName = cur.ClientName
           and next.EventTypeId in (2,4)
           and next.CreatedOn > cur.CreatedOn               
where      next.ClientName is null
           and cur.EventTypeId in (2,4)

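This joins the table to itself, searching for later rows in the on condition. In the where clause, you specify that no later row may exist. That way you filter out everything except the latest row for each client.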
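
The same "no later row exists" idea can also be written with NOT EXISTS instead of a left join. This is a sketch of an equivalent formulation, not the query from the answer above; the alias later is just a name chosen here.

```sql
-- Keep a row only if no later connect/disconnect row exists for the same client.
SELECT cur.LogEntryId, cur.FileId, cur.CreatedOn, cur.EventTypeId
FROM   LogEntries AS cur
WHERE  cur.EventTypeId IN (2, 4)
       AND NOT EXISTS (
             SELECT 1
             FROM   LogEntries AS later
             WHERE  later.ClientName = cur.ClientName
                    AND later.EventTypeId IN (2, 4)
                    AND later.CreatedOn > cur.CreatedOn
           );
```

As with the original RANK() query, both forms return more than one row for a client when several of its rows share the same latest CreatedOn.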

Answer 4

Here you go. It might be faster... not sure. Also, this assumes that ClientName + CreatedOn is unique.

;WITH MostRecent AS
(
   SELECT ClientName, Max(CreatedOn) AS CreatedOn
   FROM LogEntries
   WHERE EventTypeID IN (2,4)
   GROUP BY ClientName
)
SELECT L.LogEntryId, L.FileId, L.CreatedOn, L.EventTypeId  -- qualified: CreatedOn exists in both L and R
FROM LogEntries L
INNER JOIN MostRecent R ON L.ClientName = R.ClientName AND L.CreatedOn = R.CreatedON

Note: I did not test this, so there may be typos.

Can this SQL statement be refactored to not use RANK / PARTITION?
