De-duplicating a table based on timestamps and repeating patterns

Asked: 2011-04-02 19:20:47

Tags: sql tsql sql-server-2008 duplicates sequence

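OK. First, let me apologize if this has already been covered. I looked, but none of the solutions I found addressed the specifics of my problem.

I have a table of more than 160 million rows that tracks employee/server status over time. I want to create a subset of this data that removes the duplicates occurring throughout, but keeps the order of the changes as they happen. For most employees, this will reduce 700 rows (and growing) to 1.

Here is a simplified example of what I am after: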

Given:

RowID  Employee  Server  Timestamp
-----  --------  ------  ---------
5      E000001   Serv-B  May01
4      E000001   Serv-A  Apr01
3      E000001   Serv-B  Mar01
2      E000001   Serv-A  Feb01
1      E000001   Serv-A  Jan01

Doing a "Min(Timestamp) Group By Employee, Server" would yield:
Employee Server  Timestamp
-------- ------  ---------
E000001  Serv-B  Mar01
E000001  Serv-A  Jan01

What I need is:
Employee Server  Timestamp
-------- ------  ---------
E000001  Serv-B  May01
E000001  Serv-A  Apr01
E000001  Serv-B  Mar01
E000001  Serv-A  Jan01

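The table and the process that feeds it do not belong to our group, so I cannot influence a solution there, and I would rather not be stuck keeping a copy of the whole thing. Given the size of the table, a cursor/RBAR approach is not practical. If backed into a corner I could write an application to do this, but I wondered whether any of the gods on SQoLympus had some wisdom for doing it in a stored procedure. Thanks in advance!

Edit: This is SQL Server 2008. Sorry for not mentioning it.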

2 Answers:

Answer 0: (score: 1)

If this is SQL Server (and assuming I have understood your requirement correctly):

/*Set up test table*/
DECLARE @T TABLE (
  RowID       INT,
  Employee    CHAR(7),
  [Server]    CHAR(6),
  [timestamp] DATETIME );

INSERT INTO @T
SELECT 5,'E000001','Serv-B',  '20010501' UNION ALL
SELECT 4,'E000001','Serv-A',  '20010401' UNION ALL
SELECT 3,'E000001','Serv-B',  '20010301' UNION ALL
SELECT 2,'E000001','Serv-A',  '20010201' UNION ALL
SELECT 1,'E000001','Serv-A',  '20010101';

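/* Grp = overall row number minus per-server row number; it stays constant
   within an unbroken run of rows on the same server. */
WITH cte
     AS (SELECT ROW_NUMBER() OVER (PARTITION BY Employee ORDER BY RowID) -
                ROW_NUMBER() OVER (PARTITION BY Employee, [Server]
                                       ORDER BY RowID) AS Grp,
                *
         FROM   @T),
     cte2
     /* [Server] must be part of the partition: runs on different servers can
        share a Grp value (for the sequence A, B, A the values are 0, 1, 1). */
     AS (SELECT *,
                ROW_NUMBER() OVER (PARTITION BY Employee, [Server], Grp
                                       ORDER BY RowID) AS Rn
         FROM   cte)

/* Edit: Actually - You want a SELECT not a DELETE I think?
DELETE FROM cte2 WHERE  Rn > 1*/

/* Keep only the first row of each run. */
SELECT RowID, Employee, [Server], [timestamp]
FROM   cte2
WHERE  Rn = 1
ORDER  BY RowID DESC;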
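
To see why subtracting the two row numbers identifies each run, it can help to look at the intermediate values. The following is a minimal sketch against the same sample data; the column aliases RnEmployee and RnServer are names made up here for illustration, not part of the answer above.

```sql
/* Same sample data as above, reloaded so this batch runs on its own. */
DECLARE @T TABLE (
  RowID       INT,
  Employee    CHAR(7),
  [Server]    CHAR(6),
  [timestamp] DATETIME );

INSERT INTO @T (RowID, Employee, [Server], [timestamp])
VALUES (5, 'E000001', 'Serv-B', '20010501'),
       (4, 'E000001', 'Serv-A', '20010401'),
       (3, 'E000001', 'Serv-B', '20010301'),
       (2, 'E000001', 'Serv-A', '20010201'),
       (1, 'E000001', 'Serv-A', '20010101');

/* RnEmployee counts every row for the employee; RnServer counts only rows
   for that employee on that server. Their difference (Grp) is constant for
   consecutive rows on the same server and changes once the run is broken. */
SELECT RowID,
       [Server],
       ROW_NUMBER() OVER (PARTITION BY Employee ORDER BY RowID) AS RnEmployee,
       ROW_NUMBER() OVER (PARTITION BY Employee, [Server]
                              ORDER BY RowID)                  AS RnServer,
       ROW_NUMBER() OVER (PARTITION BY Employee ORDER BY RowID) -
       ROW_NUMBER() OVER (PARTITION BY Employee, [Server]
                              ORDER BY RowID)                  AS Grp
FROM   @T
ORDER  BY RowID;

/* Result for the sample data:
   RowID  Server  RnEmployee  RnServer  Grp
   1      Serv-A  1           1         0
   2      Serv-A  2           2         0
   3      Serv-B  3           1         2
   4      Serv-A  4           3         1
   5      Serv-B  5           2         3   */
```

Rows 1 and 2 share (Serv-A, Grp 0), so they form one run and only the first is kept; every other row is a run of its own. Note that Grp is only meaningful together with Employee and Server: for a sequence such as Serv-A, Serv-B, Serv-A the Grp values are 0, 1, 1, so two different runs carry the same number.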

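Answer 1: (score: 0)

You did not say which database you are using, but if this is Oracle, you can use the lag/lead analytic functions to refer to the previous or next row.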

select employee, server, timestamp
from
   (select employee, server, timestamp,
           -- previous server for the same employee, in time order
           lag(server) over (partition by employee order by timestamp) prev_server
    from your_table  -- placeholder for the real table name
   )
-- keep each employee's first row (no previous row) and every change of server
where prev_server is null or server <> prev_server
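
Since the question was later clarified to be about SQL Server 2008, which does not have LAG (it was added in SQL Server 2012), the same previous-row comparison could be approximated with ROW_NUMBER and a self-join. This is only a sketch under assumptions: the names numbered, cur, prev and Seq are made up here, timestamps are assumed to be unique per employee, and performance on a 160-million-row table has not been tested.

```sql
/* Sample data from the question, reloaded so this batch runs on its own. */
DECLARE @T TABLE (
  RowID       INT,
  Employee    CHAR(7),
  [Server]    CHAR(6),
  [timestamp] DATETIME );

INSERT INTO @T (RowID, Employee, [Server], [timestamp])
VALUES (5, 'E000001', 'Serv-B', '20010501'),
       (4, 'E000001', 'Serv-A', '20010401'),
       (3, 'E000001', 'Serv-B', '20010301'),
       (2, 'E000001', 'Serv-A', '20010201'),
       (1, 'E000001', 'Serv-A', '20010101');

/* Number each employee's rows in time order, then join every row to the
   row numbered one lower (its predecessor), emulating LAG. */
WITH numbered
     AS (SELECT Employee,
                [Server],
                [timestamp],
                ROW_NUMBER() OVER (PARTITION BY Employee
                                       ORDER BY [timestamp]) AS Seq
         FROM   @T)
SELECT cur.Employee,
       cur.[Server],
       cur.[timestamp]
FROM   numbered AS cur
       LEFT JOIN numbered AS prev
              ON prev.Employee = cur.Employee
             AND prev.Seq = cur.Seq - 1
WHERE  prev.Seq IS NULL                 /* first row for the employee */
    OR prev.[Server] <> cur.[Server]    /* server changed since previous row */
ORDER  BY cur.Employee, cur.[timestamp] DESC;
```

For the sample data this keeps the May01, Apr01, Mar01 and Jan01 rows and drops Feb01, which matches the result asked for in the question.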