Question

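I have a table costhistory with fields id, invid, vendorid, cost, timestamp, chdeleted. It looks like it is populated by a trigger every time a vendor updates their price list.

It contains redundant records, because a row is inserted whether or not the price has changed since the last record. Example: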

id | invid | vendorid | cost | timestamp | chdeleted  
1 | 123 | 1 | 100 | 1/1/01 | 0  
2 | 123 | 1 | 100 | 1/2/01 | 0  
3 | 123 | 1 | 100 | 1/3/01 | 0  
4 | 123 | 1 | 500 | 1/4/01 | 0  
5 | 123 | 1 | 500 | 1/5/01 | 0  
6 | 123 | 1 | 100 | 1/6/01 | 0

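I would like to remove the records with ids 2, 3, and 5, since they do not reflect any change since the last price update.

I'm sure it can be done, though it might take several steps. Just for clarity, this table has swelled to 100 GB and contains 600M rows. I am confident that a proper cleanup will reduce this table's size by 90% - 95%.

Thanks!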

Answer 1

The approach you take will vary depending on the database you are using. For SQL Server 2005+, the following query should give you the records to be removed:

select id 
from (
    select id, Rank() over (Partition BY invid, vendorid, cost order by timestamp) as Rank
    from costhistory 
) tmp
where Rank > 1

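You can then delete them:

delete from costhistory 
where id in (
    select id 
    from (
        select id, Rank() over (Partition BY invid, vendorid, cost order by timestamp) as Rank
        from costhistory 
    ) tmp
    where Rank > 1  -- keep the first row of each partition; without this filter every row is deleted
)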
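
One caveat with partitioning by cost: in the sample data, row 6 returns to a cost of 100, so it lands in the same partition as rows 1 to 3 and gets rank 4, even though it records a real price change. A minimal sketch of an alternative compares each row with the row immediately before it for the same invid and vendorid. It assumes SQL Server 2012 or later (where LAG is available), uses id to break timestamp ties, and ignores chdeleted:

```sql
-- Sketch only: assumes SQL Server 2012+ for LAG().
-- A row is redundant when its cost equals the cost of the previous row
-- (ordered by timestamp, then id) for the same invid and vendorid.
select id
from (
    select id,
           cost,
           lag(cost) over (partition by invid, vendorid
                           order by timestamp, id) as prev_cost
    from costhistory
) tmp
where cost = prev_cost;  -- the first row per item has prev_cost NULL and is kept
```

On the sample rows in the question this selects ids 2, 3, and 5 and leaves row 6 alone.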

Answer 2

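I would suggest that you recreate the table using a group by query. Also, I assume the "id" column is not used in any other tables. If it is, you will need to fix those tables as well.

Deleting such a large number of records is likely to take a very long time.

The query would look like: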

insert into newversionoftable(invid, vendorid, cost, timestamp, chdeleted)
    -- collapses rows that are identical in every listed column
    select invid, vendorid, cost, timestamp, chdeleted
    from costhistory
    group by invid, vendorid, cost, timestamp, chdeleted
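
Once the copy has finished, the new table still has to take the place of the old one. A minimal sketch, assuming SQL Server and using the hypothetical name costhistory_old for the retired table. Indexes, constraints, and permissions from the original are not carried over by a rename and would need to be recreated:

```sql
-- Sketch only: costhistory_old is an illustrative name.
-- Swap the rebuilt table into place after the insert has completed.
begin transaction;
exec sp_rename 'costhistory', 'costhistory_old';
exec sp_rename 'newversionoftable', 'costhistory';
commit transaction;
-- Drop costhistory_old only after the new table has been verified.
```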

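If you opt for a delete, I would suggest:

(1) Fix the code first, so no more duplicates go in.
(2) Determine the duplicate ids and place them in a separate table.
(3) Delete in batches.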
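
Steps (2) and (3) might look roughly like the following in T-SQL. This is a sketch under assumptions: the staging table name ids_to_delete, its bigint key type, and the batch size of 10000 are all illustrative and should be adjusted to the real schema and workload.

```sql
-- Sketch only: ids_to_delete, bigint, and the batch size are assumptions.
-- Step 2: stage the duplicate ids in a separate table.
create table ids_to_delete (id bigint primary key);

insert into ids_to_delete (id)
select id
from (
    select id,
           row_number() over (partition by invid, vendorid, cost, timestamp, chdeleted
                              order by id) as seqnum
    from costhistory
) t
where seqnum > 1;

-- Step 3: delete in batches so each transaction stays short.
declare @rows int;
set @rows = 1;
while @rows > 0
begin
    delete top (10000) c
    from costhistory c
    join ids_to_delete d on d.id = c.id;

    set @rows = @@rowcount;  -- the loop ends when no staged ids remain in costhistory
end
```

Staging the ids first means the expensive window query runs once, and each batch afterwards is a simple keyed join.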

To find the duplicate ids, use something like:

    select *
    from (select id,
                 row_number() over (partition by invid, vendorid, cost, timestamp, chdeleted order by timestamp) as seqnum
          from costhistory
         ) t
    where seqnum > 1

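If you want to keep the most recent version, use "timestamp desc" in the order by clause. Because timestamp is also in the partition by list here, the ordering only has an effect if timestamp is removed from that list.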

Removing redundant SQL price cost records
