Question

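I am trying to optimize a query that takes a long time in production. The goal is to find duplicate records based on matching field values and then delete them. The current query is a self-join: an inner join on t1.col1 = t2.col1, followed by a WHERE clause that compares the remaining values.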

select * from table t1 
inner join table t2 on t1.col1 = t2.col1
where t1.col2 = t2.col2 ...

What is the best way to do this? Or do both forms perform the same, given the same indexes? Perhaps:

select * from table t1, table t2
where t1.col1 = t2.col1 and t1.col2 = t2.col2 ...

This table has 100M+ rows.

MS SQL, SQL Server 2008 Enterprise

select distinct t2.id
    from table1 t1 with (nolock)
    inner join table1 t2 with (nolock) on  t1.ckid=t2.ckid
    left join table2 t3 on t1.cid = t3.cid and t1.typeid = t3.typeid
    where 
    t2.id > @Max_id and
    t2.timestamp > t1.timestamp and
    t2.rid = 2 and
    isnull(t1.col1,'') = isnull(t2.col1,'') and 
    isnull(t1.cid,-1) = isnull(t2.cid,-1) and
    isnull(t1.rid,-1) = isnull(t2.rid,-1) and 
    isnull(t1.typeid,-1) = isnull(t2.typeid,-1) and
    isnull(t1.cktypeid,-1) = isnull(t2.cktypeid,-1) and
    isnull(t1.oid,'') = isnull(t2.oid,'') and
    isnull(t1.stypeid,-1) = isnull(t2.stypeid,-1)  

    and (
            (
                t3.uniqueoid = 1
            )
            or
            (
                t3.uniqueoid is null and 
                isnull(t1.col1,'') = isnull(t2.col1,'') and 
                isnull(t1.col2,'') = isnull(t2.col2,'') and
                isnull(t1.rdid,-1) = isnull(t2.rdid,-1) and 
                isnull(t1.stid,-1) = isnull(t2.stid,-1) and
                isnull(t1.huaid,-1) = isnull(t2.huaid,-1) and
                isnull(t1.lpid,-1) = isnull(t2.lpid,-1) and
                isnull(t1.col3,-1) = isnull(t2.col3,-1) 
            )
    )

Answer 1

Why use a self-join at all? This is an aggregation problem.

Hopefully you have an index on col1, col2, ...
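
A minimal sketch of such an index, assuming a hypothetical table dbo.MyTable with a key column KeyCol (these names, and the index name, are placeholders that do not come from the question). With col1 and col2 as the leading key columns, a GROUP BY on those columns may be able to read rows in index order instead of sorting the whole table:

```sql
-- Hypothetical names throughout: dbo.MyTable, KeyCol, IX_MyTable_col1_col2.
-- KeyCol is included so MIN(KeyCol) can be answered from the index alone.
CREATE NONCLUSTERED INDEX IX_MyTable_col1_col2
    ON dbo.MyTable (col1, col2)
    INCLUDE (KeyCol);
```

Building this on a 100M+ row table is itself expensive, so check the existing indexes and the query plan first.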

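-- Lists each duplicated (col1, col2) group and the key of the row to keep.
-- To turn this into the DELETE, the subquery must return only MIN(KeyCol)
-- and drop the HAVING clause; otherwise rows from groups that have no
-- duplicates would be deleted as well.
--DELETE table
--WHERE KeyCol NOT IN (
select
    MIN(KeyCol) AS RowToKeep,
    col1, col2
from
    table
GROUP BY
    col1, col2
HAVING
   COUNT(*) > 1
--)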

However, this will take some time. Have a look at bulk delete techniques.
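
One common bulk delete technique is to delete in small batches so that each transaction stays short. The sketch below assumes a hypothetical table dbo.MyTable with a unique key KeyCol and duplicates defined by col1 and col2. The batch size is an arbitrary starting point, not a recommendation from this answer:

```sql
-- Hypothetical names: dbo.MyTable, KeyCol, col1, col2.
DECLARE @BatchSize int = 10000;  -- assumed value; tune for log size and locking
DECLARE @Rows int = 1;

WHILE @Rows > 0
BEGIN
    -- Delete up to @BatchSize rows for which a row with the same
    -- (col1, col2) and a smaller KeyCol exists. The row with the
    -- smallest KeyCol in each group is never deleted.
    DELETE TOP (@BatchSize) t
    FROM dbo.MyTable AS t
    WHERE EXISTS (
        SELECT 1
        FROM dbo.MyTable AS k
        WHERE k.col1 = t.col1
          AND k.col2 = t.col2
          AND k.KeyCol < t.KeyCol
    );

    SET @Rows = @@ROWCOUNT;  -- stop once a pass deletes nothing
END;
```

Note that the plain = comparisons do not match NULL to NULL. If NULLs should count as equal, as in the question's isnull(...) comparisons, the predicates need the same treatment.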

Answer 2

You can use ROW_NUMBER() to find duplicate rows within a single table.

You can take a look here.
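
A minimal sketch of the ROW_NUMBER() approach, assuming a hypothetical table dbo.MyTable with a unique key KeyCol and duplicates defined by col1 and col2 (none of these names come from the question):

```sql
-- Hypothetical names: dbo.MyTable, KeyCol, col1, col2.
WITH Ranked AS (
    SELECT KeyCol, col1, col2,
           -- Number the rows inside each (col1, col2) group, lowest KeyCol first.
           ROW_NUMBER() OVER (PARTITION BY col1, col2 ORDER BY KeyCol) AS rn
    FROM dbo.MyTable
)
SELECT KeyCol, col1, col2
FROM Ranked
WHERE rn > 1;  -- every row except the first of its group is a duplicate

-- To delete instead of list, replace the final SELECT with:
--   DELETE FROM Ranked WHERE rn > 1;
```

PARTITION BY places NULLs in the same group, so this form should not need the isnull(...) wrappers that the self-join in the question uses.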

Answer 3

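The two approaches you give should be equivalent. I think most SQL engines will do exactly the same thing in both cases.

By the way, this won't work as written. You need at least one field that differs, or every record will match itself.

You might want to try something more like this: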

select col1, col2, col3
from table
group by col1, col2, col3
having count(*)>1
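
If you do keep the self-join, the point above about needing one field that differs can be sketched as follows, assuming a hypothetical table dbo.MyTable with a unique id column:

```sql
-- Hypothetical names: dbo.MyTable, id.
SELECT DISTINCT t2.id
FROM dbo.MyTable AS t1
INNER JOIN dbo.MyTable AS t2
    ON  t1.col1 = t2.col1
    AND t1.col2 = t2.col2
    AND t1.col3 = t2.col3
WHERE t1.id < t2.id;  -- a row can never pair with itself
```

This returns the id of every row except the lowest id in each group of duplicates. The full query in the question gets the same effect from t2.timestamp > t1.timestamp.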

Answer 4

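For a table with 100M+ rows, using GROUP BY together with a holding table performs better, even though it translates into four queries.

Step 1: Create a holding table of the duplicated keys: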

SELECT col1, col2, col3=count(*)
INTO holdkey
FROM t1
GROUP BY col1, col2
HAVING count(*) > 1

Step 2: Push all the duplicate entries into holddups. This is required for step 4.

SELECT DISTINCT t1.*
INTO holddups
FROM t1, holdkey
WHERE t1.col1 = holdkey.col1
AND t1.col2 = holdkey.col2

Step 3: Delete the duplicate rows from the original table.

DELETE t1
FROM t1, holdkey
WHERE t1.col1 = holdkey.col1
AND t1.col2 = holdkey.col2

Step 4: Put the unique rows back into the original table. For example:

INSERT t1 SELECT * FROM holddups

Answer 5

To detect duplicates, you don't need a join:

SELECT col1, col2
FROM table
GROUP BY col1, col2
HAVING COUNT(*) > 1

That should be much faster.

Answer 6

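In my experience, SQL Server performs very poorly with OR conditions. It may be the use of table3, rather than the self-join, that causes the poor performance. But without seeing the plan, I can't be sure.

In that case, splitting the query into two may help: one with the WHERE condition t3.uniqueoid = 1, and one with the other WHERE conditions on table3, then using UNION ALL to append one to the other.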
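
A sketch of that split, applied to the query from the question. The table hints are left out for brevity, and the type and value of @Max_id are assumptions, because the question does not show its declaration:

```sql
DECLARE @Max_id int = 0;  -- assumption: not shown in the question

-- Branch 1: table2 has uniqueoid = 1 for the pair (cid, typeid).
-- Filtering on t3.uniqueoid = 1 makes the LEFT JOIN act as an INNER JOIN.
select distinct t2.id
    from table1 t1
    inner join table1 t2 on t1.ckid = t2.ckid
    inner join table2 t3 on t1.cid = t3.cid and t1.typeid = t3.typeid
    where
    t3.uniqueoid = 1 and
    t2.id > @Max_id and
    t2.timestamp > t1.timestamp and
    t2.rid = 2 and
    isnull(t1.col1,'') = isnull(t2.col1,'') and
    isnull(t1.cid,-1) = isnull(t2.cid,-1) and
    isnull(t1.rid,-1) = isnull(t2.rid,-1) and
    isnull(t1.typeid,-1) = isnull(t2.typeid,-1) and
    isnull(t1.cktypeid,-1) = isnull(t2.cktypeid,-1) and
    isnull(t1.oid,'') = isnull(t2.oid,'') and
    isnull(t1.stypeid,-1) = isnull(t2.stypeid,-1)

union all

-- Branch 2: no table2 row, or one whose uniqueoid is NULL,
-- plus the additional column comparisons from the original OR branch.
select distinct t2.id
    from table1 t1
    inner join table1 t2 on t1.ckid = t2.ckid
    left join table2 t3 on t1.cid = t3.cid and t1.typeid = t3.typeid
    where
    t3.uniqueoid is null and
    t2.id > @Max_id and
    t2.timestamp > t1.timestamp and
    t2.rid = 2 and
    isnull(t1.col1,'') = isnull(t2.col1,'') and
    isnull(t1.cid,-1) = isnull(t2.cid,-1) and
    isnull(t1.rid,-1) = isnull(t2.rid,-1) and
    isnull(t1.typeid,-1) = isnull(t2.typeid,-1) and
    isnull(t1.cktypeid,-1) = isnull(t2.cktypeid,-1) and
    isnull(t1.oid,'') = isnull(t2.oid,'') and
    isnull(t1.stypeid,-1) = isnull(t2.stypeid,-1) and
    isnull(t1.col2,'') = isnull(t2.col2,'') and
    isnull(t1.rdid,-1) = isnull(t2.rdid,-1) and
    isnull(t1.stid,-1) = isnull(t2.stid,-1) and
    isnull(t1.huaid,-1) = isnull(t2.huaid,-1) and
    isnull(t1.lpid,-1) = isnull(t2.lpid,-1) and
    isnull(t1.col3,-1) = isnull(t2.col3,-1)
```

The same t2.id can come out of both branches when it pairs with different t1 rows. That is harmless if the result only feeds a DELETE ... WHERE id IN (...). Use UNION instead of UNION ALL if the list itself must be unique.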

Best self-join technique when checking for duplicates
