SQL query to eliminate similar entries

Time: 2011-08-19 18:15:00

Tags: sql algorithm duplicates combinations

I am working on a problem in SQL Server 2008.

I have a table with a primary key and six value columns:

PK INT
dOne SmallINT
dTwo SmallINT
dThree SmallINT
dFour SmallINT
dFive SmallINT
dSix SmallINT

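The table contains about a million records. It is worth noting that the value in column n + 1 is greater than the value in column n, e.g. 97, 98, 99, 120, 135. I am trying to eliminate all rows that have five values in common with another row (ignoring PK), i.e.: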

76, 89, 99, 102, 155, 122
11, 89, 99, 102, 155, 122
89, 99, 102, 155, 122, 130

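In this case the algorithm should start at the first row and delete the second and third rows, because each of them shares five values with it. The first row remains.

I tried to brute-force a solution, but finding all the duplicates of just the first record takes 25 seconds, which means processing the whole table would take far too long (this needs to be a repeatable process).

I am quite new to SQL, but this is what I came up with (I have tried several solutions, none of them adequate; this is the latest attempt):

(I won't include all the code, but I will explain the method. I can paste more if it helps.)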

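Save the values of record n into variables. SELECT all records FROM largeTable that have one value in common with record n.

INSERT all the selected records into #oneMatch, and store the value that matched in [matchingOne].

SELECT all records FROM that temp table that have one value in common with record n WHERE the value in common != [matchingOne]. INSERT all the selected records into #twoMatch, including [matchingOne] AND [matchingTwo] ...

Repeat until the insert into #fiveMatch. DELETE the #fiveMatch records from largeTable and move on to record n + 1.

I am having trouble implementing this solution. How can I assign the matching variable depending on which part of the WHERE clause matched?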

-- SELECT all records with ONE matching field.
-- Open question: mOne should hold whichever column matched (see the trailing comments).
INSERT INTO #oneMatch (ID_pk, dOne, dTwo, dThree, dFour, dFive, dSix, mOne)
SELECT ID_pk, dOne, dTwo, dThree, dFour, dFive, dSix
FROM dbo.BaseCombinationsExtended
WHERE  ( [dOne]   IN (@dOne, @dTwo, @dThree, @dFour, @dFive, @dSix)  -- mOne = dOne?
      OR [dTwo]   IN (@dOne, @dTwo, @dThree, @dFour, @dFive, @dSix)  -- mOne = dTwo?
      OR [dThree] IN (@dOne, @dTwo, @dThree, @dFour, @dFive, @dSix)  -- mOne = dThree?
...
      OR [dSix]   IN (@dOne, @dTwo, @dThree, @dFour, @dFive, @dSix)  -- mOne = dSix?
    )

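I was able to "fake" the above using six queries, but that is too inefficient...

Sorry for the long description. Any help would be greatly appreciated (either a new solution or a way to implement my attempt above), as this problem has been nagging at me for a while...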

3 Answers:

Answer 0 (score: 2)

Unless I am missing something, this should produce the correct result.

declare @T table 
(
  PK INT identity primary key,
  dOne SmallINT,
  dTwo SmallINT,
  dThree SmallINT,
  dFour SmallINT,
  dFive SmallINT,
  dSix SmallINT
)

insert into @T values
(76, 89, 99, 102, 155, 122),
(11, 89, 99, 102, 155, 122),
(89, 99, 102, 155, 122, 130)

;with q1(PK, d1, d2, d3, d4, d5) as
(
  select PK, dTwo, dThree, dFour, dFive, dSix
  from @T
  union all
  select PK, dOne, dThree, dFour, dFive, dSix
  from @T
  union all
  select PK, dOne, dTwo, dFour, dFive, dSix
  from @T
  union all
  select PK, dOne, dTwo, dThree, dFive, dSix
  from @T
  union all
  select PK, dOne, dTwo, dThree, dFour, dSix
  from @T
  union all
  select PK, dOne, dTwo, dThree, dFour, dFive
  from @T
),
q2 as
(
  select PK,
         row_number() over(partition by d1, d2, d3, d4, d5 order by PK) as rn
  from q1
),
q3 as
(
  select PK
  from q2
  where rn = 1
  group by PK
  having count(*) = 6   
)
select T.*
from @T as T
  inner join q3 as Q
    on T.PK = Q.PK  
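
The query above only selects the rows to keep. A minimal sketch of the matching delete is shown below; it assumes the real table is dbo.BaseCombinationsExtended with key ID_pk (names taken from the question) and that the six values in each row are stored in ascending order, since the five-value subsets are compared position by position. CROSS APPLY over a VALUES list is used here in place of the six UNION ALL branches; that substitution is mine, not part of the answer.

```sql
-- Sketch only: table and key names are assumed from the question.
;WITH q1 AS
(
    -- Emit the six 5-value subsets of every row, one per omitted column.
    SELECT T.ID_pk, S.d1, S.d2, S.d3, S.d4, S.d5
    FROM dbo.BaseCombinationsExtended AS T
    CROSS APPLY (VALUES
        (T.dTwo, T.dThree, T.dFour,  T.dFive, T.dSix),
        (T.dOne, T.dThree, T.dFour,  T.dFive, T.dSix),
        (T.dOne, T.dTwo,   T.dFour,  T.dFive, T.dSix),
        (T.dOne, T.dTwo,   T.dThree, T.dFive, T.dSix),
        (T.dOne, T.dTwo,   T.dThree, T.dFour, T.dSix),
        (T.dOne, T.dTwo,   T.dThree, T.dFour, T.dFive)
    ) AS S (d1, d2, d3, d4, d5)
),
q2 AS
(
    -- Number the rows that share a subset; the lowest ID_pk gets rn = 1.
    SELECT ID_pk,
           ROW_NUMBER() OVER (PARTITION BY d1, d2, d3, d4, d5
                              ORDER BY ID_pk) AS rn
    FROM q1
)
-- A row is removed if any of its subsets was already seen on an earlier row.
DELETE T
FROM dbo.BaseCombinationsExtended AS T
WHERE EXISTS (SELECT 1
              FROM q2
              WHERE q2.ID_pk = T.ID_pk
                AND q2.rn > 1);
```

Replacing "DELETE T" with "SELECT T.*" previews the affected rows before anything is removed.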

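Answer 1 (score: 0)

Edit: the approach below may perform better than N squared, depending on the optimizer. If all 5 columns are indexed, it should need only 6 index seeks per row, which is still N * log N. It does seem a bit dumb, though.

You can code-generate the WHERE condition from all the permutations of five matches, so the records to delete are given by: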

SELECT * FROM SillyTable ToDelete WHERE EXISTS
(
    SELECT PK From SillyTable Duplicate 
    WHERE (   (
            (Duplicate.dOne=ToDelete.dOne) 
            AND (Duplicate.dTwo=ToDelete.dTwo) 
            AND (Duplicate.dThree=ToDelete.dThree)
            AND (Duplicate.dFour=ToDelete.dFour)
            AND (Duplicate.dFive=ToDelete.dFive)
        ) OR (
            (Duplicate.dOne=ToDelete.dTwo) 
            AND (Duplicate.dTwo=ToDelete.dThree) 
            AND (Duplicate.dThree=ToDelete.dFour)
            AND (Duplicate.dFour=ToDelete.dFive)
            AND (Duplicate.dFive=ToDelete.dSix)
        ) OR (
            (Duplicate.dTwo=ToDelete.dOne) 
            AND (Duplicate.dThree=ToDelete.dTwo) 
            AND (Duplicate.dFour=ToDelete.dThree)
            AND (Duplicate.dFive=ToDelete.dFour)
            AND (Duplicate.dSix=ToDelete.dFive)
        ) OR (
            (Duplicate.dTwo=ToDelete.dTwo) 
            AND (Duplicate.dThree=ToDelete.dThree) 
            AND (Duplicate.dFour=ToDelete.dFour)
            AND (Duplicate.dFive=ToDelete.dFive)
            AND (Duplicate.dSix=ToDelete.dSix)
        ) ...                       

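This continues until all 36 combinations are covered (one column is left unmatched on each side of the join, out of 6 possible columns, so 6 * 6 gives you every possibility). I would code-generate this, because it is a lot of typing, and what if you want 4 out of 6 matches tomorrow? But you could write it by hand, I guess.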
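
One possible way to generate those predicates, sketched in T-SQL itself. The table variable @cols and the output column names are hypothetical helpers, not part of the answer. The query returns 180 rows: each (skipDuplicate, skipToDelete) pair identifies one of the 36 blocks, and its five predicates are meant to be joined with AND, with the blocks joined by OR. Concatenating them into a single statement is left out.

```sql
-- Hypothetical helper table listing the six value columns in order.
DECLARE @cols TABLE (pos INT PRIMARY KEY, name SYSNAME);
INSERT INTO @cols (pos, name) VALUES
    (1, N'dOne'),  (2, N'dTwo'),  (3, N'dThree'),
    (4, N'dFour'), (5, N'dFive'), (6, N'dSix');

;WITH kept AS
(
    -- For each skipped column, number the five remaining columns 1..5.
    SELECT s.pos AS skipPos,
           c.name,
           ROW_NUMBER() OVER (PARTITION BY s.pos ORDER BY c.pos) AS k
    FROM @cols AS s
    INNER JOIN @cols AS c
        ON c.pos <> s.pos
)
-- Pair the k-th kept column on one side with the k-th kept column on the other.
SELECT a.skipPos AS skipDuplicate,
       b.skipPos AS skipToDelete,
       a.k,
       N'Duplicate.' + a.name + N' = ToDelete.' + b.name AS predicate
FROM kept AS a
INNER JOIN kept AS b
    ON b.k = a.k
ORDER BY skipDuplicate, skipToDelete, a.k;
```

With skipDuplicate = 6 and skipToDelete = 1, for example, the five rows reproduce the second block of the query above (Duplicate.dOne = ToDelete.dTwo through Duplicate.dFive = ToDelete.dSix).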

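Answer 2 (score: 0)

I can't make any promises about performance, but you can try this. The first thing I did was put the data into a more normalized structure.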

CREATE TABLE dbo.Test_Sets_Normalized (my_id INT NOT NULL, c SMALLINT NOT NULL)
GO

INSERT INTO dbo.Test_Sets_Normalized (my_id, c)
SELECT my_id, c1 FROM dbo.Test_Sets UNION ALL
SELECT my_id, c2 FROM dbo.Test_Sets UNION ALL
SELECT my_id, c3 FROM dbo.Test_Sets UNION ALL
SELECT my_id, c4 FROM dbo.Test_Sets UNION ALL
SELECT my_id, c5 FROM dbo.Test_Sets UNION ALL
SELECT my_id, c6 FROM dbo.Test_Sets
GO

SELECT DISTINCT
    T2.my_id
FROM
    (SELECT DISTINCT my_id FROM dbo.Test_Sets_Normalized) T1
INNER JOIN (SELECT DISTINCT my_id FROM dbo.Test_Sets_Normalized) T2 ON T2.my_id > T1.my_id
WHERE
    (
    SELECT
        COUNT(*)
    FROM
        dbo.Test_Sets_Normalized T3
    INNER JOIN dbo.Test_Sets_Normalized T4 ON
        T4.my_id = T2.my_id AND
        T4.c = T3.c
    WHERE
        T3.my_id = T1.my_id) >= 5

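This should give you the IDs you need. Once you have confirmed that it does what you want, you can JOIN back to the original table and delete by ID.

There is probably an improvement possible somewhere that would not need the DISTINCT. I'll give it some more thought.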
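
A sketch of what the delete step might look like, using the table and column names from this answer. It swaps the correlated COUNT(*) subquery for a self-join with GROUP BY, which selects the same IDs without the DISTINCT derived tables. The index and its name are my own assumptions, and I make no claim about how this performs on a million rows.

```sql
-- Hypothetical index; lets the self-join below seek on the value first.
CREATE CLUSTERED INDEX IX_Test_Sets_Normalized_c
    ON dbo.Test_Sets_Normalized (c, my_id);
GO

-- Preview first by replacing "DELETE T" with "SELECT T.*".
DELETE T
FROM dbo.Test_Sets AS T
WHERE T.my_id IN
(
    SELECT B.my_id
    FROM dbo.Test_Sets_Normalized AS A
    INNER JOIN dbo.Test_Sets_Normalized AS B
        ON  B.c = A.c
        AND B.my_id > A.my_id      -- only compare against rows with a lower ID
    GROUP BY A.my_id, B.my_id
    HAVING COUNT(*) >= 5           -- the pair shares at least five values
);
```

Because the values within a row are strictly increasing, each shared value is counted once per pair of rows, so COUNT(*) is the number of values the two rows have in common.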