Question

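I want to identify duplicates in my database based on multiple columns from several tables. In the example below, 1 & 5 and 2 & 4 are duplicates, because all four columns have the same values. How can I identify such records using SQL? When I had to identify duplicates based on a single column, I used GROUP BY with HAVING COUNT > 1, but I am not sure how to identify them based on multiple columns. However, I see that when I group by all 4 columns with HAVING COUNT > 1, #3 and 6 show up, although technically they do not meet my requirements.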

T1

ID | Col1 | Col2  
---| ---  | ---  
1  |   A  |  US  
2  |   B  |  FR  
3  |   C  |  AU  
4  |   B  |  FR  
5  |   A  |  US  
6  |   D  |  UK

T2

ID | Col1  
---| ---              
1  | Apple  
1  | Kiwi
2  | Pear  
3  | Banana 
3  | Banana 
4  | Pear  
5  | Apple

T3

ID | Col1     
---|  ---   
1  | Spinach  
1  | Beets
2  | Celery  
3  | Radish  
4  | Celery  
5  | Spinach  
6  | Celery
6  | Celery

My expected result is:

1 A US Apple Spinach  
5 A US Apple Spinach  
2 B FR Pear  Celery  
4 B FR Pear  Celery

Answer 1

The problem is that your result set needs to include the ID column, which is unique. So a straightforward GROUP BY ... HAVING won't cut it. This will work.

 with cte as
     ( select t1.id
              , t1.col1 as t1_col1 
              , t1.col2 as t1_col2 
              , t2.col1 as t2_col1 
              , t3.col1 as t3_col1 
       from t1
            join t2 on t1.id = t2.id
            join t3 on t1.id = t3.id
     )
 select cte.*
 from cte
 where (t1_col1, t1_col2, t2_col1, t3_col1) in 
      ( select t1_col1, t1_col2, t2_col1, t3_col1
        from cte 
        group by t1_col1, t1_col2, t2_col1, t3_col1 having count(*) > 1)
 /

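Using the subquery factoring syntax (the WITH clause) is optional, but I find it useful as a signal that the subquery is used more than once in the query.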
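To try the query above against the question's sample rows, here is a minimal sketch that loads T1, T2 and T3 into an in-memory SQLite database through Python's standard `sqlite3` module. The choice of SQLite, the lower-case table names, the added `order by id` and the helper name `find_duplicates` are assumptions made for illustration; the query itself is otherwise unchanged. It relies on row-value `IN`, which needs SQLite 3.15 or later.

```python
import sqlite3

# Sample rows copied from the question's T1, T2 and T3 tables.
T1 = [(1, "A", "US"), (2, "B", "FR"), (3, "C", "AU"),
      (4, "B", "FR"), (5, "A", "US"), (6, "D", "UK")]
T2 = [(1, "Apple"), (1, "Kiwi"), (2, "Pear"), (3, "Banana"),
      (3, "Banana"), (4, "Pear"), (5, "Apple")]
T3 = [(1, "Spinach"), (1, "Beets"), (2, "Celery"), (3, "Radish"),
      (4, "Celery"), (5, "Spinach"), (6, "Celery"), (6, "Celery")]

# The answer's query, with "order by id" added so the output is stable.
QUERY = """
with cte as (
    select t1.id
         , t1.col1 as t1_col1
         , t1.col2 as t1_col2
         , t2.col1 as t2_col1
         , t3.col1 as t3_col1
    from t1
         join t2 on t1.id = t2.id
         join t3 on t1.id = t3.id
)
select cte.*
from cte
where (t1_col1, t1_col2, t2_col1, t3_col1) in
      ( select t1_col1, t1_col2, t2_col1, t3_col1
        from cte
        group by t1_col1, t1_col2, t2_col1, t3_col1
        having count(*) > 1 )
order by id
"""


def find_duplicates():
    """Load the sample data (hypothetical helper) and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table t1 (id integer, col1 text, col2 text)")
    conn.execute("create table t2 (id integer, col1 text)")
    conn.execute("create table t3 (id integer, col1 text)")
    conn.executemany("insert into t1 values (?, ?, ?)", T1)
    conn.executemany("insert into t2 values (?, ?)", T2)
    conn.executemany("insert into t3 values (?, ?)", T3)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in find_duplicates():
        print(row)
```

On this data the query returns IDs 1, 2, 4 and 5 as wanted, but ID 3 also comes back twice, because T2 holds two identical Banana rows for it and the join counts them as a group of two.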

"I've come across another scenario in my data where some IDs have the same values in T2 and T3, and they show up as duplicates."

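Duplicate IDs in the child tables produce a Cartesian product in the joined subquery, which leads to false positives in the main result set. Ideally you should be able to handle this by introducing additional filters on those tables to remove the unwanted rows. However, if the data quality is too poor and there are no valid rules, you will have to fall back on distinct: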

with cte as ( 
    select t1.id
         , t1.col1 as t1_col1
         , t1.col2 as t1_col2
         , t2.col1 as t2_col1
         , t3.col1 as t3_col1
    from t1
         join ( select distinct id, col1 from t2) t2 on t1.id = t2.id
         join ( select distinct id, col1 from t3) t3 on t1.id = t3.id
 ) ...
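As a rough check of the distinct variant, the sketch below runs it on the question's sample rows in an in-memory SQLite database via Python's `sqlite3` module. It assumes that the elided remainder of the query is the same outer select as in the first version; that assumption, the added `order by id`, and the helper name `find_duplicates_distinct` are illustrative and not part of the answer.

```python
import sqlite3

# Question's sample data; multi-row VALUES needs SQLite 3.7.11 or later.
SETUP = """
create table t1 (id integer, col1 text, col2 text);
create table t2 (id integer, col1 text);
create table t3 (id integer, col1 text);
insert into t1 values (1,'A','US'),(2,'B','FR'),(3,'C','AU'),
                      (4,'B','FR'),(5,'A','US'),(6,'D','UK');
insert into t2 values (1,'Apple'),(1,'Kiwi'),(2,'Pear'),(3,'Banana'),
                      (3,'Banana'),(4,'Pear'),(5,'Apple');
insert into t3 values (1,'Spinach'),(1,'Beets'),(2,'Celery'),(3,'Radish'),
                      (4,'Celery'),(5,'Spinach'),(6,'Celery'),(6,'Celery');
"""

# ASSUMPTION: everything after the CTE mirrors the first query.
QUERY = """
with cte as (
    select t1.id
         , t1.col1 as t1_col1
         , t1.col2 as t1_col2
         , t2.col1 as t2_col1
         , t3.col1 as t3_col1
    from t1
         join ( select distinct id, col1 from t2 ) t2 on t1.id = t2.id
         join ( select distinct id, col1 from t3 ) t3 on t1.id = t3.id
)
select cte.*
from cte
where (t1_col1, t1_col2, t2_col1, t3_col1) in
      ( select t1_col1, t1_col2, t2_col1, t3_col1
        from cte
        group by t1_col1, t1_col2, t2_col1, t3_col1
        having count(*) > 1 )
order by id
"""


def find_duplicates_distinct():
    """Run the de-duplicated query on the sample data (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SETUP)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in find_duplicates_distinct():
        print(row)
```

With the repeated child rows collapsed before the join, ID 3 forms a group of one and drops out, leaving the four rows the question expects.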

Answer 2

You can add all the columns in which you want to find duplicates to the GROUP BY clause, and then write the count condition in the HAVING clause.

select t1.id, t1.col1, t1.col2, t2.col1, t3.col1
from t1
     join t2 on t1.id = t2.id
     join t3 on t3.id = t1.id
where (t1.col1, t1.col2, t2.col1, t3.col1) in (
    select t1.col1, t1.col2, t2.col1, t3.col1
    from t1 join t2 on t1.id = t2.id join t3 on t3.id = t1.id
    group by t1.col1, t1.col2, t2.col1, t3.col1
    having count(*) > 1 )

Answer 3

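For your sample data, you can achieve this by inner joining all three tables and filtering in the WHERE clause with a subquery that uses group by tA.Col1 having count(tA.Col1)>1, to get the desired result.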

SELECT t1.ID,
       t1.Col1,
       t1.Col2,
       t2.Col1,
       t3.Col1
FROM table1 t1
JOIN table2 t2 ON t1.ID = t2.ID
JOIN table3 t3 ON t1.ID = t3.ID
WHERE t1.Col1 IN
    ( SELECT tA.Col1
     FROM table1 tA
     GROUP BY tA.Col1
     HAVING count(tA.Col1)>1)
ORDER BY t1.ID;

Result

ID  Col1    Col2    Col1    Col1
-----------------------------------
1   A        US     Apple   Spinach
2   B        FR     Pear    Celery
4   B        FR     Pear    Celery
5   A        US     Apple   Spinach

You can check the demo here.

Hope this helps.
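For comparison, the sketch below runs this query unchanged against the sample rows exactly as they appear in the question, again using an in-memory SQLite database through Python's `sqlite3` module. Loading the question's T1, T2 and T3 rows into tables named table1, table2 and table3 is an assumption, as is the helper name `run_single_column_filter`; the data behind the answer's own demo may have been different.

```python
import sqlite3

# ASSUMPTION: table1/table2/table3 hold the question's T1/T2/T3 rows.
SETUP = """
create table table1 (ID integer, Col1 text, Col2 text);
create table table2 (ID integer, Col1 text);
create table table3 (ID integer, Col1 text);
insert into table1 values (1,'A','US'),(2,'B','FR'),(3,'C','AU'),
                          (4,'B','FR'),(5,'A','US'),(6,'D','UK');
insert into table2 values (1,'Apple'),(1,'Kiwi'),(2,'Pear'),(3,'Banana'),
                          (3,'Banana'),(4,'Pear'),(5,'Apple');
insert into table3 values (1,'Spinach'),(1,'Beets'),(2,'Celery'),
                          (3,'Radish'),(4,'Celery'),(5,'Spinach'),
                          (6,'Celery'),(6,'Celery');
"""

# The answer's query, copied as written.
QUERY = """
SELECT t1.ID,
       t1.Col1,
       t1.Col2,
       t2.Col1,
       t3.Col1
FROM table1 t1
JOIN table2 t2 ON t1.ID = t2.ID
JOIN table3 t3 ON t1.ID = t3.ID
WHERE t1.Col1 IN
    ( SELECT tA.Col1
     FROM table1 tA
     GROUP BY tA.Col1
     HAVING count(tA.Col1)>1)
ORDER BY t1.ID;
"""


def run_single_column_filter():
    """Run the query on the question's data (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SETUP)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in run_single_column_filter():
        print(row)
```

Under these assumptions the query returns seven rows rather than four. The filter looks only at Col1 of the first table, so ID 1 comes back once for every combination of its two T2 rows and two T3 rows, including the Kiwi and Beets combinations that have no match under ID 5.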

Identifying duplicates based on multiple columns
