Finding duplicate values in a table (and getting their PKs)

Asked: 2013-08-08 14:39:34

Tags: oracle plsql oracle11g subquery

I'm having trouble selecting some values, and I've simplified the problem to the example below. Basically, I have a table like this:

CREATE TABLE sample_table
(
  pk_id        NUMBER,
  business_id  NUMBER
);

Some of the business_id values in this table are duplicated, and I need to know the PKs of those records.

Let's assume I (further) set up and populate the table as follows:

ALTER TABLE sample_table ADD (
  CONSTRAINT sample_table_PK
 PRIMARY KEY
 (pk_id));

 create sequence sample_sequence;

 create trigger sample_trigger before insert on sample_table for each row 
 begin
    :new.pk_id := sample_sequence.nextval; 
 end;
 /


 insert into sample_table (business_id) values (1000);
 insert into sample_table (business_id) values (1001);
 insert into sample_table (business_id) values (1002);
 insert into sample_table (business_id) values (1003);
 insert into sample_table (business_id) values (1003);
 insert into sample_table (business_id) values (1004);

Finding out which business_id values are duplicated is easy:

  SELECT   business_id, COUNT (business_id)
    FROM   sample_table
GROUP BY   business_id
  HAVING   COUNT (business_id) > 1;

But I don't want the business_ids, I want the pk_ids.

I can use the query above as a subquery to get them:

select * from sample_table where business_id in (
  SELECT   business_id
    FROM   sample_table
GROUP BY   business_id
  HAVING   COUNT (business_id) > 1);

Or by using COUNT(*) OVER (PARTITION BY ...) with subquery factoring:

with q as 
(SELECT   pk_id, business_id, COUNT ( * ) OVER (PARTITION BY business_id) totalcount
  FROM   sample_table)
select * from q
where q.totalcount > 1;

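But both of them make my query slow. This example works fine, but performance is not great when I run it against production data of roughly 500,000 rows. So I'm wondering whether there is a better way to do this.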

2 answers:

Answer 0: (score: 2)

As it stands, with only the table and the PK index, the first query:

SELECT * from sample_table where business_id in (
  SELECT   business_id
    FROM   sample_table
GROUP BY   business_id
  HAVING   COUNT (business_id) > 1);

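will need a full table scan to evaluate the subquery. Then, given the list of business_ids it found, the main query needs a second full scan (the PK index is of no use here). You'll see a plan like this: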

-----------------------------------------------...
| Id  | Operation             | Name         | ...
-----------------------------------------------...
|   0 | SELECT STATEMENT      |              | ...
|*  1 |  HASH JOIN RIGHT SEMI |              | ...
|   2 |   VIEW                | VW_NSO_1     | ...
|*  3 |    FILTER             |              | ...
|   4 |     HASH GROUP BY     |              | ...
|   5 |      TABLE ACCESS FULL| SAMPLE_TABLE | ...
|   6 |   TABLE ACCESS FULL   | SAMPLE_TABLE | ...
-----------------------------------------------...

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("BUSINESS_ID"="BUSINESS_ID")
   3 - filter(COUNT(*)>1)
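A plan like the one above can usually be reproduced with EXPLAIN PLAN and DBMS_XPLAN.DISPLAY. This is only a sketch: the operations and costs you get depend on the Oracle version, the optimizer statistics, and the data, so the output on another system may differ.

```sql
-- Ask the optimizer for its plan without actually running the query.
EXPLAIN PLAN FOR
SELECT * FROM sample_table WHERE business_id IN (
  SELECT   business_id
    FROM   sample_table
GROUP BY   business_id
  HAVING   COUNT (business_id) > 1);

-- Display the plan that was just written to the plan table.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```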

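Throw a unique index on business_id and pk_id (in that order) and you should be able to drop the second table scan and use the index to look up only the duplicated business_ids. (The first table scan is unavoidable, since every row has to be checked for possible duplication.) With the composite index, Oracle can look up a business_id and get the pk_id at the same time, without going back to the table.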

-------------------------------------------------...
| Id  | Operation             | Name            |...
-------------------------------------------------...
|   0 | SELECT STATEMENT      |                 |...
|   1 |  NESTED LOOPS         |                 |...
|   2 |   VIEW                | VW_NSO_1        |...
|*  3 |    FILTER             |                 |...
|   4 |     HASH GROUP BY     |                 |...
|   5 |      TABLE ACCESS FULL| SAMPLE_TABLE    |...
|*  6 |   INDEX RANGE SCAN    | BUSINESS_ID_IDX |...
-------------------------------------------------...

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter(COUNT(*)>1)                          
   6 - access("BUSINESS_ID"="BUSINESS_ID")
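The answer does not show the DDL for the index, so the statement below is an assumption. It takes the name BUSINESS_ID_IDX from the plan above and the column order from the text. Because pk_id is already unique on its own, the pair (business_id, pk_id) is unique as well, which is why UNIQUE can be declared here.

```sql
-- Composite index: business_id first so it can drive the lookup,
-- pk_id second so the query can be answered from the index alone.
CREATE UNIQUE INDEX business_id_idx
  ON sample_table (business_id, pk_id);
```

Whether the optimizer actually switches to the INDEX RANGE SCAN depends on its cost estimates, so it is worth checking the plan again after creating the index.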

If duplicates are the exception, this should work quite well. In the worst case, where every business_id is duplicated, the index lookups could get ugly.
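To get a rough idea of which case applies to your data, you could compare the total row count with the number of distinct business_ids. This is just an illustrative check, and the column aliases are made up for the example.

```sql
-- surplus_rows is 0 when there are no duplicates and grows with duplication.
-- Note: COUNT(DISTINCT ...) ignores NULL business_ids.
SELECT COUNT(*)                               AS total_rows,
       COUNT(DISTINCT business_id)            AS distinct_ids,
       COUNT(*) - COUNT(DISTINCT business_id) AS surplus_rows
  FROM sample_table;
```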

You could also try something fun like this:

SELECT business_id, LISTAGG(pk_id, ',') WITHIN GROUP (ORDER BY pk_id) AS pk_ids
FROM sample_table
GROUP BY business_id
HAVING COUNT(*) > 1;

Now you do only one full table scan, but all the pk_ids for a given business_id end up glued together in a single row.

Answer 1: (score: 0)

There are several ways to do this. I prefer a JOIN, because it may speed up the query:

SELECT   
  DISTINCT a.pk_id
FROM   
  sample_table a
  JOIN sample_table b ON ( a.pk_id <> b.pk_id AND a.business_id = b.business_id )

Also, an index on business_id would help.
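A minimal sketch of such an index follows. The name sample_table_bid_idx is hypothetical, chosen only for this example.

```sql
-- Plain (non-unique) index, since business_id is expected to contain duplicates.
CREATE INDEX sample_table_bid_idx
  ON sample_table (business_id);
```

If the composite (business_id, pk_id) index suggested in the other answer is already in place, a separate single-column index is probably redundant, because that index also leads with business_id.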