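Removing duplicates from a result set based on three columns

Time: 2014-10-05 01:15:29

Tags: sql oracle oracle11g

I have a complex Oracle view that returns data containing logical duplicates among its rows. My goal is to retrieve only one row whenever duplicates are found based on two columns (a text column and a datetime), while the choice of which duplicate to return is based on a third column (also a datetime).

I have mocked up the result set below as a table with stub data (also posted on SQLFiddle):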

CREATE TABLE TimeTable (
  ID number NOT NULL,
  NAME VARCHAR2(20) NOT NULL,      -- Grouped by this first
  TARGETVALUE INT,                 -- ultimate target value to be returned (no precedence from this value)
  NOTE VARCHAR2(20) NULL,          -- Just a note for the developer on StackOverflow
  BEGIN_DATE TIMESTAMP NOT NULL,   -- Grouped by this 2nd (down to the minute, not seconds) 
  APPROVAL_DATE TIMESTAMP NOT NULL -- Decides the ties for duplicates

 );

 insert into TimeTable (ID, NAME, TARGETVALUE, NOTE, BEGIN_DATE, APPROVAL_DATE) values 
(1, 'Alpha', 5,  'Duplicate First', '08-MAR-14 09.43.00.000000000', 
                                    '09-MAR-14 09.43.00.000000000');

 insert into TimeTable (ID, NAME, TARGETVALUE, NOTE, BEGIN_DATE, APPROVAL_DATE) values 
(2, 'Alpha', 2,  'Duplicate Middle', '08-MAR-14 09.43.00.000000000', 
                                     '09-MAR-14 09.43.00.000000000');


 insert into TimeTable (ID, NAME, TARGETVALUE, NOTE, BEGIN_DATE, APPROVAL_DATE) values 
(3, 'Alpha', 3, 'Final Target', '08-MAR-14 09.43.00.000000000', 
                                '09-MAR-14 10.00.00.000000000');

-- Same time as alpha, but not related.
 insert into TimeTable (ID, NAME, TARGETVALUE, NOTE, BEGIN_DATE, APPROVAL_DATE) values 
(4, 'Beta', 4, 'Only Target', '08-MAR-14 09.43.30.000000000', 
                              '09-MAR-14 11.00.30.000000000');

The desired result set is 2 rows:

3, 'Alpha', 3, '08-MAR-14 09.43.00.000000000', '09-MAR-14 10.00.00.000000000'
4, 'Beta', 4, '08-MAR-14 09.43.30.000000000', '09-MAR-14 11.00.30.000000000'

To clarify, note what happens if I also had this row in the database:

5, 'Alpha', 8, '09-MAR-14 09.43.00.000000000', '12-MAR-14 10.00.00.000000000'

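Then that Alpha row would be unique and would be returned, because its different BEGIN_DATE (March 9 rather than March 8) means it is not considered a duplicate.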


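These are the rules to follow:

  1. NAME is what relates the rows to each other.
  2. BEGIN_DATE is the second relation: rows with the same time, down to the minute, are duplicates that need to be weeded out based on #3.
  3. If there are duplicates per #1 and #2, the latest APPROVAL_DATE decides which rows are removed: the row with the latest APPROVAL_DATE wins over those with earlier dates.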

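2 answers:

Answer 0: (score: 2)

Aggregating the data according to the rules you mention should be a simple job for analytic functions.

You want the MAX of APPROVAL_DATE within each group of NAME, BEGIN_DATE. So all you need to do is: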

MAX(APPROVAL_DATE) OVER(PARTITION BY NAME, BEGIN_DATE ORDER BY APPROVAL_DATE DESC) max_appr_dt

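Then, in your outer query, simply filter out the duplicates with the predicate WHERE APPROVAL_DATE = max_appr_dt.

Note: from a performance point of view, this approach scans the table only once. It is therefore usually cheaper than approaches that join the table to itself and scan it more than once.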
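
A side note on the window clause: the ORDER BY inside MAX(...) OVER(...) is not needed for this filter to work. As a minimal sketch against the TimeTable sample from the question, the maximum can be taken over the whole partition instead. Rows tied on the latest APPROVAL_DATE all pass the filter, so this form can return more than one row per group.

```sql
-- Sketch only: same filter as above, but with no ORDER BY in the window,
-- so MAX is computed over the whole (NAME, BEGIN_DATE) partition.
SELECT *
FROM (
  SELECT a.*,
         MAX(approval_date) OVER (PARTITION BY name, begin_date) AS max_appr_dt
  FROM timetable a
)
WHERE approval_date = max_appr_dt;
```

With the four sample rows, this keeps IDs 3 and 4.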

Update: adding a complete test case, as requested in the comments.

There are two ways to do this with analytic functions:

1. MAX

SQL> SELECT *
  2  FROM
  3    (SELECT A.*,
  4      MAX(APPROVAL_DATE) OVER(PARTITION BY NAME, BEGIN_DATE ORDER BY APPROVAL_DATE DESC) max_appr_dt
  5    FROM TIMETABLE A
  6    )
  7  WHERE approval_date = max_appr_dt
  8  /

        ID NAME                 TARGETVALUE NOTE                 BEGIN_DATE                     APPROVAL_DATE                  MAX_APPR_DT
---------- -------------------- ----------- -------------------- ------------------------------ ------------------------------ ------------------------------
         3 Alpha                          3 Final Target         08-MAR-14 09.43.00.000000 AM   09-MAR-14 10.00.00.000000 AM   09-MAR-14 10.00.00.000000 AM
         4 Beta                           4 Only Target          08-MAR-14 09.43.30.000000 AM   09-MAR-14 11.00.30.000000 AM   09-MAR-14 11.00.30.000000 AM

2. ROW_NUMBER()

SQL> SELECT *
  2  FROM
  3    (SELECT a.*,
  4      row_number() OVER(PARTITION BY NAME, BEGIN_DATE ORDER BY APPROVAL_DATE DESC) AS "RNK"
  5    FROM TIMETABLE A
  6    )
  7  WHERE rnk =1
  8  /

        ID NAME                 TARGETVALUE NOTE                 BEGIN_DATE                     APPROVAL_DATE                         RNK
---------- -------------------- ----------- -------------------- ------------------------------ ------------------------------ ----------
         3 Alpha                          3 Final Target         08-MAR-14 09.43.00.000000 AM   09-MAR-14 10.00.00.000000 AM            1
         4 Beta                           4 Only Target          08-MAR-14 09.43.30.000000 AM   09-MAR-14 11.00.30.000000 AM            1
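
The column comment in the question says BEGIN_DATE should be grouped down to the minute, while both queries above partition on the exact timestamp. If the seconds can differ within a group, one possible adjustment is to partition on TRUNC(BEGIN_DATE, 'MI'). This is sketched here as an assumption about the intended rule. The extra ID DESC sort key is also an assumption, added only to make the pick deterministic when two rows share the latest APPROVAL_DATE.

```sql
-- Sketch only: ROW_NUMBER variant that groups BEGIN_DATE at minute precision.
SELECT id, name, targetvalue, begin_date, approval_date
FROM (
  SELECT a.*,
         ROW_NUMBER() OVER (
           PARTITION BY name, TRUNC(begin_date, 'MI')  -- drop seconds and fractions
           ORDER BY approval_date DESC, id DESC        -- id is a hypothetical tie-breaker
         ) AS rn
  FROM timetable a
)
WHERE rn = 1;
```

On the sample data this still returns IDs 3 and 4, since no two rows with the same NAME differ only in seconds.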

Execution plans for the two queries:

SQL> EXPLAIN PLAN FOR
  2  SELECT *
  3  FROM
  4    (SELECT A.*,
  5      MAX(APPROVAL_DATE) OVER(PARTITION BY NAME, BEGIN_DATE ORDER BY APPROVAL_DATE DESC) max_appr_dt
  6    FROM TIMETABLE A
  7    )
  8  WHERE approval_date = max_appr_dt
  9  /

Explained.

SQL>
SQL> select * from table(dbms_xplan.display)
  2  /

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 2691156688

---------------------------------------------------------------------------------
| Id  | Operation           | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |           |     4 |   356 |     3   (0)| 00:00:01 |
|*  1 |  VIEW               |           |     4 |   356 |     3   (0)| 00:00:01 |
|   2 |   WINDOW SORT       |           |     4 |   304 |     3   (0)| 00:00:01 |
|   3 |    TABLE ACCESS FULL| TIMETABLE |     4 |   304 |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------------


PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("APPROVAL_DATE"="MAX_APPR_DT")

Note
-----
   - dynamic statistics used: dynamic sampling (level=2)

19 rows selected.

SQL>
SQL> EXPLAIN PLAN FOR
  2  SELECT *
  3  FROM
  4    (SELECT a.*,
  5      row_number() OVER(PARTITION BY NAME, BEGIN_DATE ORDER BY APPROVAL_DATE DESC) AS "RNK"
  6    FROM TIMETABLE A
  7    )
  8  WHERE rnk =1
  9  /

Explained.

SQL>
SQL> select * from table(dbms_xplan.display)
  2  /

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 3768566268

--------------------------------------------------------------------------------------
| Id  | Operation                | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |           |     4 |   356 |     3   (0)| 00:00:01 |
|*  1 |  VIEW                    |           |     4 |   356 |     3   (0)| 00:00:01 |
|*  2 |   WINDOW SORT PUSHED RANK|           |     4 |   304 |     3   (0)| 00:00:01 |
|   3 |    TABLE ACCESS FULL     | TIMETABLE |     4 |   304 |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------------------


PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("RNK"=1)
   2 - filter(ROW_NUMBER() OVER ( PARTITION BY "NAME","BEGIN_DATE" ORDER BY
              INTERNAL_FUNCTION("APPROVAL_DATE") DESC )<=1)

Note
-----
   - dynamic statistics used: dynamic sampling (level=2)

21 rows selected.

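Answer 1: (score: 0)

I know you are using an Oracle DB. However, I tested this with SQL Server. The SQL is meant to work on all DBs. Try my query. I am not sure whether this is the most efficient approach. Let me know if it helps.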

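-- Table aliases are written without AS so the query also parses on Oracle,
-- which does not accept AS before a table alias.
select t.ID, t.name, t.targetvalue, t.begin_date, t.approval_date
from
(
-- latest approval date per (name, begin_date) group
select name, begin_date, max(approval_date) as approval_date
from timetable
group by name, begin_date
) mx
inner join timetable t
on mx.name = t.name and
mx.begin_date = t.begin_date and
mx.approval_date = t.approval_date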
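
On Oracle specifically, the join back to the table can be avoided with the FIRST aggregate form (KEEP ... DENSE_RANK FIRST). This is a different technique from the query above and was not part of what I tested, so treat it as a hedged sketch. If several rows tie on the latest APPROVAL_DATE, each MAX(...) KEEP picks independently among the tied rows, so ID and TARGETVALUE could come from different rows.

```sql
-- Sketch only (Oracle syntax): one GROUP BY pass, no self-join.
SELECT name,
       begin_date,
       MAX(approval_date) AS approval_date,
       -- values taken from the row(s) ranked first by latest approval_date
       MAX(id)          KEEP (DENSE_RANK FIRST ORDER BY approval_date DESC) AS id,
       MAX(targetvalue) KEEP (DENSE_RANK FIRST ORDER BY approval_date DESC) AS targetvalue
FROM timetable
GROUP BY name, begin_date;
```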

Extra query, in case you want to create the table from the question in SQL Server:

CREATE TABLE TimeTable (
  ID int NOT NULL,
  NAME VARCHAR(20) NOT NULL,      
  TARGETVALUE INT,                
  NOTE VARCHAR(20) NULL,          
  BEGIN_DATE datetime NOT NULL,  
  APPROVAL_DATE datetime NOT NULL 

 );

 insert into TimeTable (ID, NAME, TARGETVALUE, NOTE, BEGIN_DATE, APPROVAL_DATE) values 
(1, 'Alpha', 5,  'Duplicate First', '08-03-14 09:43:00', 
                                    '09-03-14 09:43:00');

 insert into TimeTable (ID, NAME, TARGETVALUE, NOTE, BEGIN_DATE, APPROVAL_DATE) values 
(2, 'Alpha', 2,  'Duplicate Middle', '08-03-14 09:43:00', 
                                     '09-03-14 09:43:00');


 insert into TimeTable (ID, NAME, TARGETVALUE, NOTE, BEGIN_DATE, APPROVAL_DATE) values 
(3, 'Alpha', 3, 'Final Target', '08-03-14 09:43:00', 
                                '09-03-14 10:00:00');

-- Same time as alpha, but not related:
 insert into TimeTable (ID, NAME, TARGETVALUE, NOTE, BEGIN_DATE, APPROVAL_DATE) values 
(4, 'Beta', 4, 'Only Target', '08-03-14 09:43:30', 
                              '09-03-14 11:00:30');