Finding near-duplicate records

Time: 2018-08-02 09:47:46

Tags: sql oracle

I'm looking for a way to retrieve records that are "more or less" duplicates of the same record.

Sample data:

+----+----------+------+--------------------------+
| ID |   Date   | Item |       Description        |
+----+----------+------+--------------------------+
| 11 | 1/1/2018 | CPU  | CPU needs replacement    |
| 11 | 1/2/2018 | CPU  | CPU requires replacement |
| 12 | 1/1/2018 | CPU  | CPU needs replacement    |
+----+----------+------+--------------------------+

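The first two records are duplicates, while the last one is not.

Logic

Records count as near-duplicates if they have the same ID, are no more than 2 days apart, and hold the same item.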
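
The rule above can be read as a pairwise check between rows. Here is a minimal Python sketch of that reading, run against the sample data. It assumes the dates are month/day/year; the helper names `is_near_duplicate` and `near_duplicates` are made up for illustration and are not part of the original post.

```python
from datetime import date

# Sample rows from the question: (id, date, item, description).
# Dates are assumed to be month/day/year.
ROWS = [
    (11, date(2018, 1, 1), "CPU", "CPU needs replacement"),
    (11, date(2018, 1, 2), "CPU", "CPU requires replacement"),
    (12, date(2018, 1, 1), "CPU", "CPU needs replacement"),
]

def is_near_duplicate(a, b, max_days=2):
    """Hypothetical helper: same ID, same item, at most max_days apart."""
    return a[0] == b[0] and a[2] == b[2] and abs((a[1] - b[1]).days) <= max_days

def near_duplicates(rows, max_days=2):
    """Return rows that have at least one near-duplicate partner, ordered by ID and date."""
    hits = [
        r for i, r in enumerate(rows)
        if any(i != j and is_near_duplicate(r, other, max_days)
               for j, other in enumerate(rows))
    ]
    return sorted(hits, key=lambda r: (r[0], r[1]))

if __name__ == "__main__":
    for row in near_duplicates(ROWS):
        print(row)
```

With the sample data this keeps the two ID 11 rows and drops the ID 12 row, which matches the expectation stated in the question.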

Output

A data set ordered by ID that contains the near-duplicate records.

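3 answers:

Answer 0: (score: 1)

First, you shouldn't use Oracle reserved words such as DATE as column names, because you will have to wrap them in double quotes every time.

Now, I believe you need something like the following, but it's hard to tell without the expected output. You should also try to provide a better sample data set. With this approach, if the same ID appears on several consecutive days and each row is within 2 days of its neighbor, you will get all of those rows.

To get only the records that are 2 days or less apart, use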


SELECT ID, "DATE", ITEM, DESCRIPTION
FROM   (SELECT T.*,
               -- days until the next row of the same ID (NULL for the last row)
               LEAD(TRUNC("DATE"), 1) OVER (PARTITION BY ID ORDER BY "DATE")
                 - TRUNC("DATE") AS DIF1,
               -- days since the previous row of the same ID (NULL for the first row)
               TRUNC("DATE")
                 - LAG(TRUNC("DATE"), 1) OVER (PARTITION BY ID ORDER BY "DATE") AS DIF2
        FROM   FOCUS_SAMPLE T) T1
WHERE  T1.DIF1 <= 2 OR T1.DIF2 <= 2
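
Because the query compares each row only with its immediate neighbors, matches can chain: a run of rows spaced 2 days apart is returned in full even though the first and last rows are more than 2 days apart. The following rough Python emulation of the LEAD/LAG filter illustrates this. The dates are made up, `neighbor_filter` is a hypothetical name, and `None` stands in for SQL NULL.

```python
from datetime import date
from itertools import groupby

def neighbor_filter(rows, max_days=2):
    """Emulate the LEAD/LAG query: keep a row if the next or previous row
    with the same ID is at most max_days away. Rows are (id, date) tuples."""
    kept = []
    ordered = sorted(rows)  # order by ID, then date
    for _, group in groupby(ordered, key=lambda r: r[0]):
        part = list(group)  # one partition per ID
        for i, (rid, d) in enumerate(part):
            # DIF1: next date minus current date (None for the last row)
            dif1 = (part[i + 1][1] - d).days if i + 1 < len(part) else None
            # DIF2: current date minus previous date (None for the first row)
            dif2 = (d - part[i - 1][1]).days if i > 0 else None
            if (dif1 is not None and dif1 <= max_days) or \
               (dif2 is not None and dif2 <= max_days):
                kept.append((rid, d))
    return kept

# Made-up data: three rows for ID 11, each 2 days after the previous one,
# plus an isolated row for ID 12.
rows = [
    (11, date(2018, 1, 1)),
    (11, date(2018, 1, 3)),
    (11, date(2018, 1, 5)),
    (12, date(2018, 1, 1)),
]

if __name__ == "__main__":
    print(neighbor_filter(rows))
```

All three ID 11 rows are kept, although January 1 and January 5 are 4 days apart. Whether that chaining is desirable depends on how "near-duplicate" is meant, which the question leaves open.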

To get all records for an ID when there is even one match, use


SELECT * 
FROM   FOCUS_SAMPLE 
WHERE  ID IN (SELECT ID 
              FROM   (SELECT T.*, 
                             LEAD(TRUNC("DATE"), 1) 
                               OVER ( 
                                 PARTITION BY ID 
                                 ORDER BY "DATE") - TRUNC("DATE") AS DIF 
                      FROM   FOCUS_SAMPLE T) T1 
              WHERE  T1.DIF <= 2) 

Answer 1: (score: 0)

Try something like this. Here we use rowid to remove the duplicate rows.

create table temp as
select 11 id,sysdate mdate,'CPU' item,'CPU needs replacement' description from dual union all
select 11 id,sysdate-2 mdate,'CPU' item,'CPU requires replacement' description from dual union all
select 12 id,sysdate mdate,'CPU' item,'CPU needs replacement' description from dual ;

To select:

select * from temp where id in (
select  id from temp a where rowid not in (select max(rowid) from temp b where a.id=b.id and b.mdate  between a.mdate-2 and a.mdate  )
) order by id ;

To delete:

delete from temp a where rowid not in (select max(rowid) from temp b where a.id=b.id and b.mdate  between a.mdate-2 and a.mdate  );

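Answer 2: (score: 0)

If you want the result to be free of duplicates, you can use NOT EXISTS to filter out rows for which an earlier record exists within the preceding two days.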

SELECT *
       FROM "ELBAT" "T1"
       WHERE NOT EXISTS (SELECT *
                                FROM "ELBAT" "T2"
                                WHERE "T2"."ID" = "T1"."ID"
                                      AND "T2"."ITEM" = "T1"."ITEM"
                                      AND "T2"."ROWID" <> "T1"."ROWID"
                                      AND "T1"."DATE" - "T2"."DATE" >= 0
                                      AND "T1"."DATE" - "T2"."DATE" <= 2);

If you want only the duplicates, you can use EXISTS to keep only the rows for which another record exists within plus or minus two days.

SELECT *
       FROM "ELBAT" "T1"
       WHERE EXISTS (SELECT *
                            FROM "ELBAT" "T2"
                            WHERE "T2"."ID" = "T1"."ID"
                                  AND "T2"."ITEM" = "T1"."ITEM"
                                  AND "T2"."ROWID" <> "T1"."ROWID"
                                  AND ABS("T1"."DATE" - "T2"."DATE") <= 2);
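
To see how the NOT EXISTS and EXISTS variants differ, here is a small Python sketch that mimics both filters on the question's sample rows. The function names are hypothetical, the list index stands in for ROWID, and the dates are assumed to be month/day/year.

```python
from datetime import date

# Rows are (id, date, item) tuples; the list index plays the role of ROWID.
ROWS = [
    (11, date(2018, 1, 1), "CPU"),
    (11, date(2018, 1, 2), "CPU"),
    (12, date(2018, 1, 1), "CPU"),
]

def without_duplicates(rows, max_days=2):
    """NOT EXISTS variant: drop a row if another row with the same ID and
    item is dated 0 to max_days before it."""
    return [
        t1 for i, t1 in enumerate(rows)
        if not any(
            i != j and t2[0] == t1[0] and t2[2] == t1[2]
            and 0 <= (t1[1] - t2[1]).days <= max_days
            for j, t2 in enumerate(rows)
        )
    ]

def only_duplicates(rows, max_days=2):
    """EXISTS variant: keep a row if another row with the same ID and item
    lies within max_days before or after it."""
    return [
        t1 for i, t1 in enumerate(rows)
        if any(
            i != j and t2[0] == t1[0] and t2[2] == t1[2]
            and abs((t1[1] - t2[1]).days) <= max_days
            for j, t2 in enumerate(rows)
        )
    ]

if __name__ == "__main__":
    print(without_duplicates(ROWS))
    print(only_duplicates(ROWS))
```

On the sample data the first filter keeps the earliest ID 11 row plus the ID 12 row, and the second keeps both ID 11 rows. One thing this reading makes visible: if two rows share exactly the same date, each sees the other at a difference of 0, so the NOT EXISTS variant would drop both of them.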
