How to find duplicated timestamped data

Date: 2011-11-16 21:07:51

Tags: sql oracle plsql

I have a table defined as follows:

create table issue_attributes (
  issue_id number,
  attr_timestamp timestamp,
  attribute_name varchar2(500),
  attribute_value varchar2(500),
  CONSTRAINT ia_pk PRIMARY KEY (issue_id, attr_timestamp, attribute_name)
)

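The idea is to have a set of attributes associated with an issue (status, owner, etc.) while retaining the ability to keep a history of attribute changes.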

Because of a data import error, we ended up with duplicated data in the table:

select issue_id, attr_timestamp, attribute_name, attribute_value
from issue_attributes where issue_id = 1 and attribute_name = 'OWNER';

which produces the following sample data:

1, 01-JAN-2011 12:00, 'OWNER', 'john.doe@example.com'
1, 01-FEB-2011 12:00, 'OWNER', 'john.doe@example.com'
1, 01-MAR-2011 12:00, 'OWNER', 'john.doe@example.com'
1, 01-APR-2011 12:00, 'OWNER', 'john.doe@example.com'

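I would like to find all instances of duplicated attributes and keep only the earliest one. In this case, the desired result set for the sample data would be: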

1, 01-JAN-2011 12:00, 'OWNER', 'john.doe@example.com'

We may also have sample data like this:

2, 01-JAN-2011 12:00, 'OWNER', 'john.doe@example.com'
2, 01-FEB-2011 12:00, 'OWNER', 'jane.deere@example.com'
2, 01-MAR-2011 12:00, 'OWNER', 'john.doe@example.com'
2, 01-APR-2011 12:00, 'OWNER', 'john.doe@example.com'

In this case, I would like to get the result:

2, 01-JAN-2011 12:00, 'OWNER', 'john.doe@example.com'
2, 01-FEB-2011 12:00, 'OWNER', 'jane.deere@example.com'
2, 01-MAR-2011 12:00, 'OWNER', 'john.doe@example.com'

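This is on Oracle 11g, so I can use either SQL or PL/SQL to fix the data. I think one approach would be PL/SQL: for each issue_id, order the attributes descending and, if attribute(x) = attribute(x-1), delete attribute(x). That seems a bit brute force, and I would like to know whether there is an elegant way to do this in SQL.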

4 answers:

Answer 0: (score: 1)

Here is a nice "Oracle" way of doing it.

Using your sample data:

SQL> desc issue_attributes
 Name                                                              Null?    Type
 ----------------------------------------------------------------- -------- --------------------------------------------
 ISSUE_ID                                                                   NUMBER
 ATTR_TIMESTAMP                                                             TIMESTAMP(6)
 ATTRIBUTE_NAME                                                             VARCHAR2(500)
 ATTRIBUTE_VALUE                                                            VARCHAR2(500)

SQL> select * from issue_attributes;

  ISSUE_ID ATTR_TIMESTAMP                      ATTRIBUTE_ ATTRIBUTE_VALUE
---------- ----------------------------------- ---------- ------------------------------
         1 01-JAN-20 11.12.00.000000 AM        OWNER      john.doe@example.com
         1 01-FEB-20 11.12.00.000000 AM        OWNER      john.doe@example.com
         1 01-MAR-20 11.12.00.000000 AM        OWNER      john.doe@example.com
         1 01-APR-20 11.12.00.000000 AM        OWNER      john.doe@example.com
         1 01-JAN-20 11.12.00.000000 AM        OWNER      john.doe@example.com
         1 01-JAN-20 11.12.00.000000 AM        OWNER      john.doe@example.com
         1 01-FEB-20 11.12.00.000000 AM        OWNER      jane.deere@example.com
         1 01-MAR-20 11.12.00.000000 AM        OWNER      john.doe@example.com
         1 01-APR-20 11.12.00.000000 AM        OWNER      john.doe@example.com
         1 01-JAN-20 11.12.00.000000 AM        OWNER      john.doe@example.com
         1 01-FEB-20 11.12.00.000000 AM        OWNER      jane.deere@example.com
         1 01-MAR-20 11.12.00.000000 AM        OWNER      john.doe@example.com

12 rows selected.

SQL> delete from issue_attributes
        where rowid in(select rid
                         from (select rowid rid,
                                      row_number() over (partition by ISSUE_ID,
                                                                      ATTR_TIMESTAMP,
                                                                      ATTRIBUTE_NAME,
                                                                      ATTRIBUTE_VALUE
                                                             order by rowid) rn
                                from issue_attributes)
                        where rn<> 1);
7 rows deleted.

SQL> select * from issue_attributes;

  ISSUE_ID ATTR_TIMESTAMP                      ATTRIBUTE_ ATTRIBUTE_VALUE
---------- ----------------------------------- ---------- ------------------------------
         1 01-JAN-20 11.12.00.000000 AM        OWNER      john.doe@example.com
         1 01-FEB-20 11.12.00.000000 AM        OWNER      john.doe@example.com
         1 01-MAR-20 11.12.00.000000 AM        OWNER      john.doe@example.com
         1 01-APR-20 11.12.00.000000 AM        OWNER      john.doe@example.com
         1 01-FEB-20 11.12.00.000000 AM        OWNER      jane.deere@example.com

5 rows selected.

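Hope that helps.

Answer 1: (score: 1)

I would look at the previous row and see whether the data has changed. This can be done with the LAG analytic function.

You look back at the previous value, ordered by timestamp. If the data has changed, you want to keep the row. The first row is always kept, because LAG returns NULL when there is no previous row.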

with issue_attributes as (
  select 1 as issue_id, date '2011-01-01' as attr_timestamp, 
    'OWNER' as attribute_name, 'john.doe@example.com' as attribute_value from dual union all
  select 1 as issue_id, date '2011-02-01' as attr_timestamp, 
    'OWNER' as attribute_name, 'john.doe@example.com' as attribute_value from dual union all
  select 1 as issue_id, date '2011-03-01' as attr_timestamp, 
    'OWNER' as attribute_name, 'john.doe@example.com' as attribute_value from dual union all
  select 1 as issue_id, date '2011-04-01' as attr_timestamp, 
    'OWNER' as attribute_name, 'john.doe@example.com' as attribute_value from dual union all
  select 2 as issue_id, date '2011-01-01' as attr_timestamp, 
    'OWNER' as attribute_name, 'john.doe@example.com' as attribute_value from dual union all
  select 2 as issue_id, date '2011-02-01' as attr_timestamp, 
    'OWNER' as attribute_name, 'jane.deere@example.com' as attribute_value from dual union all
  select 2 as issue_id, date '2011-03-01' as attr_timestamp, 
    'OWNER' as attribute_name, 'john.doe@example.com' as attribute_value from dual union all
  select 2 as issue_id, date '2011-04-01' as attr_timestamp, 
    'OWNER' as attribute_name, 'john.doe@example.com' as attribute_value from dual 
)
select 
  issue_id, 
  attr_timestamp, 
  attribute_name, 
  attribute_value,
  case when lag(attribute_value) over (partition by issue_id, attribute_name order by attr_timestamp) = attribute_value then null else 'Y' end as keep_value
from 
  issue_attributes

This adds an extra column indicating whether the row needs to be kept, which you can then filter on:

ISSUE_ID ATTR_TIMESTAMP ATTRIBUTE_NAME ATTRIBUTE_VALUE        KEEP_VALUE
1        01/01/2011     OWNER          john.doe@example.com   Y
1        01/02/2011     OWNER          john.doe@example.com 
1        01/03/2011     OWNER          john.doe@example.com 
1        01/04/2011     OWNER          john.doe@example.com 
2        01/01/2011     OWNER          john.doe@example.com   Y
2        01/02/2011     OWNER          jane.deere@example.com Y
2        01/03/2011     OWNER          john.doe@example.com   Y
2        01/04/2011     OWNER          john.doe@example.com 
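
To turn the flag into an actual cleanup, the same LAG comparison can drive a DELETE. The following is a minimal sketch that has not been run against the real table; it assumes the issue_attributes table from the question, and the aliases rid and prev_value are made up for illustration. Rows whose value equals the previous value for the same issue_id and attribute_name are removed, so the earliest row of each run survives.

```sql
-- Sketch only: delete rows that repeat the previous value of the same
-- (issue_id, attribute_name), ordered by attr_timestamp.
-- rid and prev_value are illustrative aliases, not names from the question.
delete from issue_attributes
 where rowid in (
   select rid
     from (select rowid as rid,
                  attribute_value,
                  lag(attribute_value) over (partition by issue_id, attribute_name
                                             order by attr_timestamp) as prev_value
             from issue_attributes)
    -- prev_value is NULL for the first row of each partition, so that row is never selected
    where attribute_value = prev_value);
```

Running the inner SELECT on its own first shows which rows would be removed. Rows with a NULL attribute_value are never deleted by this form, because the equality comparison is not true for NULLs.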

Answer 2: (score: 0)

I don't know Oracle specifically, but something like

SELECT MAX(attr_timestamp), issue_id, attribute_name, attribute_value
FROM issue_attributes
GROUP BY issue_id, attribute_name, attribute_value

would, in some DBMSs, produce a list showing each distinct triple issue_id, attribute_name, attribute_value together with its most recent timestamp. Might be worth a try.
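
One caveat is that grouping by value merges every occurrence of that value, so a value that goes away and later returns is collapsed into a single row. The sketch below shows this with issue 2 from the question as inline data. The column aliases first_seen and last_seen are illustrative only.

```sql
-- Sketch only: inline copy of the question's issue 2 rows (dates without the time part).
with issue_attributes as (
  select 2 as issue_id, date '2011-01-01' as attr_timestamp,
    'OWNER' as attribute_name, 'john.doe@example.com' as attribute_value from dual union all
  select 2 as issue_id, date '2011-02-01' as attr_timestamp,
    'OWNER' as attribute_name, 'jane.deere@example.com' as attribute_value from dual union all
  select 2 as issue_id, date '2011-03-01' as attr_timestamp,
    'OWNER' as attribute_name, 'john.doe@example.com' as attribute_value from dual union all
  select 2 as issue_id, date '2011-04-01' as attr_timestamp,
    'OWNER' as attribute_name, 'john.doe@example.com' as attribute_value from dual
)
-- One output row per distinct (issue_id, attribute_name, attribute_value)
select issue_id, attribute_name, attribute_value,
       min(attr_timestamp) as first_seen,
       max(attr_timestamp) as last_seen
  from issue_attributes
 group by issue_id, attribute_name, attribute_value;
```

On this data the query yields two rows: one for jane.deere@example.com and one for john.doe@example.com spanning 2011-01-01 to 2011-04-01. The change back to john.doe@example.com on 2011-03-01 is not represented, whereas the question's expected result for issue 2 has three rows.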

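Answer 3: (score: 0)

What you want to detect is: tuples with the same {issueid, attributename, attributevalue} where (when ordered by timestamp) there is no intervening tuple that has the same {issueid, attributename} but a different {attributevalue}.

This can be written as a single query with one EXISTS and one NOT EXISTS subquery.

Update: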

SET search_path='tmp';

-- The rows you want to delete.
SELECT * FROM issue_attributes to_del
WHERE EXISTS (
    SELECT * FROM issue_attributes xx
    WHERE xx.issue_id = to_del.issue_id
    AND xx.attribute_name = to_del.attribute_name
    AND xx.attribute_value = to_del.attribute_value
    AND xx.attr_timestamp > to_del.attr_timestamp
    AND NOT EXISTS ( SELECT * FROM issue_attributes nx
        WHERE nx.issue_id = to_del.issue_id
        AND nx.attribute_name = to_del.attribute_name
        AND nx.attribute_value <> to_del.attribute_value
        AND nx.attr_timestamp > to_del.attr_timestamp
        AND nx.attr_timestamp < xx.attr_timestamp
        )   
    ) ;

-- For completeness: the rows you want to keep.
SELECT * FROM issue_attributes must_stay
WHERE NOT EXISTS (
    SELECT * FROM issue_attributes xx
    WHERE xx.issue_id = must_stay.issue_id
    AND xx.attribute_name = must_stay.attribute_name
    AND xx.attribute_value = must_stay.attribute_value
    AND xx.attr_timestamp > must_stay.attr_timestamp
    AND NOT EXISTS ( SELECT * FROM issue_attributes nx
        WHERE nx.issue_id = must_stay.issue_id
        AND nx.attribute_name = must_stay.attribute_name
        AND nx.attribute_value <> must_stay.attribute_value
        AND nx.attr_timestamp > must_stay.attr_timestamp
        AND nx.attr_timestamp < xx.attr_timestamp
        )
    ) ;

Results:

 issue_id |   attr_timestamp    | attribute_name |   attribute_value    
----------+---------------------+----------------+----------------------
        1 | 2011-03-01 12:00:00 | OWNER          | john.doe@example.com
        1 | 2011-01-01 12:00:00 | OWNER          | john.doe@example.com
        1 | 2011-02-01 12:00:00 | OWNER          | john.doe@example.com
        2 | 2011-03-01 12:00:00 | OWNER          | john.doe@example.com
(4 rows)

 issue_id |   attr_timestamp    | attribute_name |    attribute_value     
----------+---------------------+----------------+------------------------
        1 | 2011-04-01 12:00:00 | OWNER          | john.doe@example.com
        2 | 2011-02-01 12:00:00 | OWNER          | jane.deere@example.com
        2 | 2011-04-01 12:00:00 | OWNER          | john.doe@example.com
        2 | 2011-01-01 12:00:00 | OWNER          | john.doe@example.com
(4 rows)
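
The queries above keep the latest row of each unchanged run. If the earliest row should survive instead, as in the question's expected output, the timestamp comparisons can be mirrored. This is a sketch under the same table definition, reusing the aliases to_del, xx and nx from above; it has not been run against the real data.

```sql
-- Sketch only: rows to delete when the EARLIEST row of each run is kept.
-- A row goes if an earlier row has the same value and no row with a
-- different value lies between the two.
SELECT * FROM issue_attributes to_del
WHERE EXISTS (
    SELECT * FROM issue_attributes xx
    WHERE xx.issue_id = to_del.issue_id
    AND xx.attribute_name = to_del.attribute_name
    AND xx.attribute_value = to_del.attribute_value
    AND xx.attr_timestamp < to_del.attr_timestamp
    AND NOT EXISTS ( SELECT * FROM issue_attributes nx
        WHERE nx.issue_id = to_del.issue_id
        AND nx.attribute_name = to_del.attribute_name
        AND nx.attribute_value <> to_del.attribute_value
        AND nx.attr_timestamp < to_del.attr_timestamp
        AND nx.attr_timestamp > xx.attr_timestamp
        )
    ) ;
```

On the question's sample data this selects the February, March and April rows of issue 1 and only the April row of issue 2, leaving the rows the question lists as the desired result.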