Question

我有一张桌子：

表1

unique_id       user_id   user_seq      col_name            value_val    position
1               100       1             test1               100          1
1               100       1             test2               123          1
1               100       1             test1               a            2
1               100       1             test2               text         2
1               100       1             test3               1Rw          2
1               100       1             test4               1Tes         2

2               101       1             test1               1            1
2               101       1             test2               1            1
2               101       1             test3               1            1
2               101       1             test4               1            1
2               101       1             test5               1            1

3               100       1             test1               100          1
3               100       1             test2               123          1
3               100       1             test1               a            2
3               100       1             test2               text         2
3               100       1             test3               1Rw          2
3               100       1             test4               1Tes         2

4               101       1             test1               1            1
4               101       1             test2               1            1
4               101       1             test3               1            1
4               101       1             test4               1            1

我需要根据以下内容查找重复项：

user_id，user_seq，col_name，value_val和position对于不同的unique_id应该完全相同。

在上面的示例中，unique_id-1和3完全相同，因此应将它们作为输出返回。

对于unique_id = 2和4，对于unique_id = 4不存在test5的区别，因此不会被捕获。

输出为：

unique_id
1
3

此外，我的数据集非常庞大，大约有5000万条记录，因此需要优化的解决方案。有帮助吗？

编辑

我的表结构：

Name        Null? Type           
----------- ----- --------------         
UNIQUE_ID          NUMBER         
USER_SEQ           VARCHAR2(100)          
COL_NAME           VARCHAR2(263)  
VALUE_VAL          VARCHAR2(4000) 
POSITION           NUMBER             
USER_ID            NUMBER

没有可用的索引。

Answer 1

这是一种方法：

with sample_data as (select 1 unique_id, 100 user_id, 1 user_seq, 'test1' col_name, '100' value_val, 1 position from dual union all
                     select 1 unique_id, 100 user_id, 1 user_seq, 'test2' col_name, '123' value_val, 1 position from dual union all
                     select 1 unique_id, 100 user_id, 1 user_seq, 'test1' col_name, 'a' value_val, 2 position from dual union all
                     select 1 unique_id, 100 user_id, 1 user_seq, 'test2' col_name, 'text' value_val, 2 position from dual union all
                     select 1 unique_id, 100 user_id, 1 user_seq, 'test3' col_name, '1Rw' value_val, 2 position from dual union all
                     select 1 unique_id, 100 user_id, 1 user_seq, 'test4' col_name, '1Tes' value_val, 2 position from dual union all
                     select 2 unique_id, 101 user_id, 1 user_seq, 'test1' col_name, '1' value_val, 1 position from dual union all
                     select 2 unique_id, 101 user_id, 1 user_seq, 'test2' col_name, '1' value_val, 1 position from dual union all
                     select 2 unique_id, 101 user_id, 1 user_seq, 'test3' col_name, '1' value_val, 1 position from dual union all
                     select 2 unique_id, 101 user_id, 1 user_seq, 'test4' col_name, '1' value_val, 1 position from dual union all
                     select 2 unique_id, 101 user_id, 1 user_seq, 'test5' col_name, '1' value_val, 1 position from dual union all
                     select 3 unique_id, 100 user_id, 1 user_seq, 'test1' col_name, '100' value_val, 1 position from dual union all
                     select 3 unique_id, 100 user_id, 1 user_seq, 'test2' col_name, '123' value_val, 1 position from dual union all
                     select 3 unique_id, 100 user_id, 1 user_seq, 'test1' col_name, 'a' value_val, 2 position from dual union all
                     select 3 unique_id, 100 user_id, 1 user_seq, 'test2' col_name, 'text' value_val, 2 position from dual union all
                     select 3 unique_id, 100 user_id, 1 user_seq, 'test3' col_name, '1Rw' value_val, 2 position from dual union all
                     select 3 unique_id, 100 user_id, 1 user_seq, 'test4' col_name, '1Tes' value_val, 2 position from dual union all
                     select 4 unique_id, 101 user_id, 1 user_seq, 'test1' col_name, '1' value_val, 1 position from dual union all
                     select 4 unique_id, 101 user_id, 1 user_seq, 'test2' col_name, '1' value_val, 1 position from dual union all
                     select 4 unique_id, 101 user_id, 1 user_seq, 'test3' col_name, '1' value_val, 1 position from dual union all
                     select 4 unique_id, 101 user_id, 1 user_seq, 'test4' col_name, '1' value_val, 1 position from dual union all
                     select 6 unique_id, 101 user_id, 1 user_seq, 'test1' col_name, '1' value_val, 1 position from dual union all
                     select 6 unique_id, 101 user_id, 1 user_seq, 'test2' col_name, '1' value_val, 1 position from dual union all
                     select 6 unique_id, 101 user_id, 1 user_seq, 'test3' col_name, '1' value_val, 1 position from dual union all
                     select 6 unique_id, 101 user_id, 1 user_seq, 'test4' col_name, '1' value_val, 1 position from dual union all
                     select 7 unique_id, 101 user_id, 1 user_seq, 'test1' col_name, '1' value_val, 1 position from dual union all
                     select 7 unique_id, 101 user_id, 1 user_seq, 'test2' col_name, '1' value_val, 1 position from dual union all
                     select 7 unique_id, 101 user_id, 1 user_seq, 'test3' col_name, '1' value_val, 1 position from dual union all
                     select 7 unique_id, 101 user_id, 1 user_seq, 'test4' col_name, '1' value_val, 1 position from dual union all
                     select 5 unique_id, 100 user_id, 1 user_seq, 'test1' col_name, '100' value_val, 1 position from dual union all
                     select 5 unique_id, 100 user_id, 1 user_seq, 'test2' col_name, '123' value_val, 1 position from dual union all
                     select 5 unique_id, 100 user_id, 1 user_seq, 'test1' col_name, 'a' value_val, 2 position from dual union all
                     select 5 unique_id, 100 user_id, 1 user_seq, 'test2' col_name, 'text' value_val, 2 position from dual union all
                     select 5 unique_id, 100 user_id, 1 user_seq, 'test3' col_name, '1Rw' value_val, 2 position from dual union all
                     select 5 unique_id, 100 user_id, 1 user_seq, 'test4' col_name, '1Tes' value_val, 2 position from dual),
            cnts as (select unique_id,
                            user_id,
                            user_seq,
                            col_name,
                            value_val,
                            position,
                            count(*) over (partition by unique_id) cnt
                     from   sample_data),
             res as (select distinct sd1.unique_id id1,
                                     sd2.unique_id id2,
                                     sd1.cnt,
                                     count(*) over (partition by sd1.unique_id, sd2.unique_id) total_id1_rows_cnt
                     from   cnts sd1
                            inner join cnts sd2 on sd1.unique_id < sd2.unique_id
                                                   and sd1.user_id = sd2.user_id
                                                   and sd1.user_seq = sd2.user_seq
                                                   and sd1.col_name = sd2.col_name
                                                   and sd1.value_val = sd2.value_val
                                                   and sd1.position = sd2.position
                                                   and sd1.cnt = sd2.cnt)
select id1||','||listagg(id2, ',') within group (order by id2) grouped_unique_ids
from   res
where  id1 not in (select id2
                   from   res)
and    cnt = total_id1_rows_cnt
group by id1
order by grouped_unique_ids;

然后here's the db<>fiddle证明它有效

Answer 2

如果性能不是问题，那么自我联接又如何呢？

select a.unique_id as unique_id
from table1 a join table1 b
on a.user_id = b.user_id
and a.user_seq = b.user_seq
and a.col_name = b.col_name
and a.value_val = b.value_val
and a.position = b.position
and a.unique_id <> b.unique_id

Answer 3

假设您可以将值连接成字符串，也许最简单的方法是：

select *
from (select unique_id, count(*) over (partition by vals) as cnt
      from (select unique_id,
                   listagg(user_id || ':' || user_seq || ':' || col_name || ':' || value_val || ':' || position, ',') within group (order by user_id, user_seq, col_name, value_val, position) as vals
            from sample_data sd
            group by unique_id
           ) sd
      ) sd
where cnt > 1;

Here是db <>小提琴。

让我强调一下：由于Oracle中内部字符串长度的限制，这不是通用解决方案。但这对您的数据有效，并且可能是解决问题的便捷解决方案。

查找重复的行组合-Oracle SQL

3 个答案: