How do I find duplicate values in columns that sit alongside unique data in SQL?

Time: 2016-04-07 16:46:57

Tags: sql sql-server

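Problem: I need to find every instance where a user duplicated data. Each time the user clicks a button, it creates a new batch with its own unique batch number but otherwise identical data. I need to hand a result set to another team that includes every batch that was "duplicated" by the end user.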

Sample data: T-SQL on Microsoft SQL Server.

Data types: batch int, date date, reference int, from_state varchar(2), to_state varchar(2), item int, qty int

---------------------------------------------------------------------------------
| batch   | date        | reference | from_state | to_state | item        | qty |
---------------------------------------------------------------------------------
| 1234567 | 2016-03-01  | 8213      |  MT        | CA       | 11122334455 | 2   |
---------------------------------------------------------------------------------
| 1234567 | 2016-03-01  | 8213      |  MT        | CA       | 66622334455 | 1   |
---------------------------------------------------------------------------------
| 1234567 | 2016-03-01  | 8213      |  MT        | CA       | 77722334455 | 5   |
---------------------------------------------------------------------------------
| 1239764 | 2016-03-01  | 8597      |  MT        | CA       | 11122334455 | 2   |
---------------------------------------------------------------------------------
| 1239764 | 2016-03-01  | 8597      |  MT        | CA       | 66622334455 | 1   |
---------------------------------------------------------------------------------
| 1239764 | 2016-03-01  | 8597      |  MT        | CA       | 77722334455 | 5   |
---------------------------------------------------------------------------------
| 3456781 | 2016-03-01  | 6854      |  MT        | CA       | 00112233111 | 2   |
---------------------------------------------------------------------------------
| 3456781 | 2016-03-01  | 6854      |  MT        | CA       | 00112255111 | 1   |
---------------------------------------------------------------------------------
| 3456781 | 2016-03-01  | 6854      |  MT        | CA       | 33322334455 | 5   |
---------------------------------------------------------------------------------

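Desired result: I need to show all of the information so the problem can be fixed. I can find the duplicate records by from_state, to_state, item and qty, but I'm at a loss as to how to tie them back to the batch and reference numbers.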

---------------------------------------------------------------------------------
| batch   | date        | reference | from_state | to_state | item        | qty |
---------------------------------------------------------------------------------
| 1234567 | 2016-03-01  | 8213      |  MT        | CA       | 11122334455 | 2   |
---------------------------------------------------------------------------------
| 1234567 | 2016-03-01  | 8213      |  MT        | CA       | 66622334455 | 1   |
---------------------------------------------------------------------------------
| 1234567 | 2016-03-01  | 8213      |  MT        | CA       | 77722334455 | 5   |
---------------------------------------------------------------------------------
| 1239764 | 2016-03-01  | 8597      |  MT        | CA       | 11122334455 | 2   |
---------------------------------------------------------------------------------
| 1239764 | 2016-03-01  | 8597      |  MT        | CA       | 66622334455 | 1   |
---------------------------------------------------------------------------------
| 1239764 | 2016-03-01  | 8597      |  MT        | CA       | 77722334455 | 5   |
---------------------------------------------------------------------------------

Borrowed code:

SELECT from_state
,to_state
,item
,qty
,COUNT(*)
FROM #TEMP_duplicates
GROUP BY from_state
,to_state
,item
,qty
HAVING COUNT(*) > 1 -- HAVING must come after GROUP BY

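2 answers:

Answer 0: (score: 0)

Judging by the temp table reference in your attempted query, you appear to be using SQL Server, so I'll run with that.

This will handle a single duplicate. I'd have to know more about the data to argue how reliable it is. It may be good enough for something that gets verified manually. I'll see if I can think of something for multiple duplicates.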

with T as (
    select
        batch,
        min("date") as dt,
        min(reference) as reference,
        min(from_state) as from_state,
        min(to_state) as to_state,
        min(item) as item_min, max(item) as item_max, sum(item) as item_sum,
        min(qty) as qty_min, max(qty) as qty_max, sum(qty) as qty_sum,
        count(*) as cnt
    from <yourdata>
    group by batch
)
select t1.batch
from T t1 inner join T t2
    on t2.batch > t1.batch and t2.reference <> t1.reference
        and t2.dt = t1.dt
        and t2.from_state = t1.from_state and t2.to_state = t1.to_state
        and t2.item_min = t1.item_min and t2.qty_min = t1.qty_min
        and t2.item_max = t1.item_max and t2.qty_max = t1.qty_max
        and t2.item_sum = t1.item_sum and t2.qty_sum = t1.qty_sum
        and t2.cnt = t1.cnt
group by t1.batch

I'm not sure what type you're using for item. You may need a cast to get sum() to work.
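To try the query above without a SQL Server instance, here is a minimal sketch that runs the same aggregate-signature self-join against the question's sample rows in an in-memory SQLite database through Python's sqlite3 module. The table name shipments and the function name find_duplicated_batches are placeholders invented for this illustration. SQLite only stands in for SQL Server here, so type behavior differs: SQLite integers are 64-bit, which is why sum(item) needs no cast in this sketch.

```python
import sqlite3

# Sample rows from the question: batch, date, reference, from_state, to_state, item, qty
ROWS = [
    (1234567, "2016-03-01", 8213, "MT", "CA", 11122334455, 2),
    (1234567, "2016-03-01", 8213, "MT", "CA", 66622334455, 1),
    (1234567, "2016-03-01", 8213, "MT", "CA", 77722334455, 5),
    (1239764, "2016-03-01", 8597, "MT", "CA", 11122334455, 2),
    (1239764, "2016-03-01", 8597, "MT", "CA", 66622334455, 1),
    (1239764, "2016-03-01", 8597, "MT", "CA", 77722334455, 5),
    (3456781, "2016-03-01", 6854, "MT", "CA", 112233111, 2),    # shown as 00112233111
    (3456781, "2016-03-01", 6854, "MT", "CA", 112255111, 1),    # shown as 00112255111
    (3456781, "2016-03-01", 6854, "MT", "CA", 33322334455, 5),
]

QUERY = """
with T as (
    select batch,
           min("date") as dt,
           min(reference) as reference,
           min(from_state) as from_state,
           min(to_state) as to_state,
           min(item) as item_min, max(item) as item_max, sum(item) as item_sum,
           min(qty) as qty_min, max(qty) as qty_max, sum(qty) as qty_sum,
           count(*) as cnt
    from shipments
    group by batch
)
select t1.batch
from T t1 inner join T t2
    on t2.batch > t1.batch and t2.reference <> t1.reference
   and t2.dt = t1.dt
   and t2.from_state = t1.from_state and t2.to_state = t1.to_state
   and t2.item_min = t1.item_min and t2.qty_min = t1.qty_min
   and t2.item_max = t1.item_max and t2.qty_max = t1.qty_max
   and t2.item_sum = t1.item_sum and t2.qty_sum = t1.qty_sum
   and t2.cnt = t1.cnt
group by t1.batch
order by t1.batch
"""


def find_duplicated_batches(rows):
    """Return batches that have a later batch with the same aggregate signature."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        'create table shipments (batch integer, "date" text, reference integer, '
        "from_state text, to_state text, item integer, qty integer)"
    )
    conn.executemany("insert into shipments values (?, ?, ?, ?, ?, ?, ?)", rows)
    result = [row[0] for row in conn.execute(QUERY)]
    conn.close()
    return result


print(find_duplicated_batches(ROWS))
```

Because the join requires t2.batch > t1.batch, only the lower-numbered batch of a matching pair is reported. Matching on min, max, sum and count is a heuristic: two different batches can share all of those aggregates, so treat the output as a list of candidates.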

Edit: I think this one is more robust when there are multiple sets of duplicates. I can't speak to its performance, though.

with T as (
    select
        batch,
        min("date") as dt,
        min(reference) as reference,
        min(from_state) as from_state,
        min(to_state) as to_state,
        min(item) as item_min, max(item) as item_max, sum(item) as item_sum,
        min(qty) as qty_min, max(qty) as qty_max, sum(qty) as qty_sum,
        count(*) as cnt
    from <yourdata>
    group by batch
),
pairs as (
    select t1.*, t2.batch as batch2
    from T t1 inner join T t2
        on t2.batch > t1.batch and t2.reference <> t1.reference
            and t2.dt = t1.dt
            and t2.from_state = t1.from_state and t2.to_state = t1.to_state
            and t2.item_min = t1.item_min and t2.qty_min = t1.qty_min
            and t2.item_max = t1.item_max and t2.qty_max = t1.qty_max
            and t2.item_sum = t1.item_sum and t2.qty_sum = t1.qty_sum
            and t2.cnt = t1.cnt
)
select distinct
    min(batch) over (
        partition by
            dt, from_state, to_state,
            item_min, item_max, item_sum, qty_min, qty_max, qty_sum, cnt
        ) as orig_batch,
    batch2 as dup_batch
from pairs

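The "original" batch is the one with the lowest ID.

You may want to take the idea one step further and compare the individual rows of the pairs that the aggregates have already pre-screened. Append this to the CTEs above: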

...
, matches as (
    -- compare the detail rows of each candidate pair, line by line
    select p.batch, p.batch2
    from
        pairs p inner join
        <yourdata> d1 on d1.batch = p.batch full outer join
        <yourdata> d2 on d2.batch = p.batch2
            and d2."date" = d1."date"
            and d2.from_state = d1.from_state and d2.to_state = d1.to_state
            and d2.item = d1.item and d2.qty = d1.qty
    group by p.batch, p.batch2
    having
            count(d1."date") = count(*) and count(d2."date") = count(*)
        and count(d1.from_state) = count(*) and count(d2.from_state) = count(*)
        and count(d1.to_state) = count(*) and count(d2.to_state) = count(*)
        and count(d1.item) = count(*) and count(d2.item) = count(*)
        and count(d1.qty) = count(*) and count(d2.qty) = count(*)
)
select distinct
    -- batch and batch2 exist in both pairs and matches, so qualify them
    min(p.batch) over (
        partition by
            dt, from_state, to_state,
            item_min, item_max, item_sum, qty_min, qty_max, qty_sum, cnt
        ) as orig_batch,
    p.batch2 as dup_batch
from pairs p inner join matches m on m.batch = p.batch and m.batch2 = p.batch2
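The row-by-row matching idea can also be checked outside the database. The sketch below is a different technique from the SQL above, not a translation of it: it builds an exact fingerprint for each batch (the sorted tuple of its rows, ignoring batch and reference) and groups batches whose fingerprints are equal. The function name group_identical_batches is hypothetical, and unlike the SQL it does not require the reference numbers to differ.

```python
from collections import defaultdict

# Sample rows from the question: batch, date, reference, from_state, to_state, item, qty
ROWS = [
    (1234567, "2016-03-01", 8213, "MT", "CA", 11122334455, 2),
    (1234567, "2016-03-01", 8213, "MT", "CA", 66622334455, 1),
    (1234567, "2016-03-01", 8213, "MT", "CA", 77722334455, 5),
    (1239764, "2016-03-01", 8597, "MT", "CA", 11122334455, 2),
    (1239764, "2016-03-01", 8597, "MT", "CA", 66622334455, 1),
    (1239764, "2016-03-01", 8597, "MT", "CA", 77722334455, 5),
    (3456781, "2016-03-01", 6854, "MT", "CA", 112233111, 2),    # shown as 00112233111
    (3456781, "2016-03-01", 6854, "MT", "CA", 112255111, 1),    # shown as 00112255111
    (3456781, "2016-03-01", 6854, "MT", "CA", 33322334455, 5),
]


def group_identical_batches(rows):
    """Group batch ids whose rows are identical apart from batch and reference."""
    # Collect the detail lines of every batch.
    contents = defaultdict(list)
    for batch, date, _reference, from_state, to_state, item, qty in rows:
        contents[batch].append((date, from_state, to_state, item, qty))

    # Batches with the same sorted lines share a fingerprint.
    groups = defaultdict(list)
    for batch, lines in contents.items():
        groups[tuple(sorted(lines))].append(batch)

    # Keep only fingerprints shared by two or more batches; lowest id comes first.
    return sorted(sorted(batches) for batches in groups.values() if len(batches) > 1)


print(group_identical_batches(ROWS))
```

Each inner list starts with the lowest batch id, which matches the answer's convention that the "original" batch is the one with the lowest ID.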

Answer 1: (score: 0)

Thanks to everyone who helped me work through this.

- Put the result set into the temp table #TEMP_baseresults

SELECT batch
,reference 
,from_state
,to_state
,item
,qty
INTO #TEMP_baseresults
FROM datasource

- Find all duplicates that have the SAME from_state, to_state, item and qty

SELECT from_state
,to_state
,item
,qty
,count(*) as 'count'
INTO #TEMP_batchduplicates
FROM #TEMP_baseresults
GROUP BY from_state
,to_state
,item
,qty
HAVING COUNT(*) > 1
ORDER BY from_state
,to_state
,item
,qty

- Join the duplicates table back to the base table

SELECT *
FROM #TEMP_baseresults base
JOIN #TEMP_batchduplicates dup
ON dup.from_state = base.from_state
AND dup.to_state = base.to_state
AND dup.item = base.item
AND dup.qty = base.qty
ORDER BY base.from_state
,base.to_state
,base.item

The result looks like this:

-----------------------------------------------------------------------------------------
| batch   | date        | reference | from_state | to_state | item        | qty | count |
-----------------------------------------------------------------------------------------
| 1234567 | 2016-03-01  | 8213      |  MT        | CA       | 11122334455 | 2   | 2     |
-----------------------------------------------------------------------------------------
| 1234567 | 2016-03-01  | 8213      |  MT        | CA       | 66622334455 | 1   | 2     |
-----------------------------------------------------------------------------------------
| 1234567 | 2016-03-01  | 8213      |  MT        | CA       | 77722334455 | 5   | 2     |
-----------------------------------------------------------------------------------------
| 1239764 | 2016-03-01  | 8597      |  MT        | CA       | 11122334455 | 2   | 2     |
-----------------------------------------------------------------------------------------
| 1239764 | 2016-03-01  | 8597      |  MT        | CA       | 66622334455 | 1   | 2     |
-----------------------------------------------------------------------------------------
| 1239764 | 2016-03-01  | 8597      |  MT        | CA       | 77722334455 | 5   | 2     |
-----------------------------------------------------------------------------------------

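This filters my data set down to only the records identified as duplicates, and additionally shows how many times each row's data appears.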
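The three steps above can be sketched as one query by joining the base rows to a grouped subquery instead of going through temp tables. This is a minimal illustration run against SQLite through Python's sqlite3 module rather than SQL Server; the table name baseresults, the alias cnt and the function name duplicated_rows are assumptions made for the example.

```python
import sqlite3

# Sample rows from the question, without the date column: batch, reference,
# from_state, to_state, item, qty
ROWS = [
    (1234567, 8213, "MT", "CA", 11122334455, 2),
    (1234567, 8213, "MT", "CA", 66622334455, 1),
    (1234567, 8213, "MT", "CA", 77722334455, 5),
    (1239764, 8597, "MT", "CA", 11122334455, 2),
    (1239764, 8597, "MT", "CA", 66622334455, 1),
    (1239764, 8597, "MT", "CA", 77722334455, 5),
    (3456781, 6854, "MT", "CA", 112233111, 2),    # shown as 00112233111
    (3456781, 6854, "MT", "CA", 112255111, 1),    # shown as 00112255111
    (3456781, 6854, "MT", "CA", 33322334455, 5),
]

QUERY = """
select base.batch, base.reference, base.from_state, base.to_state,
       base.item, base.qty, dup.cnt
from baseresults base
join (
    -- same grouping as the #TEMP_batchduplicates step
    select from_state, to_state, item, qty, count(*) as cnt
    from baseresults
    group by from_state, to_state, item, qty
    having count(*) > 1
) dup
  on dup.from_state = base.from_state
 and dup.to_state = base.to_state
 and dup.item = base.item
 and dup.qty = base.qty
order by base.batch, base.item
"""


def duplicated_rows(rows):
    """Return every base row whose (from_state, to_state, item, qty) occurs more than once."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table baseresults (batch integer, reference integer, "
        "from_state text, to_state text, item integer, qty integer)"
    )
    conn.executemany("insert into baseresults values (?, ?, ?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


for row in duplicated_rows(ROWS):
    print(row)
```

Note that this approach matches individual rows, not whole batches: any two rows that share from_state, to_state, item and qty are flagged, even when the batches they belong to overlap only partially.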