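Finding overlapping data sets in a table

Asked: 2014-08-06 13:55:05

Tags: sql sql-server

I need to identify duplicate sets of data and give the sets whose data is identical a shared group id.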

id     threshold     cost
--     ----------    ----------
1      0             9
1      100           7
1      500           6
2      0             9
2      100           7
2      500           6

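I have thousands of these sets, and most of them are identical apart from the id. I need to find all sets that have the same thresholds and cost amounts and give them a group id. I'm not sure where to begin. Is the best approach to iterate, insert each set into a table, and then loop through every set already in that table to check whether it exists?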

2 answers:

Answer 0: (score: 1)

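This is one of those cases where you could try to do something with relational operators. Alternatively, you can say: "let's put all the information into a single string and use that as the group id". SQL Server seems to discourage this approach, but it is possible. So, let's characterize the groups using:
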
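-- Build one string per id listing its threshold-cost pairs in threshold order.
-- ORDER BY has to come before FOR XML in the subquery.
select d.id,
       (select cast(threshold as varchar(8000)) + '-' + cast(cost as varchar(8000)) + ';'
        from data d2
        where d2.id = d.id
        order by threshold
        for xml path ('')
       ) as groupname
from data d
group by d.id;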
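
To make the idea concrete outside the database, here is a minimal Python sketch of the same signature-building step, run on the sample rows from the question. The function name `build_signatures` and the in-memory `rows` list are illustrative assumptions and not part of the SQL above.

```python
from collections import defaultdict

# Sample rows from the question: (id, threshold, cost).
rows = [
    (1, 0, 9), (1, 100, 7), (1, 500, 6),
    (2, 0, 9), (2, 100, 7), (2, 500, 6),
]

def build_signatures(rows):
    """Return {id: 'threshold-cost;...'} with pairs sorted by threshold, then cost."""
    pairs_by_id = defaultdict(list)
    for set_id, threshold, cost in rows:
        pairs_by_id[set_id].append((threshold, cost))
    return {
        set_id: "".join(f"{t}-{c};" for t, c in sorted(pairs))
        for set_id, pairs in pairs_by_id.items()
    }

signatures = build_signatures(rows)
print(signatures)  # {1: '0-9;100-7;500-6;', 2: '0-9;100-7;500-6;'}
```

Two ids with the same rows end up with the same string, which is what lets the string act as a group key.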

I think that solves your problem. The groupname can serve as the group id. If you want a numeric id (which is probably a good idea), use dense_rank():

-- Identical groupname strings get the same dense_rank() value.
select d.id, dense_rank() over (order by groupname) as groupid
from (select d.id,
             (select cast(threshold as varchar(8000)) + '-' + cast(cost as varchar(8000)) + ';'
              from data d2
              where d2.id = d.id
              order by threshold
              for xml path ('')
             ) as groupname
      from data d
      group by d.id
     ) d;
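
As a rough illustration of what the ranking step does, here is a small Python sketch that assigns consecutive numbers to distinct signature strings. The helper name `dense_rank` and the third set (id 3) are hypothetical, added only to show a non-matching case. The ordering of the numbers may differ from SQL Server's collation, but which ids share a number does not.

```python
def dense_rank(signatures):
    """Map each id to a 1-based group number; equal signatures share a number."""
    ordered = sorted(set(signatures.values()))
    rank_of = {sig: n for n, sig in enumerate(ordered, start=1)}
    return {set_id: rank_of[sig] for set_id, sig in signatures.items()}

signatures = {
    1: "0-9;100-7;500-6;",
    2: "0-9;100-7;500-6;",
    3: "1-9;100-7;500-6;",  # hypothetical third set that differs in its first threshold
}

group_ids = dense_rank(signatures)
print(group_ids)  # {1: 1, 2: 1, 3: 2}
```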

Answer 1: (score: 0)

Here is my solution, based on my interpretation of the question:

IF OBJECT_ID('tempdb..#tempGrouping') IS NOT NULL DROP TABLE #tempGrouping;

-- Sample data standing in for the real table; id 3 differs from ids 1 and 2 in its first threshold.
WITH BaseTable AS 
(
              SELECT 1 AS id, 0 AS threshold, 9 AS cost
        UNION SELECT 1, 100, 7
        UNION SELECT 1, 500, 6
        UNION SELECT 2, 0, 9
        UNION SELECT 2, 100, 7
        UNION SELECT 2, 500, 6
        UNION SELECT 3, 1, 9
        UNION SELECT 3, 100, 7
        UNION SELECT 3, 500, 6
)

, BaseCTE AS 
(
    -- One row per id, with its threshold/cost pairs concatenated in a fixed order.
    SELECT 
        id,
        (
            SELECT CAST(TblGrouping.threshold AS varchar(8000)) + '/' + CAST(TblGrouping.cost AS varchar(8000)) + ';'
            FROM BaseTable AS TblGrouping 
            WHERE TblGrouping.id = BaseTable.id
            ORDER BY TblGrouping.threshold, TblGrouping.cost
            FOR XML PATH ('')
        ) AS MultiGroup 
    FROM BaseTable 
    GROUP BY id 
) 
, CTE AS 
(
    -- Identical MultiGroup strings receive the same GroupId.
    SELECT 
         * 
        ,DENSE_RANK() OVER (ORDER BY MultiGroup) AS GroupId  
    FROM BaseCTE 
)
SELECT * 
INTO #tempGrouping
FROM CTE;



-- SELECT * FROM #tempGrouping; 

-- Copy the computed group ids back. BaseTable here is the real table and needs a GroupId column.
UPDATE BaseTable 
    SET BaseTable.GroupId = #tempGrouping.GroupId 
FROM BaseTable 
INNER JOIN #tempGrouping 
    ON BaseTable.Id = #tempGrouping.Id;

IF OBJECT_ID('tempdb..#tempGrouping') IS NOT NULL DROP TABLE #tempGrouping;

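BaseTable stands for your table, and you don't need the CTE "BaseTable", because you already have a table with the data.
If the threshold and cost fields can be NULL, you may need to take extra precautions.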
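
One possible precaution, sketched in Python rather than T-SQL: write a NULL as an explicit placeholder instead of letting it vanish from the concatenated string. The concern is that concatenating NULL with `+` typically yields NULL under SQL Server's default settings, so a row with a NULL cost may contribute nothing to the string, and two different sets could then look identical. The names `NULL_TOKEN`, `sort_key`, and `signature` are hypothetical. In SQL the analogous step would presumably be wrapping each column in something like `COALESCE` before concatenating.

```python
NULL_TOKEN = "NULL"  # hypothetical placeholder; choose one that cannot collide with real values

def sort_key(pair):
    """Order pairs deterministically, placing None before any real value."""
    return tuple((value is not None, 0 if value is None else value) for value in pair)

def signature(pairs):
    """Build 'threshold/cost;' text, writing None as NULL_TOKEN instead of dropping it."""
    def show(value):
        return NULL_TOKEN if value is None else str(value)
    return "".join(f"{show(t)}/{show(c)};" for t, c in sorted(pairs, key=sort_key))

with_null = [(0, 9), (100, None)]   # second row has no cost
without_null = [(0, 9)]

print(signature(with_null))     # 0/9;100/NULL;
print(signature(without_null))  # 0/9;
```

With the placeholder in place, the two sets get different strings and would therefore land in different groups.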