How do I find sets of rows with one or more matching fields and assign a set id to each matching set?

Date: 2014-10-23 19:16:54

Tags: sql sql-server tsql query-performance

I need to find sets of rows in which one or more fields match.

For example:

VendorMaster

VendorId    |    VendorName   |   Phone     |   Address   |    Fax    
------------------------------------------------------------------------
1                AAAA              10101           Street1         111
2                BBBB              20202           Street2         222
3                CCCC              30303           Street3         333
4                DDDD              40404           Street2         444
5                FFFF              50505           Street5         555
6                GGGG              60606           Street6         444
7                HHHH              10101           Street6         777

SELECT vm.VendorId
FROM VendorMaster vm
WHERE EXISTS
   ( SELECT 1 FROM VendorMaster vm1
     WHERE vm1.VendorId <> vm.VendorId
     AND (vm1.Phone = vm.Phone OR vm1.Address = vm.Address OR vm1.Fax = vm.Fax) )

The query above returns the matching records, but my requirement is to assign a set id to each group of matching records.

Like this:

SetId     |  VendorId
---------------------
 1000             1
 1000             7        //1 and 7- Phone numbers are matching
 1001             2
 1001             4        //2 and 4 - Address matching
 1001             6        // 4 and 6 - Fax matching

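Please tell me how to write a query that assigns set ids to the matching sets. Query performance is also critical, because the table holds roughly 100,000 records.

Thanks

3 answers:

Answer 0 (score: 1)

I believe this will give you the result you want. There is a little explanation in the comments; let me know if you need more.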

with relations 
--Get all single relationships between vendors.
as (
    select t1.vendorId firstId,
        t2.vendorId secondId
    from VendorMaster  t1
    inner join VendorMaster  t2 on t1.vendorId < t2.vendorId and(
            t1.Phone = t2.Phone
            or t1.address = t2.address
            or t1.Fax = t2.Fax
            )
    ),
recurseLinks
--Recurse the relationships. The tree column holds the comma-delimited ids
--already on the path, so a vendor is never visited twice.
as (
    select r.*, CAST(',' + CAST(r.firstId AS VARCHAR(20)) + ',' AS VARCHAR(MAX)) tree
    from relations r

    union all

    --Forward links: extend [A,B] with [B,C] to get [A,C].
    select r.firstId,
        l.secondId,
        cast(r.tree + CAST(l.secondId AS varchar(20)) + ',' as varchar(max))
    from relations l
    inner join recurseLinks r on r.secondId = l.firstId and r.tree not like '%,' + cast(l.secondId as varchar(20)) + ',%'

    union all

    --Non-forward links: combine [A,D] with [C,D] to get [A,C].
    select r.firstId,
        l.firstId,
        cast(r.tree + CAST(l.firstId AS varchar(20)) + ',' as varchar(max))
    from relations l
    inner join recurseLinks r on r.secondId = l.secondId and r.tree not like '%,' + cast(l.firstId as varchar(20)) + ',%'
    ),
removeInvalid
--Removed invalid relationships.
as (
    select l1.firstId, l1.secondId
    from recurseLinks l1
    where l1.firstId < l1.secondId
    ),
removeIntermediate
--Removed intermediate relationships.
as (
    select distinct l1.*
    from removeInvalid l1
    left join removeInvalid l2 on l2.secondId = l1.firstId
    where l2.firstId is null
    )
select result.secondId,
    dense_rank() over(order by result.firstId) SetId
from (
    select firstId,
        secondId
    from removeIntermediate 

    union all

    select distinct firstId,
        firstId
    from removeIntermediate 
    ) result;

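The 'relations' named result set returns all VendorMaster relationships, that is, pairs of vendors that share a common phone, address, or fax. It returns only [A,B]; it does not return the reverse relationship [B,A].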
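To see what this first step produces, the sketch below runs the same join against the sample rows from the question. The table variable @VendorMaster is a stand-in introduced here for illustration; it is not part of the answer.

```sql
-- Sample rows from the question, held in a table variable (illustrative stand-in).
DECLARE @VendorMaster TABLE (
    VendorId   INT PRIMARY KEY,
    VendorName VARCHAR(20),
    Phone      VARCHAR(20),
    Address    VARCHAR(20),
    Fax        VARCHAR(20)
);

INSERT INTO @VendorMaster (VendorId, VendorName, Phone, Address, Fax)
VALUES (1, 'AAAA', '10101', 'Street1', '111'),
       (2, 'BBBB', '20202', 'Street2', '222'),
       (3, 'CCCC', '30303', 'Street3', '333'),
       (4, 'DDDD', '40404', 'Street2', '444'),
       (5, 'FFFF', '50505', 'Street5', '555'),
       (6, 'GGGG', '60606', 'Street6', '444'),
       (7, 'HHHH', '10101', 'Street6', '777');

-- One row per related pair, smaller id first, so [A,B] appears but [B,A] does not.
SELECT t1.VendorId AS firstId,
       t2.VendorId AS secondId
FROM @VendorMaster t1
INNER JOIN @VendorMaster t2
    ON t1.VendorId < t2.VendorId
   AND (t1.Phone = t2.Phone OR t1.Address = t2.Address OR t1.Fax = t2.Fax)
ORDER BY firstId, secondId;
```

On these rows the join returns the pairs (1,7), (2,4), (4,6) and (6,7); the last pair comes from the shared address Street6.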

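The 'recurseLinks' named result set is a bit more complex. It recursively joins all rows that are related to each other. The path column (tree) tracks the lineage so that the recursion does not get stuck in an infinite loop. The first query of this union selects all relationships from the 'relations' named result set. The second query selects all forward recursive relationships, so given [A,B], [B,C] and [C,D], then [A,C], [A,D] and [B,D] are added to the result set. The third query selects all non-forward recursive relationships, so given [A,D], [C,D] and [B,C], then [A,C], [A,B] and [B,D] are added to the result set.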

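The 'removeInvalid' named result set removes any invalid intermediate relationships added by the recursive query, for example [B,A], because we already have [A,B]. Note that this could be prevented inside the 'recurseLinks' result set with some effort.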

The 'removeIntermediate' named result set removes any intermediate relationships. So given [A,B], [B,C], [C,D], [A,C] and [A,D], it removes [B,C] and [C,D].

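The final query selects the current results and adds the self-relationships. So given [A,B], [A,C] and [A,D], it adds [A,A]. What comes out is the final result set.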

Answer 1 (score: 0)

You can use the built-in ranking functions to accomplish this. For example, for unique address values:

DECLARE @VendorMaster TABLE ( VendorID INT, Vendorname VARCHAR(20), Phone VARCHAR(20), Address VARCHAR(20), Fax VARCHAR(20) )
INSERT INTO @VendorMaster
  (VendorID, Vendorname, Phone,   Address,   Fax )
VALUES
  (1,        'AAAA',     '10101', 'Street1', '111'),
  (2,        'BBBB',     '20202', 'Street2', '222'),
  (3,        'CCCC',     '30303', 'Street3', '333'),
  (4,        'DDDD',     '40404', 'Street2', '444'),
  (5,        'FFFF',     '50505', 'Street5', '555'),
  (6,        'GGGG',     '60606', 'Street6', '444'),
  (7,        'HHHH',     '10101', 'Street6', '777')

SELECT 
  DenseRank = DENSE_RANK() OVER ( ORDER BY Address )
 ,RowNumber = ROW_NUMBER() OVER ( ORDER BY VendorID )
 ,* FROM @VendorMaster

Result

DenseRank   RowNumber   VendorID    Vendorname  Phone   Address   Fax
1           1           1           AAAA        10101   Street1   111
2           2           2           BBBB        20202   Street2   222
3           3           3           CCCC        30303   Street3   333
2           4           4           DDDD        40404   Street2   444
4           5           5           FFFF        50505   Street5   555
5           6           6           GGGG        60606   Street6   444
5           7           7           HHHH        10101   Street6   777

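If you need to persist these SetId values, you can create a separate table with an identity column to track the value associated with each SetID. It sounds like you may simply want to normalize your database and break the data elements out into their own tables, linked through identity-column relationships.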
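A minimal sketch of that idea for the address column only. The table name #AddressSet and the identity seed of 1000 are assumptions made for illustration, and the sample rows are cut down to the two columns used.

```sql
-- Sample rows from this answer, reduced to the columns used here.
DECLARE @VendorMaster TABLE (VendorID INT, Address VARCHAR(20));

INSERT INTO @VendorMaster (VendorID, Address)
VALUES (1, 'Street1'), (2, 'Street2'), (3, 'Street3'), (4, 'Street2'),
       (5, 'Street5'), (6, 'Street6'), (7, 'Street6');

-- Hypothetical lookup table: the identity column hands out one stable SetId per address.
CREATE TABLE #AddressSet (
    SetId   INT IDENTITY(1000, 1) PRIMARY KEY,
    Address VARCHAR(20) NOT NULL UNIQUE
);

INSERT INTO #AddressSet (Address)
SELECT DISTINCT Address
FROM @VendorMaster;

-- Vendors pick up their SetId through the shared address value.
SELECT s.SetId, v.VendorID
FROM @VendorMaster v
JOIN #AddressSet s ON s.Address = v.Address
ORDER BY s.SetId, v.VendorID;

DROP TABLE #AddressSet;
```

This groups on one column at a time; phone and fax would each need their own lookup table under the same scheme.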

Answer 2 (score: 0)

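Although Will's answer is very clever, I have never been a fan of recursive CTEs: they always run fine on small sets but become very slow on larger ones, and sometimes they hit the MAXRECURSION limit.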
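For reference, SQL Server stops a recursive CTE after 100 recursion levels unless the statement carries a MAXRECURSION hint. The sketch below uses a made-up number generator, unrelated to the vendor data, to show the hint.

```sql
-- A recursive CTE that needs 499 recursion levels, more than the default limit of 100.
WITH numbers AS (
    SELECT 1 AS n
    UNION ALL
    SELECT n + 1 FROM numbers WHERE n < 500
)
SELECT COUNT(*) AS cnt
FROM numbers
OPTION (MAXRECURSION 1000);  -- raise the limit; a value of 0 removes it entirely
```

Without the OPTION clause the statement is terminated once the 100-level limit is exhausted.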

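Personally, I would try to solve this by first putting each VendorID in its own SetID, and then merging each higher SetID into the lowest SetID that contains a matching vendor.

It looks like this: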

-- create test-code
IF OBJECT_ID('VendorMaster') IS NOT NULL DROP TABLE VendorMaster
GO

CREATE TABLE VendorMaster
    ([VendorID] int IDENTITY(1,1) PRIMARY KEY, [Vendorname] nvarchar(100), [Phone] nvarchar(100) , [Address] nvarchar(100), [Fax] nvarchar(100))
;

INSERT INTO VendorMaster
    ([Vendorname], [Phone], [Address], [Fax])
VALUES
    ('AAAA',     '10101', 'Street1', '111'),
    ('BBBB',     '20202', 'Street20', '222'),
    ('CCCC',     '30303', 'Street3', '333'),
    ('DDDD',     '40404', 'Street2', '444'),
    ('FFFF',     '50505', 'Street5', '555'),
    ('GGGG',     '60606', 'Street6', '444'),
    ('HHHH',     '10101', 'Street6', '777'),
    ('IIII',     '80808', 'Street20', '888'),
    ('JJJJ',     '90909', 'Street9', '888');

GO
-- create sets and start shifting & merging
DECLARE @rowcount int

SELECT SetID = 1000 + ROW_NUMBER() OVER (ORDER BY VendorID),
       VendorID
  INTO #result
  FROM VendorMaster

SELECT @rowcount = @@ROWCOUNT

CREATE UNIQUE CLUSTERED INDEX uq0 ON #result (VendorID)

WHILE @rowcount > 0
    BEGIN
        -- find lowest SetID that has a match with current record
        ;WITH shifting
           AS (SELECT newSetID = Min(n.SetID), 
                      oldSetID = o.SetID
                 FROM #result o
                 JOIN #result n
                   ON n.SetID < o.SetID
                 JOIN VendorMaster vo
                   ON vo.VendorID = o.VendorID
                 JOIN VendorMaster vn
                   ON vn.VendorID = n.VendorID
                WHERE vn.Vendorname = vo.Vendorname
                   OR vn.Phone = vo.Phone
                   OR vn.Address = vo.Address
                   OR vn.Fax = vo.Fax
                GROUP BY o.SetID)
        UPDATE #result
           SET SetID = s.newSetID
          FROM #result upd
          JOIN shifting s
            ON s.oldSetID = upd.SetID
           AND s.newSetID < upd.SetID

        SELECT @rowcount = @@ROWCOUNT

    END

-- delete 'single-member-sets' for consistency in compare with CTE of Will
DELETE #result 
  FROM #result del
 WHERE NOT EXISTS ( SELECT *
                      FROM #result xx
                     WHERE xx.SetID = del.SetID
                       AND xx.VendorID <> del.VendorID)

-- fix 'holes'
UPDATE #result 
   SET SetID = 1 + (SELECT COUNT(DISTINCT SetID)
                      FROM #result xx
                     WHERE xx.SetID < upd.SetID)
  FROM #result upd

-- show result
SELECT * FROM #result ORDER BY SetID, VendorID

When run on the provided test case, I get the same results as with the CTE, but it takes longer.
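One way to check either approach independently of the other is to look for matching vendors that ended up in different sets. The sketch below uses small stand-in tables (#vendors and #sets are hypothetical names with made-up rows); against the real data the same join would run on VendorMaster and the result table.

```sql
-- Illustrative stand-ins: a tiny vendor table and a candidate SetID assignment.
CREATE TABLE #vendors (
    VendorID INT PRIMARY KEY,
    Phone    VARCHAR(20),
    Address  VARCHAR(20),
    Fax      VARCHAR(20)
);
CREATE TABLE #sets (SetID INT, VendorID INT PRIMARY KEY);

INSERT INTO #vendors (VendorID, Phone, Address, Fax)
VALUES (1, '10101', 'Street1', '111'),
       (2, '20202', 'Street2', '222'),
       (3, '10101', 'Street3', '333');

INSERT INTO #sets (SetID, VendorID)
VALUES (1, 1), (2, 2), (1, 3);

-- Any row returned is a pair of matching vendors that sit in different sets.
SELECT a.VendorID AS firstId,
       b.VendorID AS secondId
FROM #vendors a
JOIN #vendors b
    ON a.VendorID < b.VendorID
   AND (a.Phone = b.Phone OR a.Address = b.Address OR a.Fax = b.Fax)
JOIN #sets sa ON sa.VendorID = a.VendorID
JOIN #sets sb ON sb.VendorID = b.VendorID
WHERE sa.SetID <> sb.SetID;

DROP TABLE #vendors, #sets;
```

An empty result means every directly matching pair shares a SetID. The loop above also matches on Vendorname, so that column would need to be added to the join condition when checking its output.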

Things get interesting when I add some extra test data.

DECLARE @counter int = 7

WHILE @counter > 0
    BEGIN

        INSERT VendorMaster ([Vendorname], [Phone], [Address], [Fax])
        SELECT [Vendorname] = NewID(), 
               [Phone]      = ABS(BINARY_CHECKSUM(NewID())) % 1500, 
               [Address]    = NewID(), 
               [Fax]        = NewID()
          FROM VendorMaster 

        SELECT @counter = @counter - 1
    END

SELECT COUNT(*) FROM VendorMaster

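This gives me 1152 test records that contain the matches we already had, plus some new matching phone numbers (the NewID() values will never match each other), which makes the result easier to verify.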

When I run the query above on this data, I get 604 sets in just under 2 seconds. However, when I run the CTE on it,