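Transitive matching in SQL

Date: 2019-03-26 11:20:52

Tags: sql sql-server

I am working on a requirement where I need to match a set of records within a group (G1) on certain fields, and then regroup the matching records into new unique groups (NG1, NG2, ...). The requirement is as follows:

Sample data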

DECLARE @table TABLE ([Group] varchar(3), Member varchar(3), Address varchar(3), Phone varchar(3), Email varchar(3)) 

insert @table values
('G1', 'M1', 'A1', 'P1', 'E1'),
('G1', 'M2', 'A2', 'P2', 'E2'),
('G1', 'M3', 'A1', 'P3', 'E1'),
('G1', 'M4', 'A4', 'P3', 'E4'),
('G1', 'M5', 'A5', 'P5', 'E2'),
('G1', 'M6', 'A6', 'P6', 'E6'),
('G1', 'M7', 'A7', 'P6', 'E7'),
('G1', 'M8', 'A8', 'P8', 'E4'),
('G1', 'M9', 'A9', 'P9', 'E7'),
('G1', 'M10', 'A10', 'P10', 'E10')

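In the sample data above, M1, M3, M4 and M8 should end up together: M1 and M3 match on Address and Email; M3 in turn matches M4 on Phone; and M4 in turn matches M8 on Email. In other words, they are related through one or more attributes.

Similarly, M6, M7 and M9 should be in another unique group, and M2 and M5 in a group of their own (they match on Email).

M10 will be in a group by itself, since no other record matches it.

There will be other top-level groups like G1.

Can someone help? Note: I am using MS SQL Server.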

2 answers:

Answer 0: (score: 1)

In Microsoft SQL Server, assuming the data is in a table named "DataTable", I would do the following:

WITH
    [Matches] AS
    (
        SELECT
            D1.[Group],
            D1.[Member],
            D2.[Member] AS [PreviousMatchingMember]
        FROM
            [DataTable] AS D1
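            -- For each row, find the first member (in string order) of the same
            -- group that shares an address, phone or email with it.
            -- [Member] is compared as a string, so 'M10' sorts before 'M2'.
            OUTER APPLY (SELECT TOP (1) [Member]
                         FROM [DataTable]
                         WHERE
                             [Group] = D1.[Group] AND
                             [Member] < D1.[Member] AND
                             ([Address] = D1.[Address] OR
                              [Phone] = D1.[Phone] OR
                              [Email] = D1.[Email])
                         ORDER BY
                             [Member]) AS D2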
    ),
    [Groups] AS
    (
        SELECT
            [Group],
            [Member],
            [PreviousMatchingMember],
            'NG' + LTRIM(ROW_NUMBER() OVER (ORDER BY [Group], [Member])) AS [NewGroup]
        FROM
            [Matches]
        WHERE
            [PreviousMatchingMember] IS NULL
    UNION ALL
        SELECT
            M.[Group],
            M.[Member],
            M.[PreviousMatchingMember],
            G.[NewGroup]
        FROM
            [Groups] AS G
            INNER JOIN [Matches] AS M ON
                M.[Group] = G.[Group] AND
                M.[PreviousMatchingMember] = G.[Member]
    )
SELECT
    G.[NewGroup],
    G.[Member],
    D.[Address],
    D.[Phone],
    D.[Email]
FROM
    [Groups] AS G
    INNER JOIN [DataTable] AS D ON
        D.[Group] = G.[Group] AND
        D.[Member] = G.[Member]
ORDER BY
    G.[NewGroup],
    G.[Member];

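Edit:

As APC pointed out in his comment on your question, you will have a (huge) problem if one record relates to several other records (through different address/phone/email fields). You may end up with records that could belong to more than one group. You might decide to treat those groups as a single group, but my solution here is not suited to solving a problem that complex.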
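To make that limitation concrete, here is a minimal sketch that is not part of the answer above. The rows are made up: M3 shares an address with M1 and a phone with M2, so it bridges two chains that the "first previous match" query would keep apart. One way to treat such chains as a single group is to give every member its own label and repeatedly lower it to the smallest label among the rows it matches, until a pass changes nothing. The table variable @t, the Label column and the data are assumptions for illustration only.

```sql
-- Hypothetical data: M1 and M2 share nothing directly, but M3 links them.
DECLARE @t TABLE ([Group] varchar(3), Member varchar(3), Address varchar(3),
                  Phone varchar(3), Email varchar(3), Label varchar(3));

INSERT @t ([Group], Member, Address, Phone, Email)
VALUES ('G1', 'M1', 'A1', 'P1', 'E1'),
       ('G1', 'M2', 'A2', 'P2', 'E2'),
       ('G1', 'M3', 'A1', 'P2', 'E3'),
       ('G1', 'M4', 'A4', 'P4', 'E4');

-- Start with every member as its own component.
UPDATE @t SET Label = Member;

DECLARE @changed int = 1;

-- Each pass lowers a row's label to the smallest label among the rows
-- it matches directly; the loop ends when a pass updates no rows.
WHILE @changed > 0
BEGIN
    UPDATE T
    SET Label = X.MinLabel
    FROM @t AS T
    CROSS APPLY (SELECT MIN(O.Label) AS MinLabel
                 FROM @t AS O
                 WHERE O.[Group] = T.[Group]
                   AND (O.Address = T.Address OR
                        O.Phone = T.Phone OR
                        O.Email = T.Email)) AS X
    WHERE X.MinLabel < T.Label;

    SET @changed = @@ROWCOUNT;
END;

-- Rows with the same Label belong to the same new group.
SELECT [Group], Member, Label AS ComponentLabel
FROM @t
ORDER BY [Group], Label, Member;
```

With these rows, M1, M2 and M3 should all end up with the label M1, and M4 keeps its own. The number of passes grows with the length of the longest chain, so on large groups this is a starting point rather than a tuned solution.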

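Answer 1: (score: 0)

It took me 3 CTEs and a couple of cups of coffee, but here it is... My main concern is something I read in the comments:

    "This is a repeatable task. There will be several groups, and we have to do this for each group. The total number of records across all groups could be in the millions."

Because the resource consumption is high, this is not something to run repeatedly. I suggest you use it once to normalize your groups, and then add logic to your application or stored procedure so that new data is stored with the right group.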

DECLARE @table TABLE (id int not null identity, [Group] varchar(3), Member varchar(3), Address varchar(3), Phone varchar(3), Email varchar(3)) 

insert @table values
('G1', 'M1', 'A1', 'P1', 'E1'),
('G1', 'M2', 'A2', 'P2', 'E2'),
('G1', 'M3', 'A1', 'P3', 'E1'),
('G1', 'M4', 'A4', 'P3', 'E4'),
('G1', 'M5', 'A5', 'P5', 'E2'),
('G1', 'M6', 'A6', 'P6', 'E6'),
('G1', 'M7', 'A7', 'P6', 'E7'),
('G1', 'M8', 'A8', 'P8', 'E4'),
('G1', 'M9', 'A9', 'P9', 'E7'),
('G1', 'M10', 'A10', 'P10', 'E10');

with 
/* Find all matches
id  Member  MatchWith
1   M1  M3
2   M2  M5
3   M3  M1
3   M3  M4 ...
*/
matches as (
    SELECT t.id, t.[Group], t.Member, a.member as MatchWith
    from 
    @table t
    outer apply (
        select distinct member 
        from @table 
        where member <> t.member and [group] = t.[group] and (Address = t.Address OR Phone = t.Phone OR Email = t.Email)
    ) a
)
/* Stuffing the matches per member
id  Member  AllMatches
1   M1  M1,M3
2   M2  M2,M5
3   M3  M1,M3,M4 .....
*/
, matchsummary as (
    SELECT DISTINCT id, [Group], Member, STUFF((
                SELECT ',' + Member FROM (
                SELECT m.Member
                UNION ALL
                SELECT DISTINCT MatchWith
                FROM matches
                WHERE Member = m.Member) U
                ORDER BY Member
                FOR XML PATH('')
                ), 1, 1, '') as AllMatches
    FROM matches m
)
/* Recursive CTE to find "cousins" records (M1, M3 matches on Address and Email; M3 in turn matches with M4 on Phone)
id  Member  AllMatches  gr
1   M1  M1,M3   1
2   M2  M2,M5   2
3   M3  M1,M3,M4    1
4   M4  M3,M4,M8    1
*/
, tree as (
    select *, ROW_NUMBER() over (order by id) as gr
    from matchsummary where AllMatches LIKE member+'%'
    /* The groups are created using the Members who are the first one in their matches 
    id  Member  AllMatches  gr
    1   M1  M1,M3   1
    2   M2  M2,M5   2
    6   M6  M6,M7   3
    10  M10 M10 4
    */
    union all
    select s.*, t.gr 
    from matchsummary s
    join tree t on s.Member <> t.Member and s.[Group] = t.[Group] and s.AllMatches NOT LIKE s.member+'%' and t.AllMatches like '%' + s.Member
)
select * from tree
order by id
option(maxrecursion 0)

Output:

id  Group  Member  NewGroup
1   G1     M1      1
2   G1     M2      2
3   G1     M3      1
4   G1     M4      1
5   G1     M5      2
6   G1     M6      3
7   G1     M7      3
8   G1     M8      1
9   G1     M9      3
10  G1     M10     4

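Second option

Given the size of the table, this is the one I would suggest you use. I am not a big fan of loops, but here I think they are worth it, because this way you do not have to process all the data at once.

First, you need to add a new column to the table to store the new group. My first thought was that it would be better to change the application logic so that the group is calculated when a new record is inserted. On second thought, a single insert can cause several groups to merge into one, and you probably need the application to respond quickly. So instead you can set up a job that regroups the data as often as you need. If the table has an UpdatedDate field, you can also optimize this solution with a log table and reprocess only the groups that were modified since the last run.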
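The log-table idea could look roughly like the sketch below. Everything in it is an assumption made for illustration: the #members table (trimmed to the columns that matter here), the #RegroupLog table, and the sample dates. It only shows how to pick the groups that need reprocessing; the regrouping itself is left to whichever script you use.

```sql
-- Hypothetical source table with an UpdatedDate column.
IF OBJECT_ID('tempdb..#members') IS NOT NULL
    DROP TABLE #members;
CREATE TABLE #members ([Group] varchar(3), Member varchar(3), UpdatedDate datetime2(0));

INSERT #members ([Group], Member, UpdatedDate)
VALUES ('G1', 'M1', '2019-03-01 08:00:00'),
       ('G1', 'M2', '2019-03-20 09:30:00'),
       ('G2', 'M1', '2019-03-02 10:00:00');

-- Hypothetical log table: one row per completed regrouping run.
IF OBJECT_ID('tempdb..#RegroupLog') IS NOT NULL
    DROP TABLE #RegroupLog;
CREATE TABLE #RegroupLog (RunFinishedAt datetime2(0) NOT NULL);

INSERT #RegroupLog (RunFinishedAt) VALUES ('2019-03-10 00:00:00');

-- Time of the last completed run; fall back to an old date on the very first run.
DECLARE @lastRun datetime2(0) =
    (SELECT ISNULL(MAX(RunFinishedAt), '1900-01-01') FROM #RegroupLog);

-- Capture the cut-off before reading, so rows changed while the job
-- is running are picked up by the next run.
DECLARE @thisRun datetime2(0) = SYSDATETIME();

-- Only groups with at least one row modified since the last run need regrouping.
IF OBJECT_ID('tempdb..#GroupsToProcess') IS NOT NULL
    DROP TABLE #GroupsToProcess;
SELECT DISTINCT [Group]
INTO #GroupsToProcess
FROM #members
WHERE UpdatedDate > @lastRun;

SELECT [Group] FROM #GroupsToProcess;   -- feed these into the regrouping loop

-- Once the regrouping has finished, record the run.
INSERT #RegroupLog (RunFinishedAt) VALUES (@thisRun);
```

With the sample dates, only G1 has a row changed after the logged run, so only G1 should be selected. Note that an UpdatedDate filter cannot see deleted rows; if deletions matter, they would need to be tracked separately.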

 IF OBJECT_ID('tempdb..#table') IS NOT NULL
    DROP TABLE #table;
CREATE TABLE #table ([Group] varchar(3), Member varchar(3), Address varchar(3), Phone varchar(3), Email varchar(3)) 

INSERT #table ([Group], Member, Address, Phone, Email)
VALUES
('G1', 'M1', 'A1', 'P1', 'E1'),
('G1', 'M2', 'A2', 'P2', 'E2'),
('G1', 'M3', 'A1', 'P3', 'E1'),
('G1', 'M4', 'A4', 'P3', 'E4'),
('G1', 'M5', 'A5', 'P5', 'E2'),
('G1', 'M6', 'A6', 'P6', 'E6'),
('G1', 'M7', 'A7', 'P6', 'E7'),
('G1', 'M8', 'A8', 'P8', 'E4'),
('G1', 'M9', 'A9', 'P9', 'E7'),
('G1', 'M10', 'A10', 'P10', 'E10');

ALTER TABLE #table ADD newGroup INT

/******************************************************************
START HERE
******************************************************************/

IF OBJECT_ID('tempdb..#Groups') IS NOT NULL
    DROP TABLE #Groups;

SELECT DISTINCT [Group] INTO #Groups FROM #table

DECLARE @Group VARCHAR(3)

WHILE EXISTS (SELECT 1 FROM #Groups)
BEGIN

    SELECT TOP 1 @Group = [Group] FROM #Groups

    UPDATE #table SET newGroup = NULL 
    WHERE [Group] = @Group

    DECLARE @newGroup INT = 1
    DECLARE @member varchar(3)

    WHILE EXISTS (SELECT 1 FROM #table WHERE [Group] = @Group AND newGroup IS NULL)
    BEGIN

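        -- Pick any member of this group that has no new group yet
        SELECT TOP 1 @member = Member FROM #table WHERE [Group] = @Group AND newGroup IS NULL

        -- Seed the new group with that member; restrict to the current group
        -- because member names may repeat across groups
        UPDATE #table SET newGroup = @newGroup
        WHERE [Group] = @Group AND Member = @member

        -- Keep pulling in unassigned rows that share an attribute with any row
        -- already in the new group, until a pass adds nothing
        WHILE @@ROWCOUNT > 0
        BEGIN
            UPDATE T
            SET newGroup = @newGroup
            FROM #table T
            WHERE T.[Group] = @Group AND T.newGroup IS NULL
            AND EXISTS (
                SELECT 1 FROM #table S
                WHERE S.[Group] = @Group   -- newGroup numbers restart at 1 for every group
                AND S.newGroup = @newGroup
                AND (S.Address = T.Address OR S.Phone = T.Phone OR S.Email = T.Email)
            )
        END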
        END
        SET @newGroup += 1
    END
    DELETE #Groups WHERE [Group] = @Group
END

SELECT * FROM #table
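
A result like this can be sanity-checked with a self-join. The sketch below is an assumption-laden illustration, not part of the answer: #result is a hypothetical table holding a regrouping result, and M4 is deliberately given the wrong newGroup so that the check has something to report.

```sql
-- Hypothetical regrouping result; M4 is deliberately assigned the wrong newGroup.
IF OBJECT_ID('tempdb..#result') IS NOT NULL
    DROP TABLE #result;
CREATE TABLE #result ([Group] varchar(3), Member varchar(3), Address varchar(3),
                      Phone varchar(3), Email varchar(3), newGroup int);

INSERT #result ([Group], Member, Address, Phone, Email, newGroup)
VALUES
('G1', 'M1',  'A1',  'P1',  'E1',  1),
('G1', 'M3',  'A1',  'P3',  'E1',  1),
('G1', 'M4',  'A4',  'P3',  'E4',  2),   -- shares P3 with M3, so it should be 1
('G1', 'M10', 'A10', 'P10', 'E10', 3);

-- List every pair of directly matching rows that ended up in different
-- new groups, or where either row was left unassigned.
SELECT a.[Group], a.Member AS MemberA, b.Member AS MemberB,
       a.newGroup AS NewGroupA, b.newGroup AS NewGroupB
FROM #result AS a
JOIN #result AS b
  ON  b.[Group] = a.[Group]
  AND b.Member > a.Member        -- report each pair only once
  AND (b.Address = a.Address OR b.Phone = a.Phone OR b.Email = a.Email)
WHERE a.newGroup <> b.newGroup
   OR a.newGroup IS NULL
   OR b.newGroup IS NULL;
```

With these rows the query should return only the M3/M4 pair. An empty result means no directly matching pair was split, and therefore no chain of matches was split either. It does not detect the opposite mistake, where unrelated rows were merged into one group.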