Is there a way to group accounts that match on any of several fields, and then merge groups that share an account?

Asked: 2019-10-14 15:58:31

Tags: sql ssms

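I need to create a new unique ID for companies that have multiple accounts with us. I have the company name, but I know that typos would cause some records to be missed, so I also want to use other identifying fields to group the data together.

I have hundreds of thousands of accounts, each of which has at least the following information: company name, phone, and email. If the company name differs between two accounts but the accounts match on any of the other fields, I want SQL to assign them the same unique ID (the difference is probably a typo). Furthermore, if another account matches account 2 on company name but on none of the other fields, I want it to receive the same unique ID as the first two (it is probably a duplicate entry).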
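To make the rule concrete, here is a minimal T-SQL sketch of the direct-match step only. The table variable @Accounts, its column sizes, and its three sample rows are hypothetical, chosen so that account 2 matches account 1 on phone and account 3 matches account 2 on company name only. The query lists pairs of accounts that agree on at least one field.

```sql
-- Hypothetical sample data: three accounts for what is probably one company.
DECLARE @Accounts TABLE
(
    ID           int          NOT NULL PRIMARY KEY,
    Company_Name varchar(100) NOT NULL,
    Phone        varchar(20)  NOT NULL,
    Email        varchar(100) NOT NULL
);

INSERT INTO @Accounts (ID, Company_Name, Phone, Email)
VALUES (1, 'Mcdonalds',  '111-111-1111', 'john@g.com'),
       (2, 'Mcdonallds', '111-111-1111', 'grant@g.com'),  -- name typo, same phone as 1
       (3, 'Mcdonallds', '333-333-3333', 'harry@g.com');  -- same name as 2, nothing else

-- Each result row is one direct match: two different accounts
-- that agree on at least one of the three fields.
SELECT a.ID AS Id, b.ID AS MatchedId
FROM @Accounts AS a
JOIN @Accounts AS b
  ON  a.ID < b.ID                          -- report each pair only once
  AND (   a.Company_Name = b.Company_Name
       OR a.Phone        = b.Phone
       OR a.Email        = b.Email)
ORDER BY a.ID, b.ID;
```

This returns the pairs (1, 2) and (2, 3). Accounts 1 and 3 share no field directly, so giving all three the same unique ID needs a second, transitive step that chains the pairs together.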

SQL query like GROUP BY with OR condition

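This is similar to what I am trying to do, except that I need to add a third field. However, after trying the suggested answer (below), I noticed that it only groups some of the companies together, depending on their ID numbers. This appears to be because of the B.LinkedId < A.Id condition it adds.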

WITH Nodes AS
(
    SELECT DENSE_RANK() OVER (ORDER BY Part, PartRank) SetId
        , [ID]
    FROM
    (
        SELECT [ID], 1 Part, DENSE_RANK() OVER (ORDER BY [E-mail]) PartRank
        FROM dbo.Customer
        UNION ALL
        SELECT [ID], 2, DENSE_RANK() OVER (ORDER BY Phone) PartRank
        FROM dbo.Customer
    ) A
),
Links AS
(
    SELECT DISTINCT A.Id, B.Id LinkedId
    FROM Nodes A
    JOIN Nodes B ON B.SetId = A.SetId AND B.Id < A.Id
),
Routes AS
(
    SELECT DISTINCT Id, Id LinkedId
    FROM dbo.Customer

    UNION ALL

    SELECT DISTINCT Id, LinkedId
    FROM Links

    UNION ALL

    SELECT A.Id, B.LinkedId
    FROM Links A
    JOIN Routes B ON B.Id = A.LinkedId AND B.LinkedId < A.Id
),
TransitiveClosure AS
(
    SELECT Id, Id LinkedId
    FROM Links

    UNION

    SELECT LinkedId Id, LinkedId
    FROM Links

    UNION

    SELECT Id, LinkedId
    FROM Routes
),
UniqueCustomers AS
(
    SELECT Id, MIN(LinkedId) UniqueCustomerId
    FROM TransitiveClosure
    GROUP BY Id
)
SELECT A.Id, A.[E-mail], A.Phone, B.UniqueCustomerId
FROM dbo.Customer A
JOIN UniqueCustomers B ON B.Id = A.Id

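I used the code below to show why the code above does not work as expected. ID is the unique identifier of each account, E-mail is one field to aggregate on, and Phone is another. Everything I commented out comes from my attempt to add Company_Name to the query, which is when I realized there was a problem. If I can get the grouping to work on phone and email, the next step is to make it work with company name as well.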

With Customers AS 
(
    Select 1 as [ID]
            ,'John@G.com' as [E-mail]
            ,'111-111-1111' as [Phone]
            --,'Mcdonaldss' as [Company_Name]
    union all 
    Select 0, 'Harry@g.com', '121-212-1212'--,'Mmcdonalds'
    union all
    Select 2, 'Grant@g.com', '111-111-1111'--, 'Mcdonallds'    
    union all
    Select 3, 'John@G.com', '222-222-2222'--, 'Mcdonnalds'    
    union all
    select 4, 'Harry@g.com', '222-222-2222'--, 'Mccdonalds'    
    union all 
    Select 5, 'Jack@g.com', '444-444-4444'--, 'Wendys'     
            --union all
            --Select 10, 'Sarah@g.com', '888-888-8888', 'Mcdonald'    
            --union all 
            --Select 9, 'Sarah@g.com', '999-999-9999', 'Mcdoonalds'     
            --union all
            --Select 8, 'Jessy@g.com', '999-999-9999', 'Mcds'     
            --Union all
            --Select 7, 'Jessy@g.com', '777-777-7777', 'Mcdanalds'     
            --Union all
            --Select 6, 'Mark@g.com', '777-777-7777', 'Mcdonolds'    
            --Union all
            --Select 11, 'Carol@g.com', '222-222-2222', 'Mcds'
),
 Nodes AS
(
    SELECT DENSE_RANK() OVER (ORDER BY Part, PartRank) SetId
        , [ID]
    FROM
    (
        SELECT c.[ID], '1 email' Part, DENSE_RANK() OVER (ORDER BY [E-mail]) PartRank
        FROM Customers as [c]
        UNION ALL
        SELECT c.[ID], '2 phone', DENSE_RANK() OVER (ORDER BY Phone) PartRank
        FROM Customers as [c]
        --union all
        --SELECT c.[ID], '3 Company_Name', DENSE_RANK() OVER (ORDER BY Company_Name) PartRank
        --FROM Customers as [c]
    ) A
),
Links AS
(
    SELECT DISTINCT A.Id, B.Id LinkedId
    FROM Nodes A
    JOIN Nodes B ON B.SetId = A.SetId AND B.Id < A.Id
)
--Select * from links
,
roads AS
(
    SELECT DISTINCT Id, Id LinkedId
    FROM Customers as [c]

    UNION ALL

    SELECT DISTINCT Id, LinkedId
    FROM links

    UNION ALL

    SELECT A.Id, B.LinkedId
    FROM Links A
    JOIN Roads B ON B.Id = A.LinkedId AND B.LinkedId < A.Id
)
--Select * from Roads
,
TransitiveClosure AS
(
    SELECT Id, Id LinkedId
    FROM Links

    UNION

    SELECT LinkedId Id, LinkedId
    FROM Links

    UNION

    SELECT Id, LinkedId
    FROM roads
)

--Select * from TransitiveClosure
,
UniqueCustomers AS
(
    SELECT Id, MIN(LinkedId) UniqueCustomerId
    FROM TransitiveClosure
    GROUP BY Id
)
SELECT A.Id, A.[E-mail], A.Phone, dense_rank() over (order by B.UniqueCustomerId) as [Company_no]
FROM Customers A
JOIN UniqueCustomers B ON B.Id = A.Id

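In this example, I would expect IDs 0, 1, 2, 3, and 4 to all have the same Company_no. However, because ID 0 is lower than ID 1, ID 1 is never linked to ID 0. That means anything linked to ID 1 is not linked to ID 0 either, and the result is split. ID 5 is my control: it should not be aggregated with anything. Once company name is added, all of the other commented-out accounts should also match Mcdonalds, but I should be able to solve that second part once the grouping on the two fields is accurate.

Desired output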

ID   E-mail        Phone          Unique ID
---- ------------- -------------- ----------------
0    Harry@g.com   121-212-1212   ─┐
1    John@G.com    111-111-1111    │
2    Grant@g.com   111-111-1111    ├─ 1 (Mcdonalds)
3    John@G.com    222-222-2222    │
4    Harry@g.com   222-222-2222   ─┘
---- ------------- -------------- ----------------
5    Jack@g.com    444-444-4444   ─── 2 (Wendys)

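Update: I managed to get the code to do what I want by removing the problematic filter mentioned above, raising the maximum recursion, and adding a limit on the number of iterations in the roads CTE, as shown below. This gives the desired output and works for all three fields. However, it is not a practical solution when going from 12 rows to 104,000: it turned 1,000 records into 1 million rows and did not even finish. Are there any techniques to prevent the pointless loops, or a different way to handle this grouping?

Current code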

With Customers AS 
(
    Select 1 as [ID]
            ,'John@G.com' as [E-mail]
            ,'111-111-1111' as [Phone]
            ,'Mcdonaldss' as [Company_Name]
    union all 
    Select 0, 'Harry@g.com', '121-212-1212','Mmcdonalds'
    union all
    Select 2, 'Grant@g.com', '111-111-1111', 'Mcdonallds'    
    union all
    Select 3, 'John@G.com', '222-222-2222', 'Mcdonnalds'    
    union all
    select 4, 'Harry@g.com', '222-222-2222', 'Mccdonalds'    
    union all 
    Select 5, 'Jack@g.com', '444-444-4444', 'Wendys'     
    union all
    Select 10, 'Sarah@g.com', '888-888-8888', 'Mcdonald'    
    union all  
    Select 9, 'Sarah@g.com', '999-999-9999', 'Mcdoonalds'     
    union all
    Select 8, 'Jessy@g.com', '999-999-9999', 'Mcds'     
    Union all
    Select 7, 'Jessy@g.com', '777-777-7777', 'Mcdanalds'     
    Union all
    Select 6, 'Mark@g.com', '777-777-7777', 'Mcdonolds'    
    Union all
    Select 11, 'Carol@g.com', '222-222-2222', 'Mcds'
    Union all
    Select 12, 'carol@g.com', '101-010-1010','Mcdooonalds'
),
 Nodes AS
(
    SELECT DENSE_RANK() OVER (ORDER BY Part, PartRank) SetId
        , [ID]
    FROM
    (
        SELECT c.[ID], '1 email' Part, DENSE_RANK() OVER (ORDER BY [E-mail]) PartRank
        FROM Customers as [c]
        UNION ALL
        SELECT c.[ID], '2 phone', DENSE_RANK() OVER (ORDER BY Phone) PartRank
        FROM Customers as [c]
        union all
        SELECT c.[ID], '3 Company_Name', DENSE_RANK() OVER (ORDER BY Company_Name) PartRank
        FROM Customers as [c]
    ) A
),
Links AS
(
    SELECT DISTINCT A.Id, B.Id LinkedId
    FROM Nodes A
    JOIN Nodes B ON B.SetId = A.SetId AND B.Id <> A.Id
)
--Select * from links
,
roads AS
(

    SELECT DISTINCT Id, Id LinkedId, 0 as [count]
    FROM Customers as [c]

    UNION ALL

    SELECT DISTINCT Id, LinkedId, 0
    FROM links

    UNION ALL

    SELECT
     A.Id, B.LinkedId, B.[count]+1
    FROM Links A
    JOIN Roads B ON B.Id = A.LinkedId AND B.LinkedId <> A.Id
    Where b.[count] <= 10

)
--Select * from Roads
,
TransitiveClosure AS
(
    SELECT Id, Id LinkedId
    FROM Links

    UNION

    SELECT LinkedId Id, LinkedId
    FROM Links

    UNION

    SELECT Id, LinkedId
    FROM roads
)

--Select * from TransitiveClosure
,
UniqueCustomers AS
(
    SELECT Id, MIN(LinkedId) UniqueCustomerId
    FROM TransitiveClosure
    GROUP BY Id
)
SELECT A.Id, A.[E-mail], A.Phone, A.company_Name, dense_rank() over (order by B.UniqueCustomerId) as [Company_no]
FROM Customers A
JOIN UniqueCustomers B ON B.Id = A.Id
option (maxrecursion 32767)
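
One possible way to avoid the row explosion, offered as a sketch that has not been benchmarked on 104,000 rows: replace the recursive roads CTE with a loop that pushes the smallest group ID through shared field values until nothing changes. The temp table names (#Customers, #Keys, #Groups), the single-letter Field codes, and the column sizes are assumptions for illustration; the sample rows are the first six accounts from the question.

```sql
-- Sample rows copied from the question (IDs 0 to 5 only, to keep this short).
CREATE TABLE #Customers
(
    ID           int          NOT NULL PRIMARY KEY,
    [E-mail]     varchar(100) NOT NULL,
    Phone        varchar(20)  NOT NULL,
    Company_Name varchar(100) NOT NULL
);

INSERT INTO #Customers (ID, [E-mail], Phone, Company_Name)
VALUES (0, 'Harry@g.com', '121-212-1212', 'Mmcdonalds'),
       (1, 'John@G.com',  '111-111-1111', 'Mcdonaldss'),
       (2, 'Grant@g.com', '111-111-1111', 'Mcdonallds'),
       (3, 'John@G.com',  '222-222-2222', 'Mcdonnalds'),
       (4, 'Harry@g.com', '222-222-2222', 'Mccdonalds'),
       (5, 'Jack@g.com',  '444-444-4444', 'Wendys');

-- One row per (account, field, value). Two accounts match directly
-- when they share a row with the same Field and Val.
CREATE TABLE #Keys
(
    ID    int          NOT NULL,
    Field char(1)      NOT NULL,   -- 'E' = e-mail, 'P' = phone, 'C' = company name
    Val   varchar(100) NOT NULL
);

INSERT INTO #Keys (ID, Field, Val)
SELECT ID, 'E', [E-mail]     FROM #Customers
UNION ALL
SELECT ID, 'P', Phone        FROM #Customers
UNION ALL
SELECT ID, 'C', Company_Name FROM #Customers;

CREATE CLUSTERED INDEX IX_Keys ON #Keys (Field, Val, ID);

-- Every account starts out as its own group.
CREATE TABLE #Groups
(
    ID      int NOT NULL PRIMARY KEY,
    GroupId int NOT NULL
);

INSERT INTO #Groups (ID, GroupId)
SELECT ID, ID FROM #Customers;

DECLARE @changed int = 1;

-- Repeat until no label changes. In each pass, every field value takes the
-- smallest GroupId among its accounts, and every account then takes the
-- smallest GroupId among its own field values.
WHILE @changed > 0
BEGIN
    UPDATE g
    SET    g.GroupId = m.MinGroupId
    FROM   #Groups AS g
    JOIN  (SELECT k.ID, MIN(v.MinGroupId) AS MinGroupId
           FROM   #Keys AS k
           JOIN  (SELECT k2.Field, k2.Val, MIN(g2.GroupId) AS MinGroupId
                  FROM   #Keys   AS k2
                  JOIN   #Groups AS g2 ON g2.ID = k2.ID
                  GROUP BY k2.Field, k2.Val) AS v
             ON   v.Field = k.Field AND v.Val = k.Val
           GROUP BY k.ID) AS m
      ON   m.ID = g.ID
    WHERE  m.MinGroupId < g.GroupId;

    SET @changed = @@ROWCOUNT;   -- rows relabelled in this pass
END;

SELECT c.ID, c.[E-mail], c.Phone, c.Company_Name,
       DENSE_RANK() OVER (ORDER BY g.GroupId) AS Company_no
FROM   #Customers AS c
JOIN   #Groups    AS g ON g.ID = c.ID
ORDER BY c.ID;

DROP TABLE #Groups, #Keys, #Customers;
```

With these six rows, IDs 0 to 4 end up with Company_no 1 and ID 5 with Company_no 2. Each pass aggregates per field value instead of joining every pair of accounts, so the intermediate row count stays proportional to the number of accounts times the number of fields, and the number of passes grows with the longest chain of matches inside a group. Blank or placeholder values (an empty phone number, for example) would merge unrelated accounts, so they would probably need to be filtered out of #Keys first.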

0 answers:

No answers