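I need to create a new unique ID for companies that will have multiple accounts with us. I have the company name, but I know some names contain typos, so grouping on the name alone would lose data. For those cases I have other identifying fields that I want to use to group the data.

I have hundreds of thousands of accounts, each with at least the following information: company name, phone, and email. If the company name differs between two accounts but the accounts match on any of the other fields, I want SQL to give them the same unique ID (the name is probably a typo). In addition, if another account matches account 2 on company name but on none of the other fields, I want it to receive the same unique ID as the first two (probably a duplicate entry error).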
SQL query like GROUP BY with OR condition
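That question is similar to what I am trying to do, except that I need to add a third field. However, after trying the suggested answer (below), I noticed that it only groups some of the companies together, depending on their ID numbers. This seems to be because of the B.LinkedId < A.Id filter it adds.

The code below is what I used to show why the code above does not work as expected. ID is the unique identifier of each account, E-mail is one field to aggregate on, and Phone is another. Everything I commented out is my attempt to add Company_Name to the query, which is when I realized there was a problem. If I can get the grouping to work on phone and email, the next step is to make it work with the company name as well.

In this example I want IDs 0, 1, 2, 3, and 4 to all have the same Company_no. However, because ID 0 is lower than ID 1, ID 1 is never linked to ID 0. That means anything linked to ID 1 is not linked to ID 0 either, and the result is split. ID 5 is my control: it should not be aggregated with anything.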
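To make the split concrete, here is a minimal sketch (not part of the original queries) that lists the links a higher-ID-to-lower-ID filter produces for the six sample rows. The temp tables #Sample and #Links are hypothetical names, and the OR join is a simplified stand-in for the Nodes and Links CTEs.

```sql
-- Hypothetical temp table holding the six sample rows from the question.
CREATE TABLE #Sample ([ID] int, [E-mail] varchar(50), Phone varchar(20));

INSERT INTO #Sample ([ID], [E-mail], Phone)
VALUES (0, 'Harry@g.com', '121-212-1212'),
       (1, 'John@G.com',  '111-111-1111'),
       (2, 'Grant@g.com', '111-111-1111'),
       (3, 'John@G.com',  '222-222-2222'),
       (4, 'Harry@g.com', '222-222-2222'),
       (5, 'Jack@g.com',  '444-444-4444');

-- Link two accounts when they share an e-mail or a phone number,
-- but only from the higher ID to the lower ID (the filter in question).
SELECT DISTINCT A.[ID], B.[ID] AS LinkedId
INTO #Links
FROM #Sample A
JOIN #Sample B
  ON (B.[E-mail] = A.[E-mail] OR B.Phone = A.Phone)
 AND B.[ID] < A.[ID];

SELECT [ID], LinkedId FROM #Links ORDER BY [ID], LinkedId;
```

For this data the links are 2 to 1, 3 to 1, 4 to 0, and 4 to 3. ID 1 has no outgoing link at all, so following links downward from ID 1 never reaches ID 0, even though both can be reached from ID 4.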
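Once the company name is added, all of the other commented-out accounts will also match Mcdonalds, but I should be able to solve that second part once the join on the two fields works correctly.

The desired output is the table that follows the test query below.

Update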
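I managed to get the code to do what I want by removing the problematic filter mentioned above, increasing the maximum recursion, and adding a limit on the number of loops in the roads CTE, as shown in the current code (the last query below). This gives the desired output and works across all three fields. However, when going from 12 rows to 104,000, it is not a practical solution: it turned 1,000 records into 1 million records and did not even finish. Is there any trick to prevent pointless loops, or a different way to handle this grouping?

Suggested answer from the linked question:
WITH Nodes AS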
(
SELECT DENSE_RANK() OVER (ORDER BY Part, PartRank) SetId
, [ID]
FROM
(
SELECT [ID], 1 Part, DENSE_RANK() OVER (ORDER BY [E-mail]) PartRank
FROM dbo.Customer
UNION ALL
SELECT [ID], 2, DENSE_RANK() OVER (ORDER BY Phone) PartRank
FROM dbo.Customer
) A
),
Links AS
(
SELECT DISTINCT A.Id, B.Id LinkedId
FROM Nodes A
JOIN Nodes B ON B.SetId = A.SetId AND B.Id < A.Id
),
Routes AS
(
SELECT DISTINCT Id, Id LinkedId
FROM dbo.Customer
UNION ALL
SELECT DISTINCT Id, LinkedId
FROM Links
UNION ALL
SELECT A.Id, B.LinkedId
FROM Links A
JOIN Routes B ON B.Id = A.LinkedId AND B.LinkedId < A.Id
),
TransitiveClosure AS
(
SELECT Id, Id LinkedId
FROM Links
UNION
SELECT LinkedId Id, LinkedId
FROM Links
UNION
SELECT Id, LinkedId
FROM Routes
),
UniqueCustomers AS
(
SELECT Id, MIN(LinkedId) UniqueCustomerId
FROM TransitiveClosure
GROUP BY Id
)
SELECT A.Id, A.[E-mail], A.Phone, B.UniqueCustomerId
FROM dbo.Customer A
JOIN UniqueCustomers B ON B.Id = A.Id

Test query showing the problem (E-mail and Phone only):
With Customers AS
(
Select 1 as [ID]
,'John@G.com' as [E-mail]
,'111-111-1111' as [Phone]
--,'Mcdonaldss' as [Company_Name]
union all
Select 0, 'Harry@g.com', '121-212-1212'--,'Mmcdonalds'
union all
Select 2, 'Grant@g.com', '111-111-1111'--, 'Mcdonallds'
union all
Select 3, 'John@G.com', '222-222-2222'--, 'Mcdonnalds'
union all
select 4, 'Harry@g.com', '222-222-2222'--, 'Mccdonalds'
union all
Select 5, 'Jack@g.com', '444-444-4444'--, 'Wendys'
--union all
--Select 10, 'Sarah@g.com', '888-888-8888', 'Mcdonald'
--union all
--Select 9, 'Sarah@g.com', '999-999-9999', 'Mcdoonalds'
--union all
--Select 8, 'Jessy@g.com', '999-999-9999', 'Mcds'
--Union all
--Select 7, 'Jessy@g.com', '777-777-7777', 'Mcdanalds'
--Union all
--Select 6, 7, '777-777-7777', 'Mcdonolds'
--Union all
--Select 11, 8, '222-222-2222', 'Mcds'
),
Nodes AS
(
SELECT DENSE_RANK() OVER (ORDER BY Part, PartRank) SetId
, [ID]
FROM
(
SELECT c.[ID], '1 email' Part, DENSE_RANK() OVER (ORDER BY [E-mail]) PartRank
FROM Customers as [c]
UNION ALL
SELECT c.[ID], '2 phone', DENSE_RANK() OVER (ORDER BY Phone) PartRank
FROM Customers as [c]
--union all
--SELECT c.[ID], '3 Company_Name', DENSE_RANK() OVER (ORDER BY Next_level) PartRank
-- FROM #Customer as [c]
) A
),
Links AS
(
SELECT DISTINCT A.Id, B.Id LinkedId
FROM Nodes A
JOIN Nodes B ON B.SetId = A.SetId AND B.Id < A.Id
)
--Select * from links
,
roads AS
(
SELECT DISTINCT Id, Id LinkedId
FROM Customers as [c]
UNION ALL
SELECT DISTINCT Id, LinkedId
FROM links
UNION ALL
SELECT A.Id, B.LinkedId
FROM Links A
JOIN Roads B ON B.Id = A.LinkedId AND B.LinkedId < A.Id
)
--Select * from Roads
,
TransitiveClosure AS
(
SELECT Id, Id LinkedId
FROM Links
UNION
SELECT LinkedId Id, LinkedId
FROM Links
UNION
SELECT Id, LinkedId
FROM roads
)
--Select * from TransitiveClosure
,
UniqueCustomers AS
(
SELECT Id, MIN(LinkedId) UniqueCustomerId
FROM TransitiveClosure
GROUP BY Id
)
SELECT A.Id, A.[E-mail], A.Phone, dense_rank() over (order by B.UniqueCustomerId) as [Company_no]
FROM Customers A
JOIN UniqueCustomers B ON B.Id = A.Id

Desired output:
ID E-mail Phone Unique ID
---- ------------------- -------------- ------------------------------
0 Harry@g.com 121-212-1212 ─┐
1 John@G.com 111-111-1111 |
2 Grant@g.com 111-111-1111 ├─ 1 (Mcdonalds)
3 John@G.com 222-222-2222 |
4 Harry@g.com 222-222-2222 ─┘
---- ------------------- -------------- ------------------------------
5 Jack@g.com 444-444-4444 ─── 2 (Wendys)

Current code (all three fields):
With Customers AS
(
Select 1 as [ID]
,'John@G.com' as [E-mail]
,'111-111-1111' as [Phone]
,'Mcdonaldss' as [Company_Name]
union all
Select 0, 'Harry@g.com', '121-212-1212','Mmcdonalds'
union all
Select 2, 'Grant@g.com', '111-111-1111', 'Mcdonallds'
union all
Select 3, 'John@G.com', '222-222-2222', 'Mcdonnalds'
union all
select 4, 'Harry@g.com', '222-222-2222', 'Mccdonalds'
union all
Select 5, 'Jack@g.com', '444-444-4444', 'Wendys'
union all
Select 10, 'Sarah@g.com', '888-888-8888', 'Mcdonald'
union all
Select 9, 'Sarah@g.com', '999-999-9999', 'Mcdoonalds'
union all
Select 8, 'Jessy@g.com', '999-999-9999', 'Mcds'
Union all
Select 7, 'Jessy@g.com', '777-777-7777', 'Mcdanalds'
Union all
Select 6, 'Mark@g.com', '777-777-7777', 'Mcdonolds'
Union all
Select 11, 'Carol@g.com', '222-222-2222', 'Mcds'
Union all
Select 12, 'carol@g.com', '101-010-1010','Mcdooonalds'
),
Nodes AS
(
SELECT DENSE_RANK() OVER (ORDER BY Part, PartRank) SetId
, [ID]
FROM
(
SELECT c.[ID], '1 email' Part, DENSE_RANK() OVER (ORDER BY [E-mail]) PartRank
FROM Customers as [c]
UNION ALL
SELECT c.[ID], '2 phone', DENSE_RANK() OVER (ORDER BY Phone) PartRank
FROM Customers as [c]
union all
SELECT c.[ID], '3 Company_Name', DENSE_RANK() OVER (ORDER BY Company_Name) PartRank
FROM Customers as [c]
) A
),
Links AS
(
SELECT DISTINCT A.Id, B.Id LinkedId
FROM Nodes A
JOIN Nodes B ON B.SetId = A.SetId AND B.Id <> A.Id
)
--Select * from links
,
roads AS
(
SELECT DISTINCT Id, Id LinkedId, 0 as [count]
FROM Customers as [c]
UNION ALL
SELECT DISTINCT Id, LinkedId, 0
FROM links
UNION ALL
SELECT
A.Id, B.LinkedId, B.[count]+1
FROM Links A
JOIN Roads B ON B.Id = A.LinkedId AND B.LinkedId <> A.Id
Where b.[count] <= 10
)
--Select * from Roads
,
TransitiveClosure AS
(
SELECT Id, Id LinkedId
FROM Links
UNION
SELECT LinkedId Id, LinkedId
FROM Links
UNION
SELECT Id, LinkedId
FROM roads
)
--Select * from TransitiveClosure
,
UniqueCustomers AS
(
SELECT Id, MIN(LinkedId) UniqueCustomerId
FROM TransitiveClosure
GROUP BY Id
)
SELECT A.Id, A.[E-mail], A.Phone, A.company_Name, dense_rank() over (order by B.UniqueCustomerId) as [Company_no]
FROM Customers A
JOIN UniqueCustomers B ON B.Id = A.Id
option (maxrecursion 32767)
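
One way to handle the grouping differently, sketched here under assumptions and not measured against the 104,000-row table: treat the accounts as a graph and push the smallest ID through shared values with repeated set-based updates, instead of enumerating paths in a recursive CTE. The temp tables #Acct and #Keys, the GroupId column, and the LOWER() normalization of the e-mail are all hypothetical choices made for this illustration.

```sql
-- Hypothetical working table: GroupId starts as the account's own ID.
CREATE TABLE #Acct
(
    ID           int          NOT NULL PRIMARY KEY,
    Email        varchar(100) NOT NULL,
    Phone        varchar(20)  NOT NULL,
    Company_Name varchar(100) NOT NULL,
    GroupId      int          NOT NULL
);

INSERT INTO #Acct (ID, Email, Phone, Company_Name, GroupId)
VALUES (0,  'Harry@g.com', '121-212-1212', 'Mmcdonalds',  0),
       (1,  'John@G.com',  '111-111-1111', 'Mcdonaldss',  1),
       (2,  'Grant@g.com', '111-111-1111', 'Mcdonallds',  2),
       (3,  'John@G.com',  '222-222-2222', 'Mcdonnalds',  3),
       (4,  'Harry@g.com', '222-222-2222', 'Mccdonalds',  4),
       (5,  'Jack@g.com',  '444-444-4444', 'Wendys',      5),
       (6,  'Mark@g.com',  '777-777-7777', 'Mcdonolds',   6),
       (7,  'Jessy@g.com', '777-777-7777', 'Mcdanalds',   7),
       (8,  'Jessy@g.com', '999-999-9999', 'Mcds',        8),
       (9,  'Sarah@g.com', '999-999-9999', 'Mcdoonalds',  9),
       (10, 'Sarah@g.com', '888-888-8888', 'Mcdonald',    10),
       (11, 'Carol@g.com', '222-222-2222', 'Mcds',        11),
       (12, 'carol@g.com', '101-010-1010', 'Mcdooonalds', 12);

-- One row per (account, field, value). The e-mail is lower-cased so the
-- match does not depend on the collation (an assumption about intent).
SELECT ID, 'E' AS KeyType, LOWER(Email) AS KeyValue
INTO #Keys
FROM #Acct
UNION ALL
SELECT ID, 'P', Phone FROM #Acct
UNION ALL
SELECT ID, 'C', Company_Name FROM #Acct;

DECLARE @changed int = 1;

-- Each pass gives every account the smallest GroupId found among the
-- accounts that share at least one value with it; stop when nothing changes.
WHILE @changed > 0
BEGIN
    UPDATE A
    SET GroupId = N.NewGroupId
    FROM #Acct A
    JOIN
    (
        SELECT K.ID, MIN(M.MinGroupId) AS NewGroupId
        FROM #Keys K
        JOIN
        (
            -- Smallest current label among all accounts sharing a value.
            SELECT K2.KeyType, K2.KeyValue, MIN(A2.GroupId) AS MinGroupId
            FROM #Keys K2
            JOIN #Acct A2 ON A2.ID = K2.ID
            GROUP BY K2.KeyType, K2.KeyValue
        ) M ON M.KeyType = K.KeyType AND M.KeyValue = K.KeyValue
        GROUP BY K.ID
    ) N ON N.ID = A.ID
    WHERE N.NewGroupId < A.GroupId;

    SET @changed = @@ROWCOUNT;
END;

SELECT ID, Email, Phone, Company_Name,
       DENSE_RANK() OVER (ORDER BY GroupId) AS Company_no
FROM #Acct
ORDER BY ID;
```

The working set stays at one row per account plus three key rows per account, so it cannot grow the way the recursive path enumeration does. The number of passes depends on the longest chain of links, not on the number of paths. An index on #Keys (KeyType, KeyValue) may help at larger volumes, but that is untested here.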