I need to find sets of rows that match on one or more fields.
E.g.:
VendorMaster
VendorId | VendorName | Phone | Address | Fax
---------------------------------------------
1        | AAAA       | 10101 | Street1 | 111
2        | BBBB       | 20202 | Street2 | 222
3        | CCCC       | 30303 | Street3 | 333
4        | DDDD       | 40404 | Street2 | 444
5        | FFFF       | 50505 | Street5 | 555
6        | GGGG       | 60606 | Street6 | 444
7        | HHHH       | 10101 | Street6 | 777
SELECT VendorId FROM VendorMaster vm2
WHERE EXISTS
( SELECT 1 FROM VendorMaster vm1
WHERE vm1.VendorId <> vm2.VendorId
AND (vm1.Phone = vm2.Phone OR vm1.Address = vm2.Address OR vm1.Fax = vm2.Fax)
)
The query above returns the matching records, but my requirement is to assign a set ID to each group of matching records.
Like this:
SetId | VendorId
----------------
1000  | 1
1000  | 7    // 1 and 7: phone numbers match
1001  | 2
1001  | 4    // 2 and 4: addresses match
1001  | 6    // 4 and 6: fax numbers match
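Please tell me how to write a query that assigns set IDs to the matching sets. Query performance is also critical, since there are around 100,000 records.
Thanks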
Answer 0 (score: 1)
I believe this gives you the result you want. There is a bit of explanation in the comments; let me know if you need more.
with relations
--Get all single relationships between vendors.
as (
select t1.vendorId firstId,
t2.vendorId secondId
from VendorMaster t1
inner join VendorMaster t2 on t1.vendorId < t2.vendorId and(
t1.Phone = t2.Phone
or t1.address = t2.address
or t1.Fax = t2.Fax
)
),
recurseLinks
--Recurse the relationships
as (
--tree is varchar(max): a bare VARCHAR in CAST is limited to 30 characters.
select r.*, CAST(',' + CAST(r.firstId AS VARCHAR(20)) + ',' AS VARCHAR(MAX)) tree
from relations r
union all
select r.firstId,
l.secondId,
cast(r.Tree + CAST(l.secondId AS varchar(20)) + ',' as varchar(max))
from relations l
--The leading comma in the pattern stops id 1 from matching inside ',11,'.
inner join recurseLinks r on r.secondId = l.firstId and r.tree not like '%,' + cast(l.secondId as varchar(20)) + ',%'
union all
select r.firstId,
l.firstId,
cast(r.Tree + CAST(l.firstId AS varchar(20)) + ',' as varchar(max))
from relations l
inner join recurseLinks r on r.secondId = l.secondId and r.tree not like '%,' + cast(l.firstId as varchar(20)) + ',%'
),
removeInvalid
--Removed invalid relationships.
as (
select l1.firstId, l1.secondId
from recurseLinks l1
where l1.firstId < l1.secondId
),
removeIntermediate
--Removed intermediate relationships.
as (
select distinct l1.*
from removeInvalid l1
left join removeInvalid l2 on l2.secondId = l1.firstId
where l2.firstId is null
)
select result.secondId,
dense_rank() over(order by result.firstId) SetId
from (
select firstId,
secondId
from removeIntermediate
union all
select distinct firstId,
firstId
from removeIntermediate
) result;
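The 'relations' named result set returns all VendorMaster relationships that share a common phone, address, or fax. It also returns only [A,B]; it does not return the reverse relationship [B,A].
The 'recurseLinks' named result set is a little more complex. It recursively joins all rows that are related to each other. The tree column keeps track of the lineage so the recursion does not get stuck in an infinite loop. The first query of the union selects all relationships from the 'relations' named result set. The second query of the union selects all forward recursive relationships, so given [A,B], [B,C], and [C,D], then [A,C], [A,D], and [B,D] are added to the result set. The third query of the union selects all non-forward recursive relationships, so given [A,D], [C,D], and [B,C], then [A,C], [A,B], and [B,D] are added to the result set.
The 'removeInvalid' named result set removes any invalid intermediate relationships added by the recursive query, for example [B,A], because we already have [A,B]. Note that this could be prevented inside the 'recurseLinks' result set with some effort.
The 'removeIntermediate' named result set removes any intermediate relationships. So, given [A,B], [B,C], [C,D], [A,C], and [A,D], it removes [B,C] and [C,D].
The final query selects the current results and adds the self-relationship. So, given [A,B], [A,C], and [A,D], it adds [A,A]. What comes out is the final result set.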
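Since the question stresses performance on roughly 100,000 rows, one possible variation of the 'relations' step is to write the OR condition as a UNION of three equality joins. This is only a sketch under assumptions, not part of the answer above and not measured: the table variable, its column sizes, and the reduced sample rows are made up for illustration, and any benefit on a real table would depend on indexes on Phone, Address, and Fax.

```sql
-- Hypothetical sample table; names and column sizes are assumptions.
DECLARE @VendorMaster TABLE (
    VendorId INT PRIMARY KEY,
    Phone    VARCHAR(20),
    Address  VARCHAR(20),
    Fax      VARCHAR(20)
);

INSERT INTO @VendorMaster (VendorId, Phone, Address, Fax)
VALUES (1, '10101', 'Street1', '111'),
       (2, '20202', 'Street2', '222'),
       (4, '40404', 'Street2', '444'),
       (6, '60606', 'Street6', '444');

-- Same pairs as the OR join, built from three equality joins.
-- UNION (not UNION ALL) removes pairs that match on more than one column.
SELECT t1.VendorId AS firstId, t2.VendorId AS secondId
FROM @VendorMaster t1
JOIN @VendorMaster t2 ON t1.Phone = t2.Phone AND t1.VendorId < t2.VendorId
UNION
SELECT t1.VendorId, t2.VendorId
FROM @VendorMaster t1
JOIN @VendorMaster t2 ON t1.Address = t2.Address AND t1.VendorId < t2.VendorId
UNION
SELECT t1.VendorId, t2.VendorId
FROM @VendorMaster t1
JOIN @VendorMaster t2 ON t1.Fax = t2.Fax AND t1.VendorId < t2.VendorId;
```

On these four rows the query should return the pairs (2, 4) and (4, 6). Each branch is a plain equality join, which the optimizer can usually satisfy with a hash join or an index seek, whereas a single join with OR across three columns often cannot.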
Answer 1 (score: 0)
You can use the built-in ranking functions to do this. For example, for unique Address values:
DECLARE @VendorMaster TABLE ( VendorID INT, Vendorname VARCHAR(20), Phone VARCHAR(20), Address VARCHAR(20), Fax VARCHAR(20) )
INSERT INTO @VendorMaster
(VendorID, Vendorname, Phone, Address, Fax )
VALUES
(1, 'AAAA', '10101', 'Street1', '111'),
(2, 'BBBB', '20202', 'Street2', '222'),
(3, 'CCCC', '30303', 'Street3', '333'),
(4, 'DDDD', '40404', 'Street2', '444'),
(5, 'FFFF', '50505', 'Street5', '555'),
(6, 'GGGG', '60606', 'Street6', '444'),
(7, 'HHHH', '10101', 'Street6', '777')
SELECT
DenseRank = DENSE_RANK() OVER ( ORDER BY Address )
,RowNumber = ROW_NUMBER() OVER ( ORDER BY VendorID )
,* FROM @VendorMaster
Result:
DenseRank RowNumber VendorID Vendorname Phone Address Fax
1 1 1 AAAA 10101 Street1 111
2 2 2 BBBB 20202 Street2 222
3 3 3 CCCC 30303 Street3 333
2 4 4 DDDD 40404 Street2 444
4 5 5 FFFF 50505 Street5 555
5 6 6 GGGG 60606 Street6 444
5 7 7 HHHH 10101 Street6 777
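If you need to persist these SetId values, you can create a separate table with an identity column to keep track of the value associated with each SetId. It sounds like you may simply want to normalize your database and break the data elements out into their own tables, linked through identity-column relationships.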
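A minimal sketch of what such a lookup table could look like, assuming sets are defined by Address alone as in the DENSE_RANK example. The table name @AddressSet, the seed value 1000, and the sample rows are hypothetical; table variables are used only to keep the sketch self-contained.

```sql
-- Hypothetical lookup table: one row per distinct address, SetId from IDENTITY.
DECLARE @AddressSet TABLE (
    SetId   INT IDENTITY(1000, 1) PRIMARY KEY,
    Address VARCHAR(20) NOT NULL UNIQUE
);

-- Reduced stand-in for the vendor table (assumption for illustration).
DECLARE @VendorMaster TABLE (VendorID INT, Address VARCHAR(20));

INSERT INTO @VendorMaster (VendorID, Address)
VALUES (2, 'Street2'), (3, 'Street3'), (4, 'Street2');

-- Register only addresses not seen before; existing rows keep their SetId.
INSERT INTO @AddressSet (Address)
SELECT DISTINCT vm.Address
FROM @VendorMaster vm
WHERE NOT EXISTS (SELECT 1 FROM @AddressSet s WHERE s.Address = vm.Address);

-- Vendors sharing an address end up with the same persisted SetId.
SELECT s.SetId, vm.VendorID
FROM @VendorMaster vm
JOIN @AddressSet s ON s.Address = vm.Address
ORDER BY s.SetId, vm.VendorID;
```

Unlike DENSE_RANK, which renumbers on every run, the identity values stay stable as new addresses are added. Note that this groups on a single column only; it does not chain matches across Phone, Address, and Fax.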
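Answer 2 (score: 0)
Although Will's answer is very clever, I have never been fond of recursive CTEs: they always work fine on small sets but become very slow on larger ones, and sometimes they hit the MAXRECURSION limit.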
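For reference, the MAXRECURSION limit mentioned here defaults to 100 recursion levels in SQL Server and can be changed per statement with a query hint. The following is a small illustration on a made-up counting CTE (the name numbers is hypothetical), not on the vendor query itself:

```sql
-- Counts from 1 to 500, which needs more than the default 100 recursion levels.
WITH numbers AS (
    SELECT 1 AS n
    UNION ALL
    SELECT n + 1 FROM numbers WHERE n < 500
)
SELECT COUNT(*) AS row_count
FROM numbers
OPTION (MAXRECURSION 1000);  -- 0 would remove the limit; the maximum explicit value is 32767
```

Without the OPTION clause this statement is terminated once the default limit is exhausted. Raising the limit only avoids that error; it does nothing about the slowdown on larger inputs.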
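Personally, I would try to solve this by first putting each VendorID in its own SetID and then merging the higher SetIDs into the lower SetIDs that contain a matching vendor.
It would look something like this: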
-- create test-code
IF OBJECT_ID('VendorMaster') IS NOT NULL DROP TABLE VendorMaster
GO
CREATE TABLE VendorMaster
([VendorID] int IDENTITY(1,1) PRIMARY KEY, [Vendorname] nvarchar(100), [Phone] nvarchar(100) , [Address] nvarchar(100), [Fax] nvarchar(100))
;
INSERT INTO VendorMaster
([Vendorname], [Phone], [Address], [Fax])
VALUES
('AAAA', '10101', 'Street1', '111'),
('BBBB', '20202', 'Street20', '222'),
('CCCC', '30303', 'Street3', '333'),
('DDDD', '40404', 'Street2', '444'),
('FFFF', '50505', 'Street5', '555'),
('GGGG', '60606', 'Street6', '444'),
('HHHH', '10101', 'Street6', '777'),
('IIII', '80808', 'Street20', '888'),
('JJJJ', '90909', 'Street9', '888');
GO
-- create sets and start shifting & merging
DECLARE @rowcount int
SELECT SetID = 1000 + ROW_NUMBER() OVER (ORDER BY VendorID),
VendorID
INTO #result
FROM VendorMaster
SELECT @rowcount = @@ROWCOUNT
CREATE UNIQUE CLUSTERED INDEX uq0 ON #result (VendorID)
WHILE @rowcount > 0
BEGIN
-- find lowest SetID that has a match with current record
;WITH shifting
AS (SELECT newSetID = Min(n.SetID),
oldSetID = o.SetID
FROM #result o
JOIN #result n
ON n.SetID < o.SetID
JOIN VendorMaster vo
ON vo.VendorID = o.VendorID
JOIN VendorMaster vn
ON vn.VendorID = n.VendorID
WHERE vn.Vendorname = vo.Vendorname
OR vn.Phone = vo.Phone
OR vn.Address = vo.Address
OR vn.Fax = vo.Fax
GROUP BY o.SetID)
UPDATE #result
SET SetID = s.newSetID
FROM #result upd
JOIN shifting s
ON s.oldSetID = upd.SetID
AND s.newSetID < upd.SetID
SELECT @rowcount = @@ROWCOUNT
END
-- delete 'single-member-sets' for consistency in compare with CTE of Will
DELETE #result
FROM #result del
WHERE NOT EXISTS ( SELECT *
FROM #result xx
WHERE xx.SetID = del.SetID
AND xx.VendorID <> del.VendorID)
-- fix 'holes'
UPDATE #result
SET SetID = 1 + (SELECT COUNT(DISTINCT SetID)
FROM #result xx
WHERE xx.SetID < upd.SetID)
FROM #result upd
-- show result
SELECT * FROM #result ORDER BY SetID, VendorID
When I run this on the provided test case, I get the same results as the CTE, but it takes longer.
Things get interesting when I add some extra test data.
DECLARE @counter int = 7
WHILE @counter > 0
BEGIN
INSERT VendorMaster ([Vendorname], [Phone], [Address], [Fax])
SELECT [Vendorname] = NewID(),
[Phone] = ABS(BINARY_CHECKSUM(NewID())) % 1500,
[Address] = NewID(),
[Fax] = NewID()
FROM VendorMaster
SELECT @counter = @counter - 1
END
SELECT COUNT(*) FROM VendorMaster
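This gives me 1152 test records that contain the matches we already had, plus a number of matching phone numbers (the NewID() values will never match), which makes the result easier to verify.
When I run the query above on this data, I get 604 sets in under 2 seconds. However, when I run the CTE on it,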