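Hopefully I'm just missing a simple solution.
I have two tables. One contains a list of companies. The second contains a list of publishers. The mapping between the two is many-to-many. What I would like to do is bundle or group all of the companies in table A that have any relationship to a publisher in table B, and vice versa.
The final result would look something like this (GROUPID is the key field). Rows 1 and 2 are in the same group because they share the same company. Row 3 is in the same group because publisher Y was already mapped to company A. Row 4 is in the group because company B was already mapped into group 1 through publisher Y.
Simply put, any time there is any kind of shared relationship between a company and a publisher, that pair should be assigned to the same group.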
ROW GROUPID Company Publisher
1 1 A Y
2 1 A X
3 1 B Y
4 1 B Z
5 2 C W
6 2 C P
7 2 D W
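Update
My bounty version: given a table like the one above containing just Company and Publisher pairs, populate the GROUPID field. Think of it as creating a Family ID that encompasses all related parents/children.
SQL Server 2012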
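For anyone who wants to try the answers below, here is a minimal setup sketch. It assumes the table is called Table1 and has a nullable integer GroupID column, as the answers do; the column types and the primary key are assumptions and are not stated in the question.

```sql
-- Assumed schema: names follow the answers below, types are a guess.
CREATE TABLE Table1 (
    GroupID   int        NULL,      -- to be filled in by the grouping logic
    Company   varchar(2) NOT NULL,
    Publisher varchar(2) NOT NULL,
    PRIMARY KEY (Company, Publisher)
);

-- The seven sample pairs from the question.
INSERT INTO Table1 (Company, Publisher) VALUES
    ('A', 'Y'), ('A', 'X'), ('B', 'Y'), ('B', 'Z'),
    ('C', 'W'), ('C', 'P'), ('D', 'W');
```

After a correct grouping, the first four pairs should share one GroupID and the last three another.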
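Answer 0 (score: 10)
I thought about using a recursive CTE, but as far as I know SQL Server does not allow UNION to connect the anchor member and the recursive member of a recursive CTE (I think this is possible in PostgreSQL), so there is no way to eliminate duplicates.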
declare @i int
;with cte as (
select
GroupID,
row_number() over(order by Company) as rn
from Table1
)
update cte set GroupID = rn
select @i = @@rowcount
-- while some rows updated
while @i > 0
begin
update T1 set
GroupID = T2.GroupID
from Table1 as T1
inner join (
select T2.Company, min(T2.GroupID) as GroupID
from Table1 as T2
group by T2.Company
) as T2 on T2.Company = T1.Company
where T1.GroupID > T2.GroupID
select @i = @@rowcount
update T1 set
GroupID = T2.GroupID
from Table1 as T1
inner join (
select T2.Publisher, min(T2.GroupID) as GroupID
from Table1 as T2
group by T2.Publisher
) as T2 on T2.Publisher = T1.Publisher
where T1.GroupID > T2.GroupID
-- will be > 0 if any rows updated
select @i = @i + @@rowcount
end
;with cte as (
select
GroupID,
dense_rank() over(order by GroupID) as rn
from Table1
)
update cte set GroupID = rn
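sql fiddle demo
I also tried a breadth-first search algorithm. I thought it could be faster (it is better in terms of complexity), so I'll provide that solution here. I found that it is not faster than the SQL approach: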
declare @Company nvarchar(2), @Publisher nvarchar(2), @GroupID int
declare @Queue table (
Company nvarchar(2), Publisher nvarchar(2), ID int identity(1, 1),
primary key(Company, Publisher)
)
select @GroupID = 0
while 1 = 1
begin
select top 1 @Company = Company, @Publisher = Publisher
from Table1
where GroupID is null
if @@rowcount = 0 break
select @GroupID = @GroupID + 1
insert into @Queue(Company, Publisher)
select @Company, @Publisher
while 1 = 1
begin
select top 1 @Company = Company, @Publisher = Publisher
from @Queue
order by ID asc
if @@rowcount = 0 break
update Table1 set
GroupID = @GroupID
where Company = @Company and Publisher = @Publisher
delete from @Queue where Company = @Company and Publisher = @Publisher
;with cte as (
select Company, Publisher from Table1 where Company = @Company and GroupID is null
union all
select Company, Publisher from Table1 where Publisher = @Publisher and GroupID is null
)
insert into @Queue(Company, Publisher)
select distinct c.Company, c.Publisher
from cte as c
where not exists (select * from @Queue as q where q.Company = c.Company and q.Publisher = c.Publisher)
end
end
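sql fiddle demo
I tested my versions and Gordon Linoff's to see how they perform. The CTE looks much worse; I was not patient enough to wait for it to finish on more than 1000 rows.
Here is the sql fiddle demo with random data. My results were:
128 rows:
My RBAR solution: 190ms
My SQL solution: 27ms
Gordon Linoff's solution: 958ms
256 rows:
My RBAR solution: 560ms
My SQL solution: 1226ms
Gordon Linoff's solution: 45371ms
This is random data, so the results may not be very consistent. I think the timings could be changed by indexes, but I don't think that would change the overall picture.
Old version - using a temporary table, just calculating GroupID without touching the initial table: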
declare @i int
-- creating table to gather all possible GroupID for each row
create table #Temp
(
Company varchar(1), Publisher varchar(1), GroupID varchar(1),
primary key (Company, Publisher, GroupID)
)
-- initializing it with data
insert into #Temp (Company, Publisher, GroupID)
select Company, Publisher, Company
from Table1
select @i = @@rowcount
-- while some rows inserted into #Temp
while @i > 0
begin
-- expand #Temp in both directions
;with cte as (
select
T2.Company, T1.Publisher,
T1.GroupID as GroupID1, T2.GroupID as GroupID2
from #Temp as T1
inner join #Temp as T2 on T2.Company = T1.Company
union
select
T1.Company, T2.Publisher,
T1.GroupID as GroupID1, T2.GroupID as GroupID2
from #Temp as T1
inner join #Temp as T2 on T2.Publisher = T1.Publisher
), cte2 as (
select
Company, Publisher,
case when GroupID1 < GroupID2 then GroupID1 else GroupID2 end as GroupID
from cte
)
insert into #Temp
select Company, Publisher, GroupID
from cte2
-- don't insert duplicates
except
select Company, Publisher, GroupID
from #Temp
-- will be > 0 if any row inserted
select @i = @@rowcount
end
select
Company, Publisher,
dense_rank() over(order by min(GroupID)) as GroupID
from #Temp
group by Company, Publisher
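Answer 1 (score: 6)
Your problem is a graph-walking problem of finding connected subgraphs. It is a little more challenging because your data structure has two types of nodes ("companies" and "publishers") rather than one.
You can solve this with a single recursive CTE. The logic is as follows.
First, convert the problem into a graph with only one type of node. I do this by making the companies the nodes and using the publisher information to create the edges between companies. This is just a join: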
select t1.company as node1, t2.company as node2
from table1 t1 join
table1 t2
on t1.publisher = t2.publisher
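(For efficiency, you could also add t1.company <> t2.company, but that is not strictly necessary.)
Now this is a "simple" graph-walking problem, where a recursive CTE is used to create all connections between two nodes. The recursive CTE walks the graph using join. Along the way, it keeps a list of all nodes visited. In SQL Server, this list needs to be stored in a string.
The code needs to ensure that it does not visit a node twice for a given path, because that can result in infinite recursion (and an error). If the query above is called edges, the CTE that generates all pairs of connected nodes looks like: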
cte as (
select e.node1, e.node2, cast('|'+e.node1+'|'+e.node2+'|' as varchar(max)) as nodes,
1 as level
from edges e
union all
select c.node1, e.node2, c.nodes+e.node2+'|', 1+c.level
from cte c join
edges e
on c.node2 = e.node1 and
c.nodes not like '%|'+e.node2+'|%' -- match whole delimited names only
)
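Now, with this list of connected nodes, assign each node the minimum of all the nodes it is connected to, including itself. This serves as an identifier for the connected subgraph. That is, all companies connected to each other through publishers will have the same minimum.
The final two steps are to enumerate this minimum (as the GroupId) and to join the GroupId back to the original data.
The full (and, I might add, tested) query looks like: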
with edges as (
select t1.company as node1, t2.company as node2
from table1 t1 join
table1 t2
on t1.publisher = t2.publisher
),
cte as (
select e.node1, e.node2,
cast('|'+e.node1+'|'+e.node2+'|' as varchar(max)) as nodes,
1 as level
from edges e
union all
select c.node1, e.node2,
c.nodes+e.node2+'|',
1+c.level
from cte c join
edges e
on c.node2 = e.node1 and
c.nodes not like '%|'+e.node2+'|%' -- match whole delimited names only
),
nodes as (
select node1,
(case when min(node2) < node1 then min(node2) else node1 end
) as grp
from cte
group by node1
)
select t.company, t.publisher, grp.GroupId
from table1 t join
(select n.node1, dense_rank() over (order by grp) as GroupId
from nodes n
) grp
on t.company = grp.node1;
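Note that this works for finding any connected subgraphs. It does not assume any particular number of levels.
EDIT:
The question of performance is vexing. At a minimum, the above query will run better with an index on Publisher. Better yet is to take @MikaelEriksson's suggestion and put the edges in a separate table.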
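The suggestion about a separate edge table could be sketched as follows. This is only an illustration: the temp table name #Edges and the index name are made up here, and Table1 is assumed to have the Company and Publisher columns used in the query above.

```sql
-- Hypothetical temp table: one row per pair of companies that share a publisher.
-- Self-pairs (node1 = node2) are kept so that a company with no partner
-- still appears as a node.
select distinct t1.Company as node1, t2.Company as node2
into #Edges
from Table1 as t1
inner join Table1 as t2 on t2.Publisher = t1.Publisher;

-- Index the column the recursive member joins on.
create clustered index IX_Edges_node1 on #Edges (node1, node2);
```

The recursive CTE could then read from #Edges in place of the edges CTE. Whether this helps enough on larger data has not been measured here.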
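Another question is whether you are looking for equivalence classes among companies or among publishers. I took the approach of using companies because I think it has better "explainability" (my inclination to respond was based on numerous comments saying that this could not be done with CTEs).
I am guessing that you could get reasonable performance from this, although that requires more knowledge of your data and system than the OP provides. It is quite likely, though, that the best performance will come from a multiple-query approach.
Answer 2 (score: 2)
Here is my solution: SQL Fiddle
The nature of the relationship requires looping, as far as I can tell.
Here is the SQL: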
--drop TABLE Table1
CREATE TABLE Table1
([row] int identity (1,1),GroupID INT NULL,[Company] varchar(2), [Publisher] varchar(2))
;
INSERT INTO Table1
(Company, Publisher)
select
left(newid(), 2), left(newid(), 2)
declare @i int = 1
while @i < 8
begin
;with cte(Company, Publisher) as (
select
left(newid(), 2), left(newid(), 2)
from Table1
)
insert into Table1(Company, Publisher)
select distinct c.Company, c.Publisher
from cte as c
where not exists (select * from Table1 as t where t.Company = c.Company and t.Publisher = c.Publisher)
set @i = @i + 1
end;
CREATE NONCLUSTERED INDEX IX_Temp1 on Table1 (Company)
CREATE NONCLUSTERED INDEX IX_Temp2 on Table1 (Publisher)
declare @counter int=0
declare @row int=0
declare @lastnullcount int=0
declare @currentnullcount int=0
WHILE EXISTS (
SELECT *
FROM Table1
where GroupID is null
)
BEGIN
SET @counter=@counter+1
SET @lastnullcount =0
SELECT TOP 1
@row=[row]
FROM Table1
where GroupID is null
order by [row] asc
SELECT @currentnullcount=count(*) from table1 where groupid is null
WHILE @lastnullcount <> @currentnullcount
BEGIN
SELECT @lastnullcount=count(*)
from table1
where groupid is null
UPDATE Table1
SET GroupID=@counter
WHERE [row]=@row
UPDATE t2
SET t2.GroupID=@counter
FROM Table1 t1
INNER JOIN Table1 t2 on t1.Company=t2.Company
WHERE t1.GroupID=@counter
AND t2.GroupID IS NULL
UPDATE t2
SET t2.GroupID=@counter
FROM Table1 t1
INNER JOIN Table1 t2 on t1.publisher=t2.publisher
WHERE t1.GroupID=@counter
AND t2.GroupID IS NULL
SELECT @currentnullcount=count(*)
from table1
where groupid is null
END
END
SELECT * FROM Table1
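EDIT: Added indexes, as I would expect to have on the real table, and to be more in line with the other data sets Roman is using.
Answer 3 (score: 0)
You are trying to find all connected components of your graph, which can only be done iteratively. If you know the maximum width of any connected component (i.e. the maximum number of links you have to follow to get from one company/publisher to another), you could in principle do it like this: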
SELECT
MIN(x2.groupID) AS groupID,
x1.Company,
x1.Publisher
FROM Table1 AS x1
INNER JOIN (
SELECT
MIN(x2.Company) AS groupID,
x1.Company,
x1.Publisher
FROM Table1 AS x1
INNER JOIN Table1 AS x2
ON x1.Publisher = x2.Publisher
GROUP BY
x1.Publisher,
x1.Company
) AS x2
ON x1.Company = x2.Company
GROUP BY
x1.Publisher,
x1.Company;
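You have to nest the subquery (alternating joins on company and publisher, with the deepest subquery using MIN(Company) rather than MIN(groupID)) up to the maximum iteration depth.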
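To illustrate the nesting described above, here is a sketch with one more level added, so the joins alternate publisher, company, publisher. The aliases (a, b, l1, l2, t2, t3) are made up for readability, and Table1 is assumed to have the Company and Publisher columns used elsewhere in this thread.

```sql
SELECT MIN(l2.groupID) AS groupID, t3.Company, t3.Publisher
FROM Table1 AS t3
INNER JOIN (
    -- level 2: take the minimum across rows that share a company
    SELECT MIN(l1.groupID) AS groupID, t2.Company, t2.Publisher
    FROM Table1 AS t2
    INNER JOIN (
        -- level 1 (deepest): seed each row with the smallest company
        -- that shares its publisher
        SELECT MIN(b.Company) AS groupID, a.Company, a.Publisher
        FROM Table1 AS a
        INNER JOIN Table1 AS b ON a.Publisher = b.Publisher
        GROUP BY a.Publisher, a.Company
    ) AS l1 ON t2.Company = l1.Company
    GROUP BY t2.Publisher, t2.Company
) AS l2 ON t3.Publisher = l2.Publisher   -- level 3: share a publisher again
GROUP BY t3.Publisher, t3.Company;
```

Each extra level lets the minimum travel one more link, so a component that needs more links than there are levels would end up split. Note that groupID here is a company name, not a dense integer.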
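However, I don't really recommend this; it would be cleaner to do it outside of SQL. Disclaimer: I don't know anything about SQL Server 2012 (or any other version); it may have some kind of additional scripting capability that lets you do the iteration dynamically.
Answer 4 (score: 0)
Here is a recursive solution, using XML: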
with a as ( -- recursive result, containing shorter subsets and duplicates
select cast('<c>' + company + '</c>' as xml) as companies
,cast('<p>' + publisher + '</p>' as xml) as publishers
from Table1
union all
select a.companies.query('for $c in distinct-values((for $i in /c return string($i),
sql:column("t.company")))
order by $c
return <c>{$c}</c>')
,a.publishers.query('for $p in distinct-values((for $i in /p return string($i),
sql:column("t.publisher")))
order by $p
return <p>{$p}</p>')
from a join Table1 t
on ( a.companies.exist('/c[text() = sql:column("t.company")]') = 0
or a.publishers.exist('/p[text() = sql:column("t.publisher")]') = 0)
and ( a.companies.exist('/c[text() = sql:column("t.company")]') = 1
or a.publishers.exist('/p[text() = sql:column("t.publisher")]') = 1)
), b as ( -- remove the shorter versions from earlier steps of the recursion and the duplicates
select distinct -- distinct cannot work on xml types, hence cast to nvarchar
cast(companies as nvarchar) as companies
,cast(publishers as nvarchar) as publishers
,DENSE_RANK() over(order by cast(companies as nvarchar), cast(publishers as nvarchar)) as groupid
from a
where not exists (select 1 from a as s -- s is a proper subset of a
where (cast('<s>' + cast(s.companies as varchar)
+ '</s><a>' + cast(a.companies as varchar) + '</a>' as xml)
).value('if((count(/s/c) > count(/a/c))
and (some $s in /s/c/text() satisfies
(some $a in /a/c/text() satisfies $s = $a))
) then 1 else 0', 'int') = 1
)
and not exists (select 1 from a as s -- s is a proper subset of a
where (cast('<s>' + cast(s.publishers as nvarchar)
+ '</s><a>' + cast(a.publishers as nvarchar) + '</a>' as xml)
).value('if((count(/s/p) > count(/a/p))
and (some $s in /s/p/text() satisfies
(some $a in /a/p/text() satisfies $s = $a))
) then 1 else 0', 'int') = 1
)
), c as ( -- cast back to xml
select cast(companies as xml) as companies
,cast(publishers as xml) as publishers
,groupid
from b
)
select Co.company.value('(./text())[1]', 'varchar') as company
,Pu.publisher.value('(./text())[1]', 'varchar') as publisher
,c.groupid
from c
cross apply companies.nodes('/c') as Co(company)
cross apply publishers.nodes('/p') as Pu(publisher)
where exists(select 1 from Table1 t -- restrict to only the combinations that exist in the source
where t.company = Co.company.value('(./text())[1]', 'varchar')
and t.publisher = Pu.publisher.value('(./text())[1]', 'varchar')
)
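The set of companies and the set of publishers are kept in XML fields in the intermediate steps, and some casting between xml and nvarchar is necessary because of limitations in SQL Server (such as not being able to group by, or use distinct on, XML columns).
Answer 5 (score: 0)
Late to the challenge, and since SQLFiddle seems to be down at the moment I'll have to guess your data structures. Nevertheless, it seemed like a fun challenge (and it was =), so here is what I made of it:
Setup: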
IF OBJECT_ID('t_link') IS NOT NULL DROP TABLE t_link
IF OBJECT_ID('t_company') IS NOT NULL DROP TABLE t_company
IF OBJECT_ID('t_publisher') IS NOT NULL DROP TABLE t_publisher
IF OBJECT_ID('tempdb..#link_A') IS NOT NULL DROP TABLE #link_A
IF OBJECT_ID('tempdb..#link_B') IS NOT NULL DROP TABLE #link_B
GO
CREATE TABLE t_company ( company_id int IDENTITY(1, 1) NOT NULL PRIMARY KEY,
company_name varchar(100) NOT NULL)
GO
CREATE TABLE t_publisher (publisher_id int IDENTITY(1, 1) NOT NULL PRIMARY KEY,
publisher_name varchar(100) NOT NULL)
CREATE TABLE t_link (company_id int NOT NULL FOREIGN KEY (company_id) REFERENCES t_company (company_id),
publisher_id int NOT NULL FOREIGN KEY (publisher_id) REFERENCES t_publisher (publisher_id),
PRIMARY KEY (company_id, publisher_id),
group_id int NULL
)
GO
-- example content
-- ROW GROUPID Company Publisher
--1 1 A Y
--2 1 A X
--3 1 B Y
--4 1 B Z
--5 2 C W
--6 2 C P
--7 2 D W
INSERT t_company (company_name) VALUES ('A'), ('B'), ('C'), ('D')
INSERT t_publisher (publisher_name) VALUES ('X'), ('Y'), ('Z'), ('W'), ('P')
INSERT t_link (company_id, publisher_id)
SELECT company_id, publisher_id
FROM t_company, t_publisher
WHERE (company_name = 'A' AND publisher_name = 'Y')
OR (company_name = 'A' AND publisher_name = 'X')
OR (company_name = 'B' AND publisher_name = 'Y')
OR (company_name = 'B' AND publisher_name = 'Z')
OR (company_name = 'C' AND publisher_name = 'W')
OR (company_name = 'C' AND publisher_name = 'P')
OR (company_name = 'D' AND publisher_name = 'W')
GO
/*
-- volume testing
TRUNCATE TABLE t_link
DELETE t_company
DELETE t_publisher
DECLARE @company_count int = 1000,
@publisher_count int = 450,
@links_count int = 800
INSERT t_company (company_name)
SELECT company_name = Convert(varchar(100), NewID())
FROM master.dbo.fn_int_list(1, @company_count)
UPDATE STATISTICS t_company
INSERT t_publisher (publisher_name)
SELECT publisher_name = Convert(varchar(100), NewID())
FROM master.dbo.fn_int_list(1, @publisher_count)
UPDATE STATISTICS t_publisher
-- Random links between the companies & publishers
DECLARE @count int
SELECT @count = 0
WHILE @count < @links_count
BEGIN
SELECT TOP 30 PERCENT row_id = IDENTITY(int, 1, 1), company_id = company_id + 0
INTO #link_A
FROM t_company
ORDER BY NewID()
SELECT TOP 30 PERCENT row_id = IDENTITY(int, 1, 1), publisher_id = publisher_id + 0
INTO #link_B
FROM t_publisher
ORDER BY NewID()
INSERT TOP (@links_count - @count) t_link (company_id, publisher_id)
SELECT A.company_id,
B.publisher_id
FROM #link_A A
JOIN #link_B B
ON A.row_id = B.row_id
WHERE NOT EXISTS ( SELECT *
FROM t_link old
WHERE old.company_id = A.company_id
AND old.publisher_id = B.publisher_id)
SELECT @count = @count + @@ROWCOUNT
DROP TABLE #link_A
DROP TABLE #link_B
END
*/
Actual grouping:
IF OBJECT_ID('tempdb..#links') IS NOT NULL DROP TABLE #links
GO
-- apply grouping
-- init
SELECT row_id = IDENTITY(int, 1, 1),
company_id,
publisher_id,
group_id = 0
INTO #links
FROM t_link
-- don't see an index that would be actually helpful here right-away, using row_id to avoid HEAP
CREATE CLUSTERED INDEX idx0 ON #links (row_id)
--CREATE INDEX idx1 ON #links (company_id)
--CREATE INDEX idx2 ON #links (publisher_id)
UPDATE #links
SET group_id = row_id
-- start grouping
WHILE @@ROWCOUNT > 0
BEGIN
UPDATE #links
SET group_id = new_group_id
FROM #links upd
CROSS APPLY (SELECT new_group_id = Min(group_id)
FROM #links new
WHERE new.company_id = upd.company_id
OR new.publisher_id = upd.publisher_id
) x
WHERE upd.group_id > new_group_id
-- select * from #links
END
-- remove 'holes'
UPDATE #links
SET group_id = (SELECT COUNT(DISTINCT o.group_id)
FROM #links o
WHERE o.group_id <= upd.group_id)
FROM #links upd
GO
UPDATE t_link
SET group_id = new.group_id
FROM t_link upd
LEFT OUTER JOIN #links new
ON new.company_id = upd.company_id
AND new.publisher_id = upd.publisher_id
GO
SELECT row = ROW_NUMBER() OVER (ORDER BY group_id, company_name, publisher_name),
l.group_id,
c.company_name, -- c.company_id,
p.publisher_name -- , p.publisher_id
from t_link l
JOIN t_company c
ON l.company_id = c.company_id
JOIN t_publisher p
ON p.publisher_id = l.publisher_id
ORDER BY 1
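At first glance this approach has not been tried by anyone else yet; it is interesting to see how this can be done in a variety of ways... (I preferred not to read the other answers up front, as that would spoil the puzzle =)
The results look as expected (as far as I understand the requirements and the example), and the performance isn't too shabby either, although there is no real indication of how many records this should work on; I'm not sure how it would scale, but I don't expect too many problems...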
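One way to sanity-check the result, sketched against the t_link table from the setup above: since every pair sharing a company or a publisher must land in the same group, no company_id or publisher_id should ever span two group_id values. The column aliases kind and id are made up for the illustration.

```sql
-- Returns one row per company or publisher that ended up in more than one group.
-- An empty result means no shared relationship was split across groups.
SELECT 'company' AS kind, company_id AS id
FROM t_link
GROUP BY company_id
HAVING MIN(group_id) <> MAX(group_id)
UNION ALL
SELECT 'publisher', publisher_id
FROM t_link
GROUP BY publisher_id
HAVING MIN(group_id) <> MAX(group_id);
```

This only catches groups that were split too finely; it would not notice two unrelated components being merged under one group_id.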