将多个到多个关系中的所有相关记录分组,SQL图连接组件

时间:2013-09-04 16:17:38

标签: sql sql-server sql-server-2012

希望我错过了一个简单的解决方案。

我有两张桌子。一个包含公司列表。第二个包含发布者列表。两者之间的映射是多对多的。我想要做的是捆绑或分组表A中与表B中的发布者有任何关系的所有公司,反之亦然。

最终结果看起来像这样(GROUPID是关键字段)。第1行和第2行属于同一组,因为它们共享同一家公司。第3行位于同一组中,因为发布者Y已映射到公司A.第4行位于组中,因为公司B已通过发布者Y映射到第1组。

简单地说,只要公司和发布商之间存在任何形式的共享关系,就应该将该对分配给同一组。

ROW   GROUPID     Company     Publisher
1     1           A           Y
2     1           A           X
3     1           B           Y
4     1           B           Z
5     2           C           W
6     2           C           P
7     2           D           W

Fiddle

更新
我的赏金版本:鉴于以上简称CompanyPublisher对的表格,请填充上方的GROUPID字段。可以将其视为创建包含所有相关父母/子女的Family ID。

SQL Server 2012

6 个答案:

答案 0 :(得分:10)

我考虑过使用recursive CTE,但据我所知,SQL Server中不可能使用UNION来连接锚成员和递归CTE的递归成员(我认为这是可能的)在PostgreSQL中做,所以不可能消除重复。

declare @i int

with cte as (
     select
         GroupID,
         row_number() over(order by Company) as rn
     from Table1
)
update cte set GroupID = rn

select @i = @@rowcount

-- while some rows updated
while @i > 0
begin
    update T1 set
        GroupID = T2.GroupID
    from Table1 as T1
        inner join (
            select T2.Company, min(T2.GroupID) as GroupID
            from Table1 as T2
            group by T2.Company
        ) as T2 on T2.Company = T1.Company
    where T1.GroupID > T2.GroupID

    select @i = @@rowcount

    update T1 set
        GroupID = T2.GroupID
    from Table1 as T1
        inner join (
            select T2.Publisher, min(T2.GroupID) as GroupID
            from Table1 as T2
            group by T2.Publisher
        ) as T2 on T2.Publisher = T1.Publisher
    where T1.GroupID > T2.GroupID

    -- will be > 0 if any rows updated
    select @i = @i + @@rowcount
end

;with cte as (
     select
         GroupID,
         dense_rank() over(order by GroupID) as rn
     from Table1
)
update cte set GroupID = rn

<强> sql fiddle demo

我也尝试了广度优先搜索算法。我认为它可能更快(在复杂性方面更好),所以我将在这里提供一个解决方案。我发现它并不比SQL方法快:

declare @Company nvarchar(2), @Publisher nvarchar(2), @GroupID int

declare @Queue table (
    Company nvarchar(2), Publisher nvarchar(2), ID int identity(1, 1),
    primary key(Company, Publisher)
)

select @GroupID = 0

while 1 = 1
begin
    select top 1 @Company = Company, @Publisher = Publisher
    from Table1
    where GroupID is null

    if @@rowcount = 0 break

    select @GroupID = @GroupID + 1

    insert into @Queue(Company, Publisher)
    select @Company, @Publisher

    while 1 = 1
    begin
        select top 1 @Company = Company, @Publisher = Publisher
        from @Queue
        order by ID asc

        if @@rowcount = 0 break

        update Table1 set
            GroupID = @GroupID
        where Company = @Company and Publisher = @Publisher

        delete from @Queue where Company = @Company and Publisher = @Publisher

        ;with cte as (
            select Company, Publisher from Table1 where Company = @Company and GroupID is null
            union all
            select Company, Publisher from Table1 where Publisher = @Publisher and GroupID is null
        )
        insert into @Queue(Company, Publisher)
        select distinct c.Company, c.Publisher
        from cte as c
        where not exists (select * from @Queue as q where q.Company = c.Company and q.Publisher = c.Publisher)
   end
end

<强> sql fiddle demo

我已经测试了我的版本和Gordon Linoff,以检查它的表现。看起来CTE要糟糕得多,我已经迫不及待地等待了1000多行。

这是sql fiddle demo随机数据。我的结果是:
128行: 我的RBAR解决方案: 190ms 我的SQL解决方案: 27ms
Gordon Linoff的解决方案: 958ms
256行: 我的RBAR解决方案: 560ms 我的SQL解决方案: 1226ms
Gordon Linoff的解决方案: 45371ms

这是随机数据,因此结果可能不是很一致。我认为时间可以通过索引来改变,但不要认为它可以改变整体情况。

版本 - 使用临时表,只需在不触及初始表的情况下计算GroupID:

declare @i int

-- creating table to gather all possible GroupID for each row
create table #Temp
(
    Company varchar(1), Publisher varchar(1), GroupID varchar(1),
    primary key (Company, Publisher, GroupID)
)

-- initializing it with data
insert into #Temp (Company, Publisher, GroupID)
select Company, Publisher, Company
from Table1

select @i = @@rowcount

-- while some rows inserted into #Temp
while @i > 0
begin
    -- expand #Temp in both directions
    ;with cte as (
        select
            T2.Company, T1.Publisher,
            T1.GroupID as GroupID1, T2.GroupID as GroupID2
        from #Temp as T1
            inner join #Temp as T2 on T2.Company = T1.Company
        union
        select
            T1.Company, T2.Publisher,
            T1.GroupID as GroupID1, T2.GroupID as GroupID2
        from #Temp as T1
            inner join #Temp as T2 on T2.Publisher = T1.Publisher        
    ), cte2 as (
        select
            Company, Publisher,
            case when GroupID1 < GroupID2 then GroupID1 else GroupID2 end as GroupID
        from cte
    )
    insert into #Temp
    select Company, Publisher, GroupID
    from cte2
    -- don't insert duplicates
    except
    select Company, Publisher, GroupID
    from #Temp

    -- will be > 0 if any row inserted
    select @i = @@rowcount
end

select
    Company, Publisher,
    dense_rank() over(order by min(GroupID)) as GroupID
from #Temp
group by Company, Publisher

=> sql fiddle example

答案 1 :(得分:6)

你的问题是寻找连通子图的图形行走问题。这更具挑战性,因为您的数据结构有两种类型的节点(“公司”和“发布者”),而不是一种类型。

您可以使用单个递归CTE解决此问题。逻辑如下。

首先,将问题转换为仅包含一种节点类型的图形。我这样做是通过使用发布者信息使节点公司和公司之间的边缘链接。这只是一个加入:

      select t1.company as node1, t2.company as node2
      from table1 t1 join
           table1 t2
           on t1.publisher = t2.publisher
     )

(为了提高效率,您还可以添加t1.company <> t2.company,但这并非绝对必要。)

现在,这是一个“简单”的图形行走问题,其中递归CTE用于创建两个节点之间的所有连接。递归CTE使用join遍历图表。在此过程中,它会保留所有访问过的节点的列表。在SQL Server中,这需要存储在字符串中。

代码需要确保它不会为给定路径访问节点两次,因为这会导致无限递归(和错误)。如果以上称为edges,则生成所有连接节点对的CTE如下所示:

     cte as (
      select e.node1, e.node2, cast('|'+e.node1+'|'+e.node2+'|' as varchar(max)) as nodes,
             1 as level
      from edges e
      union all
      select c.node1, e.node2, c.nodes+e.node2+'|', 1+c.level
      from cte c join
           edges e
           on c.node2 = e.node1 and
              c.nodes not like '|%'+e.node2+'%|'
     )

现在,通过此连接节点列表,为每个节点分配其连接的所有节点中的最小节点,包括其自身。这用作连接子图的标识符。也就是说,所有通过发布商相互关联的公司都具有相同的最低限度。

最后两步是枚举此最小值(作为GroupId)并将GroupId加回原始数据。

完整(我可能会添加测试)查询看起来像:

with edges as (
      select t1.company as node1, t2.company as node2
      from table1 t1 join
           table1 t2
           on t1.publisher = t2.publisher
     ),
     cte as (
      select e.node1, e.node2,
             cast('|'+e.node1+'|'+e.node2+'|' as varchar(max)) as nodes,
             1 as level
      from edges e
      union all
      select c.node1, e.node2,
             c.nodes+e.node2+'|',
             1+c.level
      from cte c join
           edges e
           on c.node2 = e.node1 and
              c.nodes not like '|%'+e.node2+'%|'
     ),
     nodes as (
       select node1,
              (case when min(node2) < node1 then min(node2) else node1 end
              ) as grp
       from cte
       group by node1
      )
select t.company, t.publisher, grp.GroupId
from table1 t join
     (select n.node1, dense_rank() over (order by grp) as GroupId
      from nodes n
     ) grp
     on t.company = grp.node1;

请注意,这适用于查找任何连接的子图。它没有假设任何特定数量的级别。

编辑:

表现问题令人烦恼。至少,使用Publisher上的索引,上述查询将运行得更好。更好的是采取@ MikaelEriksson的建议,并把边缘放在一个单独的表中。

另一个问题是,您是否在公司或出版商中寻找等效类。我采用了使用公司的方法,因为我认为它具有更好的“可解释性”(我的回应倾向是基于许多评论,这不能用CTE来完成)。

我猜你可以从中获得合理的性能,尽管这需要比OP中提供的更多的数据和系统知识。但是,很有可能最佳性能来自多种查询方法。

答案 2 :(得分:2)

这是我的解决方案SQL Fiddle

关系的本质需要循环,如我所知。


这是SQL:

--drop TABLE Table1

CREATE TABLE Table1
    ([row] int identity (1,1),GroupID INT NULL,[Company] varchar(2), [Publisher] varchar(2))
;

INSERT INTO Table1
    (Company, Publisher)
select
    left(newid(), 2), left(newid(), 2)

declare @i int = 1

while @i < 8
begin
    ;with cte(Company, Publisher) as (
        select
            left(newid(), 2), left(newid(), 2)
        from Table1
    )
    insert into Table1(Company, Publisher)
    select distinct c.Company, c.Publisher
    from cte as c
    where not exists (select * from Table1 as t where t.Company = c.Company and t.Publisher = c.Publisher)

    set @i = @i + 1
end;


CREATE NONCLUSTERED INDEX IX_Temp1 on Table1 (Company)
CREATE NONCLUSTERED INDEX IX_Temp2 on Table1 (Publisher)

declare @counter int=0
declare @row int=0
declare @lastnullcount int=0
declare @currentnullcount int=0

WHILE EXISTS (
  SELECT *
  FROM Table1
  where GroupID is null
  )
BEGIN
    SET @counter=@counter+1
    SET @lastnullcount =0

    SELECT TOP 1
        @row=[row]
    FROM Table1
    where GroupID is null
    order by [row] asc

    SELECT @currentnullcount=count(*) from table1 where groupid is null
    WHILE @lastnullcount <> @currentnullcount
    BEGIN
        SELECT @lastnullcount=count(*)
        from table1
        where groupid is null 

        UPDATE Table1
        SET GroupID=@counter
        WHERE [row]=@row

        UPDATE t2
        SET t2.GroupID=@counter
        FROM Table1 t1
        INNER JOIN Table1 t2 on t1.Company=t2.Company
        WHERE t1.GroupID=@counter
        AND t2.GroupID IS NULL

        UPDATE t2
        SET t2.GroupID=@counter
        FROM Table1 t1
        INNER JOIN Table1 t2 on t1.publisher=t2.publisher
        WHERE t1.GroupID=@counter
        AND t2.GroupID IS NULL

        SELECT @currentnullcount=count(*)
        from table1
        where groupid is null
    END
END

SELECT * FROM Table1

编辑: 正如我在实际桌子上所期望的那样添加索引,并且更符合Roman正在使用的其他数据集。

答案 3 :(得分:0)

您正在尝试查找图表中所有已连接的组件,这些组件只能以迭代方式完成。如果您知道任何连接组件的最大宽度(即您必须从一个公司/出版商到另一个公司/出版商的最大链接数),原则上您可以这样做:

SELECT
    MIN(x2.groupID) AS groupID,
    x1.Company,
    x1.Publisher
FROM Table1 AS x1
    INNER JOIN (
        SELECT
            MIN(x2.Company) AS groupID,
            x1.Company,
            x1.Publisher
        FROM Table1 AS x1
            INNER JOIN Table1 AS x2
            ON x1.Publisher = x2.Publisher
        GROUP BY
            x1.Publisher,
            x1.Company
    ) AS x2
    ON x1.Company = x2.Company
GROUP BY
    x1.Publisher,
    x1.Company;

您必须将子查询嵌套(公司和发布者上的交替连接,以及最深的子查询,即MIN(公司)而不是MIN(groupID))到最大迭代深度。

但是,我并不是真的推荐这个;在SQL之外做这件事会更干净。

免责声明:我对SQL Server 2012(或任何其他版本)一无所知;它可能具有某种额外的脚本功能,可以让你动态地进行迭代。

答案 4 :(得分:0)

这是一个使用XML的递归解决方案:

with a as ( -- recursive result, containing shorter subsets and duplicates
    select cast('<c>' + company + '</c>' as xml) as companies
          ,cast('<p>' + publisher + '</p>' as xml) as publishers
      from Table1

    union all

    select a.companies.query('for $c in distinct-values((for $i in /c return string($i),
                                                        sql:column("t.company")))
                          order by $c
                          return <c>{$c}</c>')
          ,a.publishers.query('for $p in distinct-values((for $i in /p return string($i),
                                                         sql:column("t.publisher")))
                          order by $p
                          return <p>{$p}</p>')
    from a join Table1 t
      on (   a.companies.exist('/c[text() = sql:column("t.company")]') = 0 
          or a.publishers.exist('/p[text() = sql:column("t.publisher")]') = 0)
     and (   a.companies.exist('/c[text() = sql:column("t.company")]') = 1
          or a.publishers.exist('/p[text() = sql:column("t.publisher")]') = 1)
), b as ( -- remove the shorter versions from earlier steps of the recursion and the duplicates
    select distinct -- distinct cannot work on xml types, hence cast to nvarchar
           cast(companies as nvarchar) as companies
          ,cast(publishers as nvarchar) as publishers
          ,DENSE_RANK() over(order by cast(companies as nvarchar), cast(publishers as nvarchar)) as groupid
     from a
    where not exists (select 1 from a as s -- s is a proper subset of a
                       where (cast('<s>' + cast(s.companies as varchar)
                                 + '</s><a>' + cast(a.companies as varchar) + '</a>' as xml)
                             ).value('if((count(/s/c) > count(/a/c))
                                         and (some $s in /s/c/text() satisfies
                                             (some $a in /a/c/text() satisfies $s = $a))
                                      ) then 1 else 0', 'int') = 1
                     )
      and not exists (select 1 from a as s -- s is a proper subset of a
                       where (cast('<s>' + cast(s.publishers as nvarchar)
                                 + '</s><a>' + cast(a.publishers as nvarchar) + '</a>' as xml)
                             ).value('if((count(/s/p) > count(/a/p))
                                         and (some $s in /s/p/text() satisfies
                                             (some $a in /a/p/text() satisfies $s = $a))
                                      ) then 1 else 0', 'int') = 1
                     )
), c as (  -- cast back to xml
    select cast(companies as xml) as companies
          ,cast(publishers as xml) as publishers
          ,groupid
      from b
)
select Co.company.value('(./text())[1]', 'varchar') as company
      ,Pu.publisher.value('(./text())[1]', 'varchar') as publisher
      ,c.groupid
  from c
       cross apply companies.nodes('/c') as Co(company)
       cross apply publishers.nodes('/p') as Pu(publisher)
 where exists(select 1 from Table1 t -- restrict to only the combinations that exist in the source
               where t.company = Co.company.value('(./text())[1]', 'varchar')
                 and t.publisher = Pu.publisher.value('(./text())[1]', 'varchar')
             )

公司集和发布者集保存在中间步骤的XML字段中,由于SQL Server的某些限制(例如无法分组或使用{{},因此需要在xml和nvarchar之间进行一些转换。 1}}在XML列上。

答案 5 :(得分:0)

挑战迟到了,而且由于SQLFiddle似乎已经停止了ATM,我将不得不猜测你的数据结构。然而,它似乎是一个有趣的挑战(它是=)所以这就是我用它做的:

设定:

IF OBJECT_ID('t_link') IS NOT NULL DROP TABLE t_link
IF OBJECT_ID('t_company') IS NOT NULL DROP TABLE t_company
IF OBJECT_ID('t_publisher') IS NOT NULL DROP TABLE t_publisher
IF OBJECT_ID('tempdb..#link_A') IS NOT NULL DROP TABLE #link_A
IF OBJECT_ID('tempdb..#link_B') IS NOT NULL DROP TABLE #link_B
GO

CREATE TABLE t_company ( company_id     int IDENTITY(1, 1) NOT NULL PRIMARY KEY,
                         company_name   varchar(100) NOT NULL)

GO 

CREATE TABLE t_publisher (publisher_id     int IDENTITY(1, 1) NOT NULL PRIMARY KEY,
                          publisher_name   varchar(100) NOT NULL)

CREATE TABLE t_link (company_id int NOT NULL FOREIGN KEY (company_id) REFERENCES t_company (company_id),
                     publisher_id int NOT NULL FOREIGN KEY (publisher_id) REFERENCES t_publisher (publisher_id),
                                PRIMARY KEY (company_id, publisher_id),
                     group_id int NULL
                             )
GO

-- example content


-- ROW   GROUPID     Company     Publisher
--1     1           A           Y
--2     1           A           X
--3     1           B           Y
--4     1           B           Z
--5     2           C           W
--6     2           C           P
--7     2           D           W


INSERT t_company (company_name) VALUES ('A'), ('B'), ('C'), ('D')
INSERT t_publisher (publisher_name) VALUES ('X'), ('Y'), ('Z'), ('W'), ('P')

INSERT t_link (company_id, publisher_id)
SELECT company_id, publisher_id
  FROM t_company, t_publisher
 WHERE (company_name = 'A' AND publisher_name = 'Y')
    OR (company_name = 'A' AND publisher_name = 'X')
    OR (company_name = 'B' AND publisher_name = 'Y')
    OR (company_name = 'B' AND publisher_name = 'Z')
    OR (company_name = 'C' AND publisher_name = 'W')
    OR (company_name = 'C' AND publisher_name = 'P')
    OR (company_name = 'D' AND publisher_name = 'W')




GO

/*
-- volume testing

TRUNCATE TABLE t_link
DELETE t_company
DELETE t_publisher


DECLARE @company_count   int = 1000,
        @publisher_count int = 450,
        @links_count     int = 800


INSERT t_company (company_name)
SELECT company_name    = Convert(varchar(100), NewID())
  FROM master.dbo.fn_int_list(1, @company_count) 

UPDATE STATISTICS t_company

INSERT t_publisher (publisher_name)
SELECT publisher_name  = Convert(varchar(100), NewID())
  FROM master.dbo.fn_int_list(1, @publisher_count) 

UPDATE STATISTICS t_publisher

-- Random links between the companies & publishers

DECLARE @count int
SELECT @count = 0

WHILE @count < @links_count
    BEGIN

        SELECT TOP 30 PERCENT row_id = IDENTITY(int, 1, 1), company_id = company_id + 0
          INTO #link_A
          FROM t_company
         ORDER BY NewID()

        SELECT TOP 30 PERCENT row_id = IDENTITY(int, 1, 1), publisher_id = publisher_id + 0
          INTO #link_B
          FROM t_publisher
         ORDER BY NewID()

        INSERT TOP (@links_count - @count) t_link (company_id, publisher_id)
        SELECT A.company_id,
               B.publisher_id
          FROM #link_A A
          JOIN #link_B B
            ON A.row_id = B.row_id
         WHERE NOT EXISTS ( SELECT *
                              FROM t_link old
                             WHERE old.company_id   = A.company_id
                               AND old.publisher_id = B.publisher_id)

        SELECT @count = @count + @@ROWCOUNT

        DROP TABLE #link_A
        DROP TABLE #link_B    
    END

*/

实际分组:

IF OBJECT_ID('tempdb..#links') IS NOT NULL DROP TABLE #links
GO

-- apply grouping

-- init
SELECT row_id = IDENTITY(int, 1, 1), 
       company_id,
       publisher_id,
       group_id = 0
  INTO #links
  FROM t_link

-- don't see an index that would be actually helpful here right-away, using row_id to avoid HEAP
CREATE CLUSTERED INDEX idx0 ON #links (row_id)
--CREATE INDEX idx1 ON #links (company_id)   
--CREATE INDEX idx2 ON #links (publisher_id)

UPDATE #links
   SET group_id = row_id


-- start grouping
WHILE @@ROWCOUNT > 0
    BEGIN  
        UPDATE #links
           SET group_id = new_group_id
          FROM #links upd
          CROSS APPLY (SELECT new_group_id = Min(group_id)
                         FROM #links new
                        WHERE new.company_id   = upd.company_id
                           OR new.publisher_id = upd.publisher_id 
                                     ) x
        WHERE upd.group_id > new_group_id

        -- select * from #links
    END


-- remove 'holes'
UPDATE #links
   SET group_id = (SELECT COUNT(DISTINCT o.group_id) 
                          FROM #links o
                         WHERE o.group_id <= upd.group_id)
  FROM #links upd

GO

UPDATE t_link
   SET group_id = new.group_id
  FROM t_link upd
  LEFT OUTER JOIN #links new
               ON new.company_id = upd.company_id
              AND new.publisher_id = upd.publisher_id

GO    
SELECT row = ROW_NUMBER() OVER (ORDER BY group_id, company_name, publisher_name),
       l.group_id,
       c.company_name, -- c.company_id,
       p.publisher_name -- , p.publisher_id
 from t_link l
 JOIN t_company c
   ON l.company_id = c.company_id
 JOIN t_publisher p 
   ON p.publisher_id = l.publisher_id
 ORDER BY 1

乍一看,这种方法还没有被其他人尝试过,有趣的是看看如何以各种方式完成这项工作......(首选不要提前阅读它们,因为它会破坏这个难题=)< / p>

结果看起来像预期的那样(据我了解的要求和示例而言)并且性能也不算太糟糕,尽管没有关于这应该起作用的记录数量的实际指示;不确定它会如何扩展但不要指望太多问题......