Multiple self-joins to find transitive subsets (cliques)

Time: 2019-07-10 23:14:45

Tags: tsql graph-theory self-join sql-server-2017

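In simplest terms, I have a table that represents a relation. The rows of the table represent pairs in my relation. In other words, the first row indicates that id 1 is related to id 4 and that id 4 is related to id 1. Hopefully it is not hard to see that my relation is symmetric, although the table shows this symmetry in a compact form.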

+-----+-----+  
| id1 | id2 |  
+-----+-----+  
|   1 |   4 |  
|   3 |   1 |  
|   2 |   1 |  
|   2 |   3 |  
|   2 |   4 |  
|   5 |   1 |  
+-----+-----+

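Edit: The table is meant to concisely show the following relation:
{(1,4),(4,1),(3,1),(1,3),(2,1),(1,2),(2,3),(3,2),(2,4),(4,2),(5,1),(1,5)}. It can be visualized by the undirected graph below.
[Figure: undirected graph of the relation represented by the Test table]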
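
The expansion from the compact table to the full symmetric relation can be sketched in T-SQL. This is only a minimal sketch, assuming the six sample rows; the CTE name Pairs and the column aliases a and b are illustrative and not part of the original schema.

```sql
-- Pairs is a stand-in for the Test table, so the sketch runs on its own.
WITH Pairs (id1, id2) AS
(
    SELECT id1, id2
    FROM (VALUES (1,4),(3,1),(2,1),(2,3),(2,4),(5,1)) AS v (id1, id2)
)
-- Each stored row (id1, id2) also stands for (id2, id1).
-- UNION (not UNION ALL) removes any duplicate pairs.
SELECT id1 AS a, id2 AS b FROM Pairs
UNION
SELECT id2 AS a, id1 AS b FROM Pairs
ORDER BY a, b;
```

With the sample rows this yields 12 pairs, matching the set listed above.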

CREATE TABLE Test (
id1 int not null,
id2 int not null);

INSERT INTO Test
VALUES
(1,4),
(3,1),
(2,1),
(2,3),
(2,4),
(5,1);

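I want to identify the transitive subsets (cliques) in my table.
Edit: For example, I want to identify the transitive subset demonstrated by the fact that id 3 being related to id 1 and id 1 being related to id 2 implies that id 3 is related to id 2. (In the picture of the undirected graph, these can be seen as triangles. In the best case, though, I would like to be able to list other complete subgraphs larger than triangles, if they exist in the original table/graph.)

I tried the following, but the result set is larger than I want. I hope there is a simpler way.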

select t1.id1, t1.id2, t2.id1, t2.id2, t3.id1, t3.id2
from test as t1
    join test as t2
        on t1.id1 = t2.id1
        or t1.id2 = t2.id2
        or t1.id1 = t2.id2
        or t1.id2 = t2.id1
    join test as t3
        on t2.id1 = t3.id1
        or t2.id2 = t3.id2
        or t2.id1 = t3.id2
        or t2.id2 = t3.id1
where
    not
    (
        t1.id1 = t2.id1
        and
        t1.id2 = t2.id2
    )
    and not
    (
        t2.id1 = t3.id1
        and
        t2.id2 = t3.id2
    )
    and not
    (
        t1.id1 = t3.id1
        and
        t1.id2 = t3.id2
    )
    and
    (
        (
            t3.id1 = t1.id1
            or
            t3.id1 = t1.id2
            or
            t3.id1 = t2.id1
            or
            t3.id1 = t2.id2
        )
        and
        (
            t3.id2 = t1.id1
            or
            t3.id2 = t1.id2
            or
            t3.id2 = t2.id1
            or
            t3.id2 = t2.id2
        )
    );

Actual output:

+-----+-----+-----+-----+-----+-----+
| id1 | id2 | id1 | id2 | id1 | id2 |
+-----+-----+-----+-----+-----+-----+
|   1 |   4 |   2 |   4 |   2 |   1 |
|   1 |   4 |   2 |   1 |   2 |   4 |
|   3 |   1 |   2 |   3 |   2 |   1 |
|   3 |   1 |   2 |   1 |   2 |   3 |
|   2 |   1 |   2 |   4 |   1 |   4 |
|   2 |   1 |   2 |   3 |   3 |   1 |
|   2 |   1 |   3 |   1 |   2 |   3 |
|   2 |   1 |   1 |   4 |   2 |   4 |
|   2 |   3 |   2 |   1 |   3 |   1 |
|   2 |   3 |   3 |   1 |   2 |   1 |
|   2 |   4 |   2 |   1 |   1 |   4 |
|   2 |   4 |   1 |   4 |   2 |   1 |
+-----+-----+-----+-----+-----+-----+

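The expected result set would have only two rows. Each row would represent a transitive relation that is a subset of the original relation.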

╔═════╦═════╦═════╦═════╦═════╦═════╗
║ id1 ║ id2 ║ id1 ║ id2 ║ id1 ║ id2 ║
╠═════╬═════╬═════╬═════╬═════╬═════╣
║   1 ║   4 ║   2 ║   4 ║   2 ║   1 ║
║   3 ║   1 ║   2 ║   1 ║   2 ║   3 ║
╚═════╩═════╩═════╩═════╩═════╩═════╝

Edit: The expected output could also look like

╔═════╦═════╦═════╗
║ id1 ║ id2 ║ id3 ║
╠═════╬═════╬═════╣
║   1 ║   4 ║   2 ║
║   3 ║   1 ║   2 ║
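╚═════╩═════╩═════╝

which is simpler. I just need to show the fact that the sets
{(1,4),(4,1),(2,4),(4,2),(2,1),(1,2)}
and
{(3,1),(1,3),(2,1),(1,2),(2,3),(3,2)}
are proper subsets of the original relation and are themselves transitive relations. The definition I am using is that a relation R is transitive if and only if ∀a∀b∀c((a,b)∈R ∧ (b,c)∈R → (a,c)∈R). In other words, I am trying to find all subgraphs that are also complete graphs.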

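I am new to graph theory, but it seems that my problem is similar to the clique problem, where I am looking for cliques containing 3 or more vertices. I would accept as an answer a solution that returns only cliques with 3 vertices. My question is similar to this question. However, the solutions proposed there do not seem to use the definition of a clique that I want, in which every vertex is connected to every other vertex in the clique.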
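
For the 3-vertex case, one possible approach is a three-way self-join over normalized edges. The following is a minimal sketch under stated assumptions, not a tuned solution: the CTE names Pairs and Edge and the output column id3 are illustrative, and the VALUES list simply repeats the sample data so the query runs on its own.

```sql
-- Pairs is a stand-in for the Test table.
WITH Pairs (id1, id2) AS
(
    SELECT id1, id2
    FROM (VALUES (1,4),(3,1),(2,1),(2,3),(2,4),(5,1)) AS v (id1, id2)
),
Edge (a, b) AS
(
    -- Store each undirected edge once, smaller id first,
    -- and ignore self-loops.
    SELECT DISTINCT
        CASE WHEN id1 < id2 THEN id1 ELSE id2 END,
        CASE WHEN id1 < id2 THEN id2 ELSE id1 END
    FROM Pairs
    WHERE id1 <> id2
)
-- A triangle is three vertices a < b < c with edges (a,b), (b,c), (a,c).
SELECT e1.a AS id1, e1.b AS id2, e2.b AS id3
FROM Edge AS e1
JOIN Edge AS e2
    ON e2.a = e1.b                      -- edge (b, c)
JOIN Edge AS e3
    ON e3.a = e1.a AND e3.b = e2.b      -- edge (a, c) closes the triangle
ORDER BY id1, id2, id3;
```

Because every edge is stored with the smaller id first, each triangle appears exactly once. On the sample data this returns the rows (1, 2, 3) and (1, 2, 4), which are the same two vertex sets as in the expected output.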

Here is an algorithm I found that uses Java. Hopefully it will help with an implementation in SQL.

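2 Answers:

Answer 0 (score: 2)

In the past I needed to create clusters of data using transitive closure. The best way to do that was with SQLCLR. Here is the code on GitHub (it also links to an article with the details):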

https://github.com/yorek/non-scalar-uda-transitive-closure

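That could be a good starting point. Can you also be more precise about the result you expect for the input data in your sample?

Answer 1 (score: 0)

Here is the solution. It is based on the idea that a complete graph contains all the possible combinations of its subgraphs. The code is below; I will comment on it in detail over the weekend, but in case I cannot, at least you have the code by Monday. Be aware that this is a brute-force approach, and it will not work if you need graphs with more than 30 nodes. I still think it is a nice example of "lateral thinking". Enjoy: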

/*
    Create a table holding the graph data.
    Id1 and Id2 are vertices of the graph.
    (Id1, Id2) represents an edge.

    https://stackoverflow.com/questions/56979737/multiple-self-joins-to-find-transitive-subsets-cliques/56979901#56979901
*/
DROP TABLE IF EXISTS #Graph;
CREATE TABLE #Graph (Id1 INT, Id2 INT);
INSERT INTO 
    #Graph
VALUES
    (1,2)
    ,(1,3)
    ,(1,4)
    ,(2,3)
    ,(2,4)
    ,(5,1)
    --,(4,3) -- Uncomment this to create a complete subgraph of 4 vertices
;
GO

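/*
    Create a Numbers table.
    TOP (100000) is enough to cover every bitmask for up to 16 vertices
    (2^16 = 65536); raise it for larger graphs.
*/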
DROP TABLE IF EXISTS #Numbers;
SELECT TOP (100000)
    ROW_NUMBER() OVER(ORDER BY A.[object_id]) AS Num
INTO 
    #Numbers
FROM 
    sys.[all_columns] a CROSS JOIN sys.[all_columns] b
ORDER BY 
    Num
GO

/*
    Make sure Id1 is always lower than Id2.
    This can be done because the graph is undirected.
*/

DROP TABLE IF EXISTS #Graph2;
SELECT DISTINCT
    CASE WHEN Id1<Id2 THEN Id1 ELSE Id2 END AS Id1,  
    CASE WHEN Id1<Id2 THEN Id2 ELSE Id1 END AS Id2  
INTO
    #Graph2
FROM 
    #Graph;
GO

/*
    Turn edges into single columns
*/
DROP TABLE IF EXISTS #Graph3;
SELECT 
    CAST(Id1 AS VARCHAR(MAX)) + '>'  + CAST(Id2 AS VARCHAR(MAX)) COLLATE Latin1_General_BIN2 AS [Edge] 
INTO 
    #Graph3 
FROM 
    #Graph2;

/*
    Get the list of all the unique vertices
*/
DROP TABLE IF EXISTS #Vertex;
WITH cte AS
(
    SELECT Id1 AS Id FROM #Graph
    UNION 
    SELECT Id2 AS Id FROM #Graph
)
SELECT * INTO #Vertex FROM cte;

/*
    Given a complete graph with "n" vertices,
    calculate all the possible complete cyclic subgraphs
*/
-- From https://stackoverflow.com/questions/3686062/generate-all-combinations-in-sql
-- and changed to return all combinations of complete cyclic subgraphs
DROP TABLE IF EXISTS #AllCyclicVertex;
DECLARE @VertexCount INT = (SELECT COUNT(*) FROM [#Vertex]);
WITH Nums AS 
(
    SELECT 
        Num
    FROM 
        #Numbers
    WHERE 
        Num BETWEEN 0 AND POWER(2, @VertexCount) - 1
), BaseSet AS 
(
    SELECT 
        I = POWER(2, ROW_NUMBER() OVER (ORDER BY [Id]) - 1), *
   FROM 
        [#Vertex]
), Combos AS 
(
    SELECT
        CombId = N.Num,
        S.Id,
        K = COUNT(*) OVER (PARTITION BY N.Num)
   FROM
        Nums AS N
    INNER JOIN 
        BaseSet AS S ON N.Num & S.I <> 0
)
SELECT
    DENSE_RANK() OVER (ORDER BY K, [CombID]) AS CombNum,
    K,
    Id
INTO
    #AllCyclicVertex
FROM 
    Combos
WHERE 
    K BETWEEN 3 AND @VertexCount
ORDER BY 
    CombNum, Id;
GO

--SELECT * FROM [#AllCyclicVertex]

/*
    Calculate the edges for the calculated cyclic graphs
*/
DROP TABLE IF EXISTS #WellKnownPatterns;
CREATE TABLE #WellKnownPatterns ([Name] VARCHAR(100), [Id1] INT, [Id2] INT, [Edge] VARCHAR(100) COLLATE Latin1_General_BIN2);

INSERT INTO #WellKnownPatterns 
    ([Name], [Id1], [Id2], [Edge])
SELECT 
    CAST(a.[CombNum] AS VARCHAR(100)) + '/' + CAST(a.[K] AS VARCHAR(100)),
    a.Id AS Id1, 
    b.Id AS Id2,
    CAST(a.[Id] AS VARCHAR(MAX)) + '>'  + CAST(b.[Id] AS VARCHAR(MAX)) AS [Edge]
FROM 
    #AllCyclicVertex a 
INNER JOIN 
    #AllCyclicVertex b ON b.id > a.id AND a.[CombNum] = b.[CombNum]
;

-- SELECT * FROM [#WellKnownPatterns]

/*
    Now take from the original set only those
    that are an EXACT RELATIONAL DIVISION of a well-known cyclic graph
*/

WITH cte AS
(
    SELECT * FROM #Graph3
),
cte2 AS 
(
    SELECT 
        COUNT(*) OVER (PARTITION BY [Name]) AS [EdgeCount],
        * 
    FROM 
        #WellKnownPatterns
)
SELECT
    T1.[Name]
FROM
    cte2 AS T1
LEFT OUTER JOIN
    cte AS S ON T1.[Edge] = S.[Edge]
GROUP BY
    T1.[Name]
HAVING 
    COUNT(S.[Edge]) = MIN(T1.[EdgeCount])
GO

-- Test a solution
SELECT * FROM [#WellKnownPatterns] WHERE [Name] = '1/3'