How to assign a connected subgraph ID to each node of an undirected graph in Oracle SQL?

Time: 2019-02-12 20:12:17

Tags: sql graph oracle11g

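How can I partition the set of links in link_tbl, defined by from_node and to_node, into connected subgraphs? There are 10 million nodes and 25 million links (no more than 20 links per node).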

For example, the graph defined by the sample data below consists of three disjoint subgraphs.

  create table link_tbl as (
  select 'AY' as linkid, 'A' as from_node, 'Y' as to_node from dual union all
  select 'AB', 'A', 'B' from dual union all
  select 'CA', 'C', 'A' from dual union all      
  select 'GE', 'G', 'E' from dual union all
  select 'CB', 'C', 'B' from dual union all
  select 'EF', 'E', 'F' from dual union all
  select 'NM', 'N', 'M' from dual
  );
  --compute subnetid
  select * from link_tbl order by subnetid;

How do I get a result set with the subnetid values?

I can use a variant of flood fill in Java to assign a subgraph ID to each node of the graph. But can this be done in SQL?

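Pseudocode:
 - Rank the nodes with consecutive integers.
 - Represent each link as (node1*100M + node2) as number(19,0).
 - Union the (node1, node2) pairs with the (node2, node1) pairs and sort.
 - Take the first node as the anchor node and add it to subgraph_nodes.
 - Iterate over the link table: add node2 to subgraph_nodes if node1 is in subgraph_nodes AND node2 is not in subgraph_nodes.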

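This should add all connected nodes to subgraph_nodes.

That would be sufficient, because I can then add the subgraph_id to the node table, select all nodes that have no subgraph ID yet, and repeat the subgraph analysis.

There are 10 million nodes represented as consecutive integers and 25 million links (no more than 20 links per node), represented as (from_node*100M + to_node) as ID, from_node, to_node.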

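1 Answer:

Answer 0 (score: 2)

Some time ago I wrote an answer to the question "How to find all connected subgraphs of an undirected graph". It was written for SQL Server, but Oracle supports standard recursive queries, so it was easy to convert it to Oracle. It could probably be written more efficiently using Oracle-specific constructs.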

Sample data

create table link_tbl as (
  select 'AY' as linkid, 'A' as from_node, 'Y' as to_node from dual union all
  select 'AB', 'A', 'B' from dual union all
  select 'CA', 'C', 'A' from dual union all      
  select 'GE', 'G', 'E' from dual union all
  select 'CB', 'C', 'B' from dual union all
  select 'EF', 'E', 'F' from dual union all
  select 'NM', 'N', 'M' from dual
);

Query

WITH
CTE_Nodes
AS
(
    SELECT from_node AS Node
    FROM link_tbl

    UNION

    SELECT to_node AS Node
    FROM link_tbl
)
,CTE_Pairs
AS
(
    SELECT from_node AS Node1, to_node AS Node2
    FROM link_tbl
    WHERE from_node <> to_node

    UNION

    SELECT to_node AS Node1, from_node AS Node2
    FROM link_tbl
    WHERE from_node <> to_node
)
,CTE_Recursive (AnchorNode, Node1, Node2, NodePath, Lvl)
AS
(
    SELECT
        CAST(CTE_Nodes.Node AS varchar(2000)) AS AnchorNode
        , Node1
        , Node2
        , CAST(',' || Node1 || ',' || Node2 || ',' AS varchar(2000)) AS NodePath
        , 1 AS Lvl
    FROM 
        CTE_Pairs
        INNER JOIN CTE_Nodes ON CTE_Nodes.Node = CTE_Pairs.Node1

    UNION ALL

    SELECT 
        CTE_Recursive.AnchorNode
        , CTE_Pairs.Node1
        , CTE_Pairs.Node2
        , CAST(CTE_Recursive.NodePath || CTE_Pairs.Node2 || ',' AS varchar(2000)) AS NodePath
        , CTE_Recursive.Lvl + 1 AS Lvl
    FROM
        CTE_Pairs
        INNER JOIN CTE_Recursive ON CTE_Recursive.Node2 = CTE_Pairs.Node1
    WHERE
        CTE_Recursive.NodePath NOT LIKE CAST('%,' || CTE_Pairs.Node2 || ',%' AS varchar(2000))
)
,CTE_RecursionResult
AS
(
    SELECT AnchorNode, Node1, Node2
    FROM CTE_Recursive
)
,CTE_CleanResult
AS
(
    SELECT AnchorNode, Node1 AS Node
    FROM CTE_RecursionResult

    UNION

    SELECT AnchorNode, Node2 AS Node
    FROM CTE_RecursionResult
)
SELECT
    CTE_Nodes.Node
    ,LISTAGG(CTE_CleanResult.Node, ',') WITHIN GROUP (ORDER BY CTE_CleanResult.Node) AS GroupMembers
    ,DENSE_RANK() OVER (ORDER BY LISTAGG(CTE_CleanResult.Node, ',') WITHIN GROUP (ORDER BY CTE_CleanResult.Node)) AS GroupID
FROM
    CTE_Nodes
    INNER JOIN CTE_CleanResult ON CTE_CleanResult.AnchorNode = CTE_Nodes.Node
GROUP BY
    CTE_Nodes.Node
ORDER BY
    GroupID
    ,CTE_Nodes.Node
;

Result

+------+--------------+---------+
| NODE | GROUPMEMBERS | GROUPID |
+------+--------------+---------+
| A    | A,B,C,Y      |       1 |
| B    | A,B,C,Y      |       1 |
| C    | A,B,C,Y      |       1 |
| Y    | A,B,C,Y      |       1 |
| E    | E,F,G        |       2 |
| F    | E,F,G        |       2 |
| G    | E,F,G        |       2 |
| M    | M,N          |       3 |
| N    | M,N          |       3 |
+------+--------------+---------+

https://dbfiddle.uk/?rdbms=oracle_11.2&fiddle=e61cf73824e7718a4686430ccd7398e7

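How it works

CTE_Nodes

CTE_Nodes gives the list of all nodes that appear in either the from_node or the to_node column. Since a node can appear in either column, we UNION the two columns together. UNION also removes any duplicates.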

+------+
| NODE |
+------+
| A    |
| B    |
| C    |
| E    |
| F    |
| G    |
| M    |
| N    |
| Y    |
+------+

CTE_Pairs

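CTE_Pairs gives the list of all edges of the graph in both directions. Again, UNION is used to remove any duplicates. The WHERE from_node <> to_node filter drops self-loops.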

+-------+-------+
| NODE1 | NODE2 |
+-------+-------+
| A     | B     |
| A     | C     |
| A     | Y     |
| B     | A     |
| B     | C     |
| C     | A     |
| C     | B     |
| E     | F     |
| E     | G     |
| F     | E     |
| G     | E     |
| M     | N     |
| N     | M     |
| Y     | A     |
+-------+-------+

CTE_Recursive

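CTE_Recursive is the main part of the query. It traverses the graph recursively, starting from each unique Node. These starting rows are produced by the first part of the UNION ALL. The second part of the UNION ALL recursively joins to itself, linking Node2 to Node1. Since we pre-built CTE_Pairs with every edge written in both directions, we can always link only Node2 to Node1 and still get all paths in the graph. At the same time, the query builds NodePath, a comma-delimited string of the nodes traversed so far. It is used in the WHERE filter: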

CTE_Recursive.NodePath NOT LIKE CAST('%,' || CTE_Pairs.Node2 || ',%' AS varchar(2000))

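As soon as we reach a Node that is already included in NodePath, that branch of the recursion stops, because it has no new connected nodes to add. AnchorNode is the starting node of the recursion and is used later to group the results. Lvl is not really used; I included it to make it easier to see what is going on.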

+------------+-------+-------+-----------+-----+
| ANCHORNODE | NODE1 | NODE2 | NODEPATH  | LVL |
+------------+-------+-------+-----------+-----+
| A          | A     | Y     | ,A,Y,     |   1 |
| A          | A     | C     | ,A,C,     |   1 |
| A          | A     | B     | ,A,B,     |   1 |
| B          | B     | C     | ,B,C,     |   1 |
| B          | B     | A     | ,B,A,     |   1 |
| C          | C     | B     | ,C,B,     |   1 |
| C          | C     | A     | ,C,A,     |   1 |
| E          | E     | G     | ,E,G,     |   1 |
| E          | E     | F     | ,E,F,     |   1 |
| F          | F     | E     | ,F,E,     |   1 |
| G          | G     | E     | ,G,E,     |   1 |
| M          | M     | N     | ,M,N,     |   1 |
| N          | N     | M     | ,N,M,     |   1 |
| Y          | Y     | A     | ,Y,A,     |   1 |
| Y          | A     | B     | ,Y,A,B,   |   2 |
| C          | A     | B     | ,C,A,B,   |   2 |
| Y          | A     | C     | ,Y,A,C,   |   2 |
| B          | A     | C     | ,B,A,C,   |   2 |
| C          | A     | Y     | ,C,A,Y,   |   2 |
| B          | A     | Y     | ,B,A,Y,   |   2 |
| C          | B     | A     | ,C,B,A,   |   2 |
| A          | B     | C     | ,A,B,C,   |   2 |
| B          | C     | A     | ,B,C,A,   |   2 |
| A          | C     | B     | ,A,C,B,   |   2 |
| G          | E     | F     | ,G,E,F,   |   2 |
| F          | E     | G     | ,F,E,G,   |   2 |
| B          | A     | Y     | ,B,C,A,Y, |   3 |
| C          | A     | Y     | ,C,B,A,Y, |   3 |
| Y          | B     | C     | ,Y,A,B,C, |   3 |
| Y          | C     | B     | ,Y,A,C,B, |   3 |
+------------+-------+-------+-----------+-----+

CTE_CleanResult

CTE_CleanResult keeps only the relevant parts of CTE_Recursive and again merges Node1 and Node2 using UNION.

+------------+------+
| ANCHORNODE | NODE |
+------------+------+
| A          | A    |
| A          | B    |
| A          | C    |
| A          | Y    |
| B          | A    |
| B          | B    |
| B          | C    |
| B          | Y    |
| C          | A    |
| C          | B    |
| C          | C    |
| C          | Y    |
| E          | E    |
| E          | F    |
| E          | G    |
| F          | E    |
| F          | F    |
| F          | G    |
| G          | E    |
| G          | F    |
| G          | G    |
| M          | M    |
| M          | N    |
| N          | M    |
| N          | N    |
| Y          | A    |
| Y          | B    |
| Y          | C    |
| Y          | Y    |
+------------+------+
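
The introduction mentions that Oracle-specific constructs might express this more compactly. As one possible sketch (not part of the original answer, and not tried at the 10-million-node scale), a hierarchical query with CONNECT BY NOCYCLE should yield the same (AnchorNode, Node) pairs as CTE_CleanResult. The lowercase names are illustrative. Like the recursive CTE, it still walks individual paths, so it is only practical for small graphs.

```sql
-- Sketch only: assumes the sample link_tbl from above.
WITH pairs AS
(
    -- every edge in both directions, self-loops removed
    SELECT from_node AS node1, to_node AS node2
    FROM link_tbl
    WHERE from_node <> to_node

    UNION

    SELECT to_node AS node1, from_node AS node2
    FROM link_tbl
    WHERE from_node <> to_node
)
-- no START WITH clause: every row of pairs is used as a root
SELECT CONNECT_BY_ROOT node1 AS anchornode, node2 AS node
FROM pairs
CONNECT BY NOCYCLE PRIOR node2 = node1

UNION

-- make sure each anchor is listed as a member of its own group
SELECT node1, node1
FROM pairs
ORDER BY 1, 2;
```

CONNECT_BY_ROOT plays the role of AnchorNode, and NOCYCLE replaces the NodePath NOT LIKE filter.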

Final SELECT

Now we need to build a comma-separated string of Node values for each AnchorNode. LISTAGG does that. DENSE_RANK() then calculates the GroupID number for each AnchorNode.
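
LISTAGG is convenient for showing the members, but in Oracle 11g its result is limited to 4000 bytes, so it would raise an error for large subgraphs. A hedged alternative, assuming only a GroupID is needed, is to use the smallest member of each subgraph as its key. The sketch below is a condensed variant of the query above; GroupKey is a name introduced here for illustration.

```sql
-- Sketch only: assumes the sample link_tbl from above.
WITH
CTE_Pairs AS
(
    SELECT from_node AS Node1, to_node AS Node2
    FROM link_tbl
    WHERE from_node <> to_node

    UNION

    SELECT to_node AS Node1, from_node AS Node2
    FROM link_tbl
    WHERE from_node <> to_node
)
,CTE_Recursive (AnchorNode, Node2, NodePath) AS
(
    SELECT
        CAST(Node1 AS varchar(2000))
        , Node2
        , CAST(',' || Node1 || ',' || Node2 || ',' AS varchar(2000))
    FROM CTE_Pairs

    UNION ALL

    SELECT
        CTE_Recursive.AnchorNode
        , CTE_Pairs.Node2
        , CAST(CTE_Recursive.NodePath || CTE_Pairs.Node2 || ',' AS varchar(2000))
    FROM
        CTE_Pairs
        INNER JOIN CTE_Recursive ON CTE_Recursive.Node2 = CTE_Pairs.Node1
    WHERE
        CTE_Recursive.NodePath NOT LIKE '%,' || CTE_Pairs.Node2 || ',%'
)
,CTE_CleanResult AS
(
    SELECT AnchorNode, Node2 AS Node
    FROM CTE_Recursive

    UNION

    -- the anchor itself also belongs to its group
    SELECT AnchorNode, AnchorNode AS Node
    FROM CTE_Recursive
)
SELECT
    AnchorNode AS Node
    ,MIN(Node) AS GroupKey  -- smallest member identifies the subgraph
    ,DENSE_RANK() OVER (ORDER BY MIN(Node)) AS GroupID
FROM CTE_CleanResult
GROUP BY AnchorNode
ORDER BY GroupID, AnchorNode;
```

On the sample data the keys are A, E and M, so the GroupID values should match the result shown earlier.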


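Efficiency

Your table is rather large, so trying to find all groups at once with the single query above would probably be quite inefficient.

One way to make it more efficient is to not use a single query for the whole data set. Don't try to find all subnets/groups at the same time. Limit the starting point to one node: add WHERE CTE_Nodes.Node = 'some node' to the first part of CTE_Recursive. The query will then find all nodes of one subnet. Delete these found nodes from the big table, pick another starting node, and repeat in a loop until the big table is empty. It may also help to materialize CTE_Nodes and CTE_Pairs as temporary tables with indexes.

I have never worked with Oracle and don't know its syntax for procedural code, so treat the procedural steps below as a sketch rather than tested code.

Prepare temporary tables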

CREATE TABLE Nodes AS
(
    SELECT from_node AS Node
    FROM link_tbl

    UNION

    SELECT to_node AS Node
    FROM link_tbl
);
CREATE INDEX IX_Node ON Nodes (Node);

CREATE TABLE Pairs AS
(
    SELECT from_node AS Node1, to_node AS Node2
    FROM link_tbl
    WHERE from_node <> to_node

    UNION

    SELECT to_node AS Node1, from_node AS Node2
    FROM link_tbl
    WHERE from_node <> to_node
);
CREATE INDEX IX_Node1 ON Pairs (Node1);
CREATE INDEX IX_Node2 ON Pairs (Node2);

CREATE TABLE Subgraph AS
(
    SELECT Node FROM Nodes WHERE 1=0
);

CREATE TABLE Result
(
    GroupID int NOT NULL,
    Node varchar(10) NOT NULL
);

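VARIABLE GroupID NUMBER
EXEC :GroupID := 0

Start of the main loop

Pick one Node from Nodes. Any row will do. Save the Node into a variable. The SQL*Plus bind-variable syntax below is one way to do that in Oracle.

VARIABLE N VARCHAR2(10)
EXEC SELECT Node INTO :N FROM Nodes WHERE ROWNUM = 1

If Nodes is empty, stop the loop.

EXEC :GroupID := :GroupID + 1

Run the recursive query, starting the recursion from the one specific Node chosen above.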

INSERT INTO Subgraph (Node)
WITH
CTE_Recursive (AnchorNode, Node1, Node2, NodePath, Lvl)
AS
(
    SELECT
        CAST(Nodes.Node AS varchar(2000)) AS AnchorNode
        , Node1
        , Node2
        , CAST(',' || Node1 || ',' || Node2 || ',' AS varchar(2000)) AS NodePath
        , 1 AS Lvl
    FROM 
        Pairs
        INNER JOIN Nodes ON Nodes.Node = Pairs.Node1
    WHERE
        Nodes.Node = :N  -- bind variable holding the chosen start node, e.g. 'A'

    UNION ALL

    SELECT 
        CTE_Recursive.AnchorNode
        , Pairs.Node1
        , Pairs.Node2
        , CAST(CTE_Recursive.NodePath || Pairs.Node2 || ',' AS varchar(2000)) AS NodePath
        , CTE_Recursive.Lvl + 1 AS Lvl
    FROM
        Pairs
        INNER JOIN CTE_Recursive ON CTE_Recursive.Node2 = Pairs.Node1
    WHERE
        CTE_Recursive.NodePath NOT LIKE CAST('%,' || Pairs.Node2 || ',%' AS varchar(2000))
)
,CTE_Result
AS
(
    SELECT Node1 AS Node
    FROM CTE_Recursive

    UNION

    SELECT Node2 AS Node
    FROM CTE_Recursive
)
SELECT Node FROM CTE_Result
;

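This query returns all nodes connected to the given starting node, i.e. those that form a subgraph. Its result set is saved into the temporary Subgraph table.

Append the result to the final Result table and assign an ID to the subgraph just found.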

INSERT INTO Result (GroupID, Node)
SELECT :GroupID, Node
FROM Subgraph;

Delete the processed nodes from Nodes and Pairs.

DELETE FROM Nodes
WHERE Node IN (SELECT Node FROM Subgraph)
;

DELETE FROM Pairs
WHERE 
    Node1 IN (SELECT Node FROM Subgraph)
    OR
    Node2 IN (SELECT Node FROM Subgraph)
;

Clean up the Subgraph table.

DELETE FROM Subgraph; 

Go back to the start of the loop.
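
The loop steps above could be wrapped in a single anonymous PL/SQL block along the following lines. This is a hedged sketch, not tested against the real data: it assumes the Nodes, Pairs, Subgraph and Result tables created earlier, and the variable names v_node and v_group_id are introduced here for illustration. It also adds the start node itself to Subgraph, so a node without any pair still gets removed and the loop terminates.

```sql
-- Sketch only: assumes the Nodes, Pairs, Subgraph and Result tables from above.
DECLARE
    v_node     Nodes.Node%TYPE;     -- start node of the current subgraph
    v_group_id PLS_INTEGER := 0;    -- running subgraph ID
BEGIN
    LOOP
        -- Pick any remaining node; v_node is set to NULL once Nodes is empty.
        BEGIN
            SELECT Node INTO v_node FROM Nodes WHERE ROWNUM = 1;
        EXCEPTION
            WHEN NO_DATA_FOUND THEN
                v_node := NULL;
        END;
        EXIT WHEN v_node IS NULL;

        v_group_id := v_group_id + 1;

        -- Collect every node reachable from v_node, plus v_node itself.
        INSERT INTO Subgraph (Node)
        WITH CTE_Recursive (Node2, NodePath) AS
        (
            SELECT
                Node2
                , CAST(',' || Node1 || ',' || Node2 || ',' AS varchar(2000))
            FROM Pairs
            WHERE Node1 = v_node

            UNION ALL

            SELECT
                Pairs.Node2
                , CAST(CTE_Recursive.NodePath || Pairs.Node2 || ',' AS varchar(2000))
            FROM
                Pairs
                INNER JOIN CTE_Recursive ON CTE_Recursive.Node2 = Pairs.Node1
            WHERE
                CTE_Recursive.NodePath NOT LIKE '%,' || Pairs.Node2 || ',%'
        )
        SELECT Node2 FROM CTE_Recursive
        UNION
        SELECT v_node FROM dual;

        -- Store the subgraph under its new ID.
        INSERT INTO Result (GroupID, Node)
        SELECT v_group_id, Node FROM Subgraph;

        -- Remove the processed nodes and their edges, then reset Subgraph.
        DELETE FROM Nodes
        WHERE Node IN (SELECT Node FROM Subgraph);

        DELETE FROM Pairs
        WHERE Node1 IN (SELECT Node FROM Subgraph)
           OR Node2 IN (SELECT Node FROM Subgraph);

        DELETE FROM Subgraph;
    END LOOP;

    COMMIT;
END;
/
```

The recursive query is the same as the one shown earlier, with the AnchorNode and Lvl columns dropped because only one start node is processed per iteration.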

At the end, the Result table will contain every node with the ID of its subgraph.
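
To get back to the shape the question asks for, a subnetid per link, the Result table can be joined to link_tbl. A minimal sketch, assuming Result has been filled as described: both endpoints of a link always belong to the same subgraph, so joining on from_node alone is enough. The aliases l and r are illustrative.

```sql
-- Sketch only: assumes Result (GroupID, Node) has been populated by the loop.
SELECT
    l.linkid
    , l.from_node
    , l.to_node
    , r.GroupID AS subnetid
FROM
    link_tbl l
    INNER JOIN Result r ON r.Node = l.from_node
ORDER BY
    subnetid
    , l.linkid;
```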


Actually, you can simplify this further. You don't need the Nodes table; Pairs is enough.