如何在from_node
和to_node
定义的link_tbl中对一组链接进行分区。有1000万个节点和2500万个链接(每个节点不超过20个链接)。
该图由三个不相交的子图组成。
create table link_tbl as (
select 'AY' as linkid, 'A' as from_node, 'Y' as to_node from dual union all
select 'AB', 'A', 'B' from dual union all
select 'CA', 'C', 'A' from dual union all
select 'GE', 'G', 'E' from dual union all
select 'CB', 'C', 'B' from dual union all
select 'EF', 'E', 'F' from dual union all
select 'NM', 'N', 'M' from dual
);
--compute subnetid
select * from link_tbl order by subnetid;
要获得具有subnetid
个值的结果集?
我可以在Java中使用泛洪填充的变体来将子图ID分配给图的每个节点。但这可以用SQL完成吗?
伪代码:
-对具有连续整数的节点进行排名。
-将链接表示为(node1*100M + node2) as number(19,0)
-联合将node1,node2与node2,node1并进行排序
-获取第一个节点作为锚点node
,然后添加到subgraph_nodes'
- Iterate over link table
-
添加node2 to subgraph_nodes if node1 is in (subgraph_nodes) AND node2 not in subgraph_nodes
这应将所有连接的节点添加到subgraph_nodes
这已经足够了,因为现在我可以将subgraph_id'添加到节点表中,并选择所有没有subgraph id
的节点,然后重复进行子图分析。
有1千万个节点表示为连续的整数,有2500万个链接(每个节点不超过20个链接),表示为(from_node*100M + to_node) as ID, from_node, to_node
。
答案 0 :(得分:2)
前段时间,我写了一个问题How to find all connected subgraphs of an undirected graph的答案。它是为SQL Server编写的,但是Oracle支持标准的递归查询,因此很容易将其转换为Oracle。使用特定于Oracle的结构可以更有效地编写它。
样本数据
create table link_tbl as (
select 'AY' as linkid, 'A' as from_node, 'Y' as to_node from dual union all
select 'AB', 'A', 'B' from dual union all
select 'CA', 'C', 'A' from dual union all
select 'GE', 'G', 'E' from dual union all
select 'CB', 'C', 'B' from dual union all
select 'EF', 'E', 'F' from dual union all
select 'NM', 'N', 'M' from dual
);
查询
WITH
CTE_Nodes
AS
(
SELECT from_node AS Node
FROM link_tbl
UNION
SELECT to_node AS Node
FROM link_tbl
)
,CTE_Pairs
AS
(
SELECT from_node AS Node1, to_node AS Node2
FROM link_tbl
WHERE from_node <> to_node
UNION
SELECT to_node AS Node1, from_node AS Node2
FROM link_tbl
WHERE from_node <> to_node
)
,CTE_Recursive (AnchorNode, Node1, Node2, NodePath, Lvl)
AS
(
SELECT
CAST(CTE_Nodes.Node AS varchar(2000)) AS AnchorNode
, Node1
, Node2
, CAST(',' || Node1 || ',' || Node2 || ',' AS varchar(2000)) AS NodePath
, 1 AS Lvl
FROM
CTE_Pairs
INNER JOIN CTE_Nodes ON CTE_Nodes.Node = CTE_Pairs.Node1
UNION ALL
SELECT
CTE_Recursive.AnchorNode
, CTE_Pairs.Node1
, CTE_Pairs.Node2
, CAST(CTE_Recursive.NodePath || CTE_Pairs.Node2 || ',' AS varchar(2000)) AS NodePath
, CTE_Recursive.Lvl + 1 AS Lvl
FROM
CTE_Pairs
INNER JOIN CTE_Recursive ON CTE_Recursive.Node2 = CTE_Pairs.Node1
WHERE
CTE_Recursive.NodePath NOT LIKE CAST('%,' || CTE_Pairs.Node2 || ',%' AS varchar(2000))
)
,CTE_RecursionResult
AS
(
SELECT AnchorNode, Node1, Node2
FROM CTE_Recursive
)
,CTE_CleanResult
AS
(
SELECT AnchorNode, Node1 AS Node
FROM CTE_RecursionResult
UNION
SELECT AnchorNode, Node2 AS Node
FROM CTE_RecursionResult
)
SELECT
CTE_Nodes.Node
,LISTAGG(CTE_CleanResult.Node, ',') WITHIN GROUP (ORDER BY CTE_CleanResult.Node) AS GroupMembers
,DENSE_RANK() OVER (ORDER BY LISTAGG(CTE_CleanResult.Node, ',') WITHIN GROUP (ORDER BY CTE_CleanResult.Node)) AS GroupID
FROM
CTE_Nodes
INNER JOIN CTE_CleanResult ON CTE_CleanResult.AnchorNode = CTE_Nodes.Node
GROUP BY
CTE_Nodes.Node
ORDER BY
GroupID
,CTE_Nodes.Node
;
结果
+------+--------------+---------+
| NODE | GROUPMEMBERS | GROUPID |
+------+--------------+---------+
| A | A,B,C,Y | 1 |
| B | A,B,C,Y | 1 |
| C | A,B,C,Y | 1 |
| Y | A,B,C,Y | 1 |
| E | E,F,G | 2 |
| F | E,F,G | 2 |
| G | E,F,G | 2 |
| M | M,N | 3 |
| N | M,N | 3 |
+------+--------------+---------+
https://dbfiddle.uk/?rdbms=oracle_11.2&fiddle=e61cf73824e7718a4686430ccd7398e7
CTE_Nodes
CTE_Nodes
给出了同时出现在from_node
和to_node
列中的所有节点的列表。
由于它们可以按任何顺序出现,因此我们UNION
将两列一起。 UNION
还会删除所有重复项。
+------+
| NODE |
+------+
| A |
| B |
| C |
| E |
| F |
| G |
| M |
| N |
| Y |
+------+
CTE_Pairs
CTE_Pairs
给出了两个方向上图形所有边的列表。同样,UNION
用于删除所有重复项。
+-------+-------+
| NODE1 | NODE2 |
+-------+-------+
| A | B |
| A | C |
| A | Y |
| B | A |
| B | C |
| C | A |
| C | B |
| E | F |
| E | G |
| F | E |
| G | E |
| M | N |
| N | M |
| Y | A |
+-------+-------+
CTE_Recursive
CTE_Recursive
是查询的主要部分,它从每个唯一的Node开始递归遍历该图。
这些起始行由UNION ALL
的第一部分产生。
UNION ALL
的第二部分递归连接到自身,将Node2
链接到Node1
。
由于我们预先制作了CTE_Pairs
,并且所有边都在两个方向上写入,因此我们始终只能将Node2
链接到Node1
,并且将在图形中获得所有路径。
同时,查询将生成NodePath
-到目前为止已遍历的用逗号分隔的节点字符串。
它在WHERE
过滤器中使用:
CTE_Recursive.NodePath NOT LIKE CAST('%,' || CTE_Pairs.Node2 || ',%' AS varchar(2000))
当我们遇到之前包含在Path中的Node时,由于连接的节点列表已用尽,因此递归停止。
AnchorNode
是递归的起始节点,以后将用于对结果进行分组。
Lvl
并未真正使用,我将其包括在内是为了更好地了解正在发生的事情。
+------------+-------+-------+-----------+-----+
| ANCHORNODE | NODE1 | NODE2 | NODEPATH | LVL |
+------------+-------+-------+-----------+-----+
| A | A | Y | ,A,Y, | 1 |
| A | A | C | ,A,C, | 1 |
| A | A | B | ,A,B, | 1 |
| B | B | C | ,B,C, | 1 |
| B | B | A | ,B,A, | 1 |
| C | C | B | ,C,B, | 1 |
| C | C | A | ,C,A, | 1 |
| E | E | G | ,E,G, | 1 |
| E | E | F | ,E,F, | 1 |
| F | F | E | ,F,E, | 1 |
| G | G | E | ,G,E, | 1 |
| M | M | N | ,M,N, | 1 |
| N | N | M | ,N,M, | 1 |
| Y | Y | A | ,Y,A, | 1 |
| Y | A | B | ,Y,A,B, | 2 |
| C | A | B | ,C,A,B, | 2 |
| Y | A | C | ,Y,A,C, | 2 |
| B | A | C | ,B,A,C, | 2 |
| C | A | Y | ,C,A,Y, | 2 |
| B | A | Y | ,B,A,Y, | 2 |
| C | B | A | ,C,B,A, | 2 |
| A | B | C | ,A,B,C, | 2 |
| B | C | A | ,B,C,A, | 2 |
| A | C | B | ,A,C,B, | 2 |
| G | E | F | ,G,E,F, | 2 |
| F | E | G | ,F,E,G, | 2 |
| B | A | Y | ,B,C,A,Y, | 3 |
| C | A | Y | ,C,B,A,Y, | 3 |
| Y | B | C | ,Y,A,B,C, | 3 |
| Y | C | B | ,Y,A,C,B, | 3 |
+------------+-------+-------+-----------+-----+
CTE_CleanResult
CTE_CleanResult
仅保留CTE_Recursive
中的相关部分,并再次使用Node1
合并Node2
和UNION
。
+------------+------+
| ANCHORNODE | NODE |
+------------+------+
| A | A |
| A | B |
| A | C |
| A | Y |
| B | A |
| B | B |
| B | C |
| B | Y |
| C | A |
| C | B |
| C | C |
| C | Y |
| E | E |
| E | F |
| E | G |
| F | E |
| F | F |
| F | G |
| G | E |
| G | F |
| G | G |
| M | M |
| M | N |
| N | M |
| N | N |
| Y | A |
| Y | B |
| Y | C |
| Y | Y |
+------------+------+
最终选择
现在,我们需要为每个Node
建立一个用逗号分隔的AnchorNode
值的字符串。
LISTAGG
做到了。
DENSE_RANK()
为每个GroupID
计算AnchorNode
个数字。
您的表相当大,因此尝试在上面的单个查询中一次查找所有组可能效率很低。
提高效率的一种方法是不对整个数据集使用单个查询。不要尝试同时查找所有子网/组。将起点限制为一个节点。将WHERE CTE_Nodes.Node = 'some node'
添加到CTE_Recursive
的第一部分。该查询将找到一个子网的所有节点。从大表中删除这些找到的节点,选择另一个起始节点,循环重复直到大表为空。如果将CTE_Nodes
和CTE_Pairs
实例化为带有索引的临时表,则也可能会有所帮助。
我从没与Oracle合作过,也不知道它对程序代码的语法,所以我将在下面用伪代码编写。
准备临时表
CREATE TABLE Nodes AS
(
SELECT from_node AS Node
FROM link_tbl
UNION
SELECT to_node AS Node
FROM link_tbl
);
CREATE INDEX IX_Node ON Nodes (Node);
CREATE TABLE Pairs AS
(
SELECT from_node AS Node1, to_node AS Node2
FROM link_tbl
WHERE from_node <> to_node
UNION
SELECT to_node AS Node1, from_node AS Node2
FROM link_tbl
WHERE from_node <> to_node
);
CREATE INDEX IX_Node1 ON Pairs (Node1);
CREATE INDEX IX_Node2 ON Pairs (Node2);
CREATE TABLE Subgraph AS
(
SELECT Node FROM Nodes WHERE 1=0
);
CREATE TABLE Result
(
GroupID int NOT NULL,
Node varchar(10) NOT NULL
);
SET :GroupID = 0;
主循环开始
从Node
中选择一个Nodes
。任何行都可以。将Node
保存到变量中。再次,我不知道它的正确Oracle语法。
SELECT :N = Node FROM Nodes WHERE rownum=1;
如果Nodes
为空,则停止循环。
SET :GroupID = :GroupID + 1;
运行递归查询,从上面选择的一个特定Node
开始递归。
INSERT INTO Subgraph (Node)
WITH
CTE_Recursive (AnchorNode, Node1, Node2, NodePath, Lvl)
AS
(
SELECT
CAST(Nodes.Node AS varchar(2000)) AS AnchorNode
, Node1
, Node2
, CAST(',' || Node1 || ',' || Node2 || ',' AS varchar(2000)) AS NodePath
, 1 AS Lvl
FROM
Pairs
INNER JOIN Nodes ON Nodes.Node = Pairs.Node1
WHERE
Nodes.Node = :N -- 'A'
-- use variable here, don't know what the syntax is
UNION ALL
SELECT
CTE_Recursive.AnchorNode
, Pairs.Node1
, Pairs.Node2
, CAST(CTE_Recursive.NodePath || Pairs.Node2 || ',' AS varchar(2000)) AS NodePath
, CTE_Recursive.Lvl + 1 AS Lvl
FROM
Pairs
INNER JOIN CTE_Recursive ON CTE_Recursive.Node2 = Pairs.Node1
WHERE
CTE_Recursive.NodePath NOT LIKE CAST('%,' || Pairs.Node2 || ',%' AS varchar(2000))
)
,CTE_Result
AS
(
SELECT Node1 AS Node
FROM CTE_Recursive
UNION
SELECT Node2 AS Node
FROM CTE_Recursive
)
SELECT Node FROM CTE_Result
;
此查询将返回连接到给定起始节点的所有节点,即形成子图的那些节点。
将其结果集保存到Subgraph
临时表中。
将结果追加到最终Result
表中,并为找到的子图分配ID。
INSERT INTO Result (GroupID, Node)
SELECT :GroupID, Node
FROM Subgraph;
从Nodes
和Pairs
中删除已处理的节点。
DELETE FROM Nodes
WHERE Node IN (SELECT Node FROM Subgraph)
;
DELETE FROM Pairs
WHERE
Node1 IN (SELECT Node FROM Subgraph)
OR
Node2 IN (SELECT Node FROM Subgraph)
;
清理Subgraph
表
DELETE FROM Subgraph;
返回循环的开始。
在Result
表的最后,所有节点都具有子图的相应ID。
实际上,您可以进一步简化它。您不需要Nodes
表,Pairs
就足够了。