I have a graph in my PostgreSQL database; for the sake of example, let's define it like this:
CREATE TABLE nodes (node_id INTEGER);
CREATE TABLE roads (road_id INTEGER, nodes INTEGER[]);
INSERT INTO nodes VALUES (1), (2), (3), (4), (5);
INSERT INTO roads VALUES (1, '{1, 2}'), (2, '{3, 4}');
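I would like to write an SQL query that returns the number of connected components of the graph. In this example the answer is 3, because nodes 1/2 are connected, nodes 3/4 are connected as well, and node 5 is not connected to anything.
I tried searching for find & union implementations in SQL, but to no avail. I then turned to CTEs, but I can't do it on my own. I was thinking of something like this: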
WITH RECURSIVE cc(iterator_id, node_id, rank, iterator) AS
(
    SELECT row_number() OVER (), n.node_id, row_number() OVER (), 1 FROM nodes AS n
    UNION ALL
    -- Something here that does the magic
)
SELECT
    COUNT(DISTINCT rank) AS no_of_cc
FROM
    cc,
    (SELECT COUNT(*) FROM nodes) AS last_iterator_id
WHERE iterator = last_iterator_id;
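In each iteration we update the rank of the rows whose iterator_id <= iterator. We iterate until iterator equals the largest iterator_id.
But I can't figure out the recursive part.
Can you help me find the number of connected components?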
Answer 0: (score: 1)
You can do this with a RECURSIVE CTE:
WITH RECURSIVE graph_search(node_id, connected_to, path, cycle) AS (
    SELECT node_id, connected_to, ARRAY[node_id], false FROM paths
    UNION
    SELECT p.node_id, p.connected_to, gs.path || p.node_id, p.node_id = ANY(gs.path)
    FROM graph_search gs JOIN paths p ON gs.connected_to = p.node_id AND NOT gs.cycle
),
paths AS (
    SELECT node_id, connected_to
    FROM (
        SELECT n.node_id, unnest(r.nodes) AS connected_to
        FROM nodes n JOIN roads r ON n.node_id = ANY(r.nodes)
    ) sub
    WHERE node_id <> connected_to
)
SELECT count(DISTINCT component)
FROM (
    SELECT node_id,
           array_agg(DISTINCT reachable_node ORDER BY reachable_node) AS component
    FROM (
        SELECT node_id, unnest(path) AS reachable_node FROM graph_search
    ) sub
    GROUP BY node_id
    UNION ALL /* need to append lonely nodes - they are components for themselves */
    SELECT node_id, ARRAY[node_id]
    FROM nodes
    WHERE node_id NOT IN (SELECT node_id FROM paths)
) sub;
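To make the idea behind the query easier to follow, here is a minimal Python sketch of the same approach: expand each road into pairs of connected nodes, collect the set of nodes reachable from every node, and count the distinct sets. The function name `count_components_by_reachability` and the dict-based input format are assumptions made for illustration; this is not a translation of the SQL and is not meant for large graphs.

```python
from collections import defaultdict


def count_components_by_reachability(nodes, roads):
    """Count connected components by comparing per-node reachable sets.

    nodes: iterable of node ids (hypothetical stand-in for the nodes table)
    roads: dict road_id -> list of node ids (stand-in for the roads table)
    """
    # Analogue of the paths CTE: every ordered pair of distinct nodes on a road.
    neighbours = defaultdict(set)
    for road_nodes in roads.values():
        for a in road_nodes:
            for b in road_nodes:
                if a != b:
                    neighbours[a].add(b)

    components = set()
    for start in nodes:
        # Depth-first walk; `seen` stops us from revisiting nodes (cycle guard).
        seen = {start}
        stack = [start]
        while stack:
            current = stack.pop()
            for nxt in neighbours[current]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        # An isolated node yields a singleton set, so it counts as its own component.
        components.add(frozenset(seen))
    return len(components)


nodes = [1, 2, 3, 4, 5]
roads = {1: [1, 2], 2: [3, 4]}
print(count_components_by_reachability(nodes, roads))  # 3
```

Every node in a component ends up with the same reachable set, which is why counting distinct sets gives the number of components.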
The ordinary (non-recursive) CTE paths creates a two-column table of pairwise connected nodes.
Answer 1: (score: 0)
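If the number of nodes is too large, the solution above will not work.
The most efficient solution (as long as you have enough RAM to hold all the data) is to read the data into memory using a language such as C or C++ and do the computation there.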
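As an illustration of what such an in-memory computation could look like, here is a minimal union-find (disjoint-set) sketch. It is written in Python rather than C or C++ purely for brevity, and the function name `count_components` and the edge-list input format are assumptions, not something taken from the answer. It also assumes every endpoint of an edge appears in `nodes`.

```python
def count_components(nodes, edges):
    """Count connected components with union-find (union by size, path halving).

    nodes: iterable of node ids
    edges: iterable of (a, b) pairs; both endpoints must be in `nodes`
    """
    parent = {n: n for n in nodes}
    size = {n: 1 for n in parent}

    def find(x):
        # Path halving: point each visited node at its grandparent.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    count = len(parent)  # start with every node as its own component
    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra == rb:
            continue  # already in the same component
        # Attach the smaller tree under the larger one.
        if size[ra] < size[rb]:
            ra, rb = rb, ra
        parent[rb] = ra
        size[ra] += size[rb]
        count -= 1  # each successful union merges two components
    return count


print(count_components([1, 2, 3, 4, 5], [(1, 2), (3, 4)]))  # 3
```

Each edge is processed once, so the edge list can be streamed from a file without keeping it in memory; only the per-node dictionaries have to fit in RAM.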
However, if the data is too large and you have no other choice, you can do it like this:
(plpgsql implementation, assuming we have a table roads(node1, node2))
-- Every node starts with its own id as its color.
CREATE TABLE node AS
    SELECT DISTINCT node1 AS id, node1 AS color
    FROM roads;
CREATE OR REPLACE FUNCTION merge_node()
RETURNS VOID
AS
$$
DECLARE
    left_to_do INT := 1;
    counter INT := 1;
    row record;
BEGIN
    DROP TABLE IF EXISTS t;
    CREATE TEMP TABLE t (
        node1 INT,
        prev INT,
        next INT
    );
    -- Repeat until a full pass changes no color.
    WHILE left_to_do > 0
    LOOP
        -- For each node, compare its own color with the largest color among its neighbours.
        WITH joined_table AS (
            SELECT roads.node1,
                   MAX(v1.color) AS prev,
                   MAX(v2.color) AS next
            FROM roads
            JOIN node v1 ON roads.node1 = v1.id
            JOIN node v2 ON roads.node2 = v2.id
            GROUP BY roads.node1
        )
        INSERT INTO t (node1, prev, next)
        SELECT node1,
               prev,
               next
        FROM joined_table
        WHERE prev < next;
        SELECT COUNT(*) INTO left_to_do FROM t;
        -- Nodes with a larger neighbouring color adopt that color.
        UPDATE node AS n
        SET color = t.next
        FROM t
        WHERE n.id = t.node1;
        DELETE FROM t;
        counter := counter + 1;
    END LOOP;
END;
$$
LANGUAGE plpgsql;
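This should work better if the node degree is low compared to the number of nodes. I tested it on a graph with 2.4 million nodes and 24 million edges, and it took about 30-60 minutes with indexes. (For comparison, in C++ it took 2.5 minutes, most of which was spent reading the data from CSV and writing the data to CSV.)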
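For readers who want to see what the loop in merge_node converges to, here is a small Python sketch of the same color-propagation idea. The name `propagate_colors` is hypothetical, and two things are my own reading rather than something the answer states: that roads stores each edge in both directions, and that the number of components is obtained afterwards by counting the distinct colors.

```python
def propagate_colors(edges):
    """Simulate the color-propagation loop on an in-memory edge list.

    edges: iterable of (node1, node2); assumed to contain both directions
    of every undirected edge (an assumption about the roads table).
    Returns (color, rounds): the final color per node and the number of
    passes that changed at least one color.
    """
    edges = list(edges)
    # Like the node table: one row per distinct node1, colored with its own id.
    color = {a: a for a, _ in edges}
    rounds = 0
    while True:
        # Largest neighbouring color per node, based on the colors of this pass.
        best = {}
        for a, b in edges:
            if b in color:  # an inner join would drop edges to unknown nodes
                best[a] = max(best.get(a, color[b]), color[b])
        # Keep only nodes whose own color is smaller (prev < next).
        updates = {a: c for a, c in best.items() if color[a] < c}
        if not updates:
            break  # nothing left to do
        color.update(updates)
        rounds += 1
    return color, rounds


edges = [(1, 2), (2, 1), (3, 4), (4, 3)]
color, rounds = propagate_colors(edges)
print(color)                     # {1: 2, 2: 2, 3: 4, 4: 4}
print(len(set(color.values())))  # 2
```

Note that nodes which never appear in roads (node 5 in the question's example) get no color at all, so they would have to be counted separately. The number of passes grows with the longest chain in a component, which may be part of why the in-database version is much slower than an in-memory one.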