Top-down tree postgres

Time: 2017-06-19 00:13:51

Tags: postgresql common-table-expression recursive-query

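I'm trying to write a query that produces a list of all nodes in a tree given a root, along with the path (built from the names parents give their children) taken to reach each node. The recursive CTE I have working is a textbook CTE straight from the docs; however, getting the paths to work in this scenario has proven difficult.

Following the git model, names are given to children by their parents, as a result of the paths created by traversing the tree. This implies a map of names to child ids, like git's tree structure.

I've been looking online for a recursive-query solution, but they all seem to use parent ids or materialized paths, both of which would break the structural sharing that Rich Hickey's "database as value" talk is all about.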

Current implementation

Imagine the objects table is dead simple (for simplicity's sake, let's assume integer ids):

drop table if exists objects;
create table objects (
    id INT,
    data jsonb
);

--       A
--     /   \
--    B     C
--   /   \    \
--  D     E    F

INSERT INTO objects (id, data) VALUES
  (1, '{"content": "data for f"}'),  -- F
  (2, '{"content": "data for e"}'),  -- E
  (3, '{"content": "data for d"}'),  -- D
  (4, '{"nodes":{"f":{"id":1}}}'),               -- C
  (5, '{"nodes":{"d":{"id":3}, "e":{"id":2}}}'), -- B
  (6, '{"nodes":{"b":{"id":5}, "c":{"id":4}}}')  -- A
  ;

drop table if exists work_tree;
create table work_tree (
    id INT NOT NULL,
    path text,
    ref text,
    data jsonb,
    primary key (ref, id) -- TODO change to ref, path
);

create or replace function get_nested_ids_array(data jsonb) returns int[] as $$
  select array_agg((value->>'id')::int) as nested_id
  from jsonb_each(data->'nodes')
$$ LANGUAGE sql STABLE;

create or replace function checkout(root_id int, ref text) returns void as $$
  with recursive nodes(id, nested_ids, data) AS (
      select id, get_nested_ids_array(data), data
      from objects
      where id = root_id
      union
      select child.id, get_nested_ids_array(child.data), child.data
      from objects child, nodes parent
      where child.id = ANY(parent.nested_ids)
  )
  INSERT INTO work_tree (id, data, ref)
  select id, data, ref from nodes
$$ language sql VOLATILE;

SELECT * FROM checkout(6, 'master');
SELECT * FROM work_tree;

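If you are familiar with git, the data property of these objects looks similar to git blobs/trees: it either maps names to ids or stores content. So imagine you want to create an index: after a "checkout", you need to query for the list of nodes, and potentially the paths, to produce a working tree or index:

Current output: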

id    path    ref       data
6     NULL    master    {"nodes":{"b":{"id":5}, "c":{"id":4}}}
5     NULL    master    {"nodes":{"d":{"id":3}, "e":{"id":2}}}
4     NULL    master    {"nodes":{"f":{"id":1}}}
3     NULL    master    {"content": "data for d"}
2     NULL    master    {"content": "data for e"}
1     NULL    master    {"content": "data for f"}

Desired output:

id    path    ref       data
6     /       master    {"nodes":{"b":{"id":5}, "c":{"id":4}}}
5     /b      master    {"nodes":{"d":{"id":3}, "e":{"id":2}}}
4     /c      master    {"nodes":{"f":{"id":1}}}
3     /b/d    master    {"content": "data for d"}
2     /b/e    master    {"content": "data for e"}
1     /c/f    master    {"content": "data for f"}

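What's the best way to aggregate path in this scenario? I'm aware that I'm compressing the information (dropping the names) when I call get_nested_ids_array in the recursive query, so I'm not sure how to aggregate the path properly with the CTE in this top-down approach.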
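
As a rough illustration of the traversal being asked for, here is a minimal Python sketch rather than a SQL answer. It assumes the objects rows are copied into a dictionary (the OBJECTS name and the checkout helper are hypothetical), with each name pointing at the object whose content matches it. The point it tries to show is that the path can only be built if each step carries (name, child id) pairs forward instead of bare ids. In SQL this would presumably mean expanding jsonb_each(data->'nodes') inside the recursive term rather than collapsing it into an int[] first; that translation is not shown or tested here.

```python
from collections import deque

# Hypothetical in-memory copy of the objects table: id -> data.
# Tree nodes map child names to ids; leaves store content.
OBJECTS = {
    1: {"content": "data for f"},
    2: {"content": "data for e"},
    3: {"content": "data for d"},
    4: {"nodes": {"f": {"id": 1}}},                  # C
    5: {"nodes": {"d": {"id": 3}, "e": {"id": 2}}},  # B
    6: {"nodes": {"b": {"id": 5}, "c": {"id": 4}}},  # A
}


def checkout(objects, root_id):
    """Return (id, path) pairs for every node reachable from root_id."""
    rows = []
    queue = deque([(root_id, "/")])  # the root has no name of its own
    while queue:
        node_id, path = queue.popleft()
        rows.append((node_id, path))
        # Carry (name, child id) pairs forward, so the name chosen by the
        # parent is still available when the child's path is built.
        for name, ref in objects[node_id].get("nodes", {}).items():
            queue.append((ref["id"], path.rstrip("/") + "/" + name))
    return rows


rows = checkout(OBJECTS, 6)
for node_id, path in rows:
    print(node_id, path)
```

The breadth-first queue plays the role of the CTE's working table: each pass reads the rows produced by the previous one.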

EDIT: Use case for children ids

To explain why I need to use children ids instead of parent ids:

Imagine a data structure like this:

      A
    /   \
   B     C
 /   \    \
D     E    F

If you make a modification to F, you only add a new root A' plus the child nodes C' and F', leaving the old tree intact:

     A'    A
   /   \ /   \
  C'    B     C
 /     /   \    \
F'    D     E    F

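If you make a deletion, you only add a new root A" that points only to B, and you still have A if you ever need to time travel (and they share the same objects, just like git!):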

 A"  A
  \ /   \
   B     C
 /   \    \
D     E    F

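So it seems that the best way to achieve this is with children ids, so that children can have multiple parents, across time and space! If you think there's another way to achieve this, by all means, let me know!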
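
The two cases above can be sketched in a few lines of Python. This is only an illustration under assumptions: the store dictionary and the put, update and remove helpers are hypothetical names, and ids are handed out as max + 1 purely for the demo. Objects are never modified in place; an edit copies only the nodes on the path from the root to the change and reuses everything else by id.

```python
def put(store, data):
    """Add an object under a fresh id; existing objects are never modified."""
    new_id = max(store, default=0) + 1
    store[new_id] = data
    return new_id


def update(store, node_id, path, content):
    """Return the id of a new version of node_id with new content at path.

    Only the nodes along path are copied; all other children are reused by id.
    """
    if not path:
        return put(store, {"content": content})
    name, rest = path[0], path[1:]
    children = dict(store[node_id]["nodes"])  # copy the name -> id map
    children[name] = {"id": update(store, children[name]["id"], rest, content)}
    return put(store, {"nodes": children})


def remove(store, node_id, name):
    """Return the id of a new version of node_id without the child called name."""
    children = {k: v for k, v in store[node_id]["nodes"].items() if k != name}
    return put(store, {"nodes": children})


store = {
    1: {"content": "data for f"},
    2: {"content": "data for e"},
    3: {"content": "data for d"},
    4: {"nodes": {"f": {"id": 1}}},                  # C
    5: {"nodes": {"d": {"id": 3}, "e": {"id": 2}}},  # B
    6: {"nodes": {"b": {"id": 5}, "c": {"id": 4}}},  # A
}

a_prime = update(store, 6, ["c", "f"], "new data for f")  # adds F', C', A'
a_second = remove(store, 6, "c")                          # adds A" only
print(store[a_prime])   # B is still referenced by its old id
print(store[a_second])
```

After both calls the old root 6 is untouched, and B (id 5) is referenced by A, A' and A" alike.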

EDIT #2: Case for not using parent_ids

Using parent_ids has a cascading effect that requires editing the entire tree. For example,

      A
    /   \
   B     C
 /   \    \
D     E    F

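If you make a modification to F, you still need a new root A' to maintain immutability. If we use parent_ids, that means B and C now both have a new parent. Hence you can see how the change ripples through the entire tree, requiring every node to be touched: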

      A              A' 
    /   \          /   \
   B     C        B'     C'
 /   \    \      /   \    \
D     E    F    D'    E'   F'
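
To make the ripple concrete, here is a small Python sketch under assumed names (ROWS and copy_tree are hypothetical, and the change to F's content itself is left out). It only counts how many rows a new immutable version needs when every row stores its parent's id.

```python
# Hypothetical parent-pointer rows: id -> (parent_id, label); 0 marks the root.
ROWS = {
    6: (0, "a"),
    5: (6, "b"),
    4: (6, "c"),
    3: (5, "d"),
    2: (5, "e"),
    1: (4, "f"),
}


def copy_tree(rows, root_id):
    """Build the rows for a new, independent version of the tree.

    Returns {new id: (new parent id, label)}. Each copy must point at the
    copy of its parent, so no row of the old version can be reused.
    """
    next_id = max(rows) + 1
    copies = {}
    stack = [(root_id, 0)]  # (old id, id of the already-copied parent)
    while stack:
        old_id, new_parent = stack.pop()
        new_id, next_id = next_id, next_id + 1
        copies[new_id] = (new_parent, rows[old_id][1] + "'")
        # Children are found by scanning for rows that point at old_id.
        for child_id, (parent_id, _label) in rows.items():
            if parent_id == old_id:
                stack.append((child_id, new_id))
    return copies


copies = copy_tree(ROWS, 6)
print(len(copies), "new rows for a one-leaf change")
```

All six nodes get copied here, whereas with children ids the same edit to F adds only A', C' and F'.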

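EDIT #3: Use case for parents giving names to children

We can make a recursive query where the objects store their own names, but the question I'm asking is specifically about constructing a path where the names are given to children by their parents. This models a data structure similar to a git tree. For example, in the git graph pictured below, the 3rd commit contains a tree (folder) bak that points to the original root, which represents a folder of all the files at the 1st commit. If that root object had its own name, this could not be done simply by adding a ref. That's the beauty of git: it's as simple as making a reference to a hash and giving it a name.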

[image: git graph]

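This is the relationship I am setting up, which is why the jsonb data structure exists: it provides a mapping from a name to an id (a hash in git's case). I know it's not ideal, but it does the job of providing the hash map. If there's another way to create this mapping of names to ids, and thus a way for parents to give names to children in a top-down tree, I'm all ears!

Any help is appreciated!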
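
A tiny Python sketch of the bak case, under assumed data: the objects dictionary and the walk helper are hypothetical, with object 4 standing in for the original root and object 7 for a later root that mounts it as bak. Because names live in the parent, the same object can be reached under more than one path.

```python
def walk(objects, node_id, path="/"):
    """Yield (id, path) for node_id and everything below it, depth first."""
    yield node_id, path
    for name, ref in objects[node_id].get("nodes", {}).items():
        yield from walk(objects, ref["id"], path.rstrip("/") + "/" + name)


# Hypothetical store: object 4 was the root at the first commit; object 7 is
# a later root that still holds f and also mounts the old root as "bak".
objects = {
    1: {"content": "data for f"},
    4: {"nodes": {"f": {"id": 1}}},
    7: {"nodes": {"f": {"id": 1}, "bak": {"id": 4}}},
}

paths = list(walk(objects, 7))
for node_id, path in paths:
    print(node_id, path)
```

Object 1 shows up as both /f and /bak/f, which is presumably why the work_tree primary key carries the TODO about switching from (ref, id) to (ref, path).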

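2 Answers:

Answer 0: (score: 3)

Store the parent of a node instead of its children. It is a simpler and cleaner solution, and you do not need structured data types.

This is an example model with data equivalent to those in the question: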

create table objects (
    id int primary key,
    parent_id int,
    label text,
    content text);

insert into objects values
(1, 4, 'f', 'data for f'),
(2, 5, 'e', 'data for e'),
(3, 5, 'd', 'data for d'),
(4, 6, 'c', ''),
(5, 6, 'b', ''),
(6, 0, 'a', '');

A recursive query:

with recursive nodes(id, path, content) as (
    select id, label, content
    from objects
    where parent_id = 0
union all
    select o.id, concat(path, '->', label), o.content
    from objects o
    join nodes n on n.id = o.parent_id
)
select *
from nodes
order by id desc;

 id |  path   |  content   
----+---------+------------
  6 | a       | 
  5 | a->b    | 
  4 | a->c    | 
  3 | a->b->d | data for d
  2 | a->b->e | data for e
  1 | a->c->f | data for f
(6 rows)

The variant with children_ids:

drop table if exists objects;
create table objects (
    id int primary key,
    children_ids int[],
    label text,
    content text);
insert into objects values
(1, null, 'f', 'data for f'),
(2, null, 'e', 'data for e'),
(3, null, 'd', 'data for d'),
(4, array[1], 'c', ''),
(5, array[2,3], 'b', ''),
(6, array[4,5], 'a', '');

with recursive nodes(id, children, path, content) as (
    select id, children_ids, label, content
    from objects
    where id = 6
union all
    select o.id, o.children_ids, concat(path, '->', label), o.content
    from objects o
    join nodes n on o.id = any(n.children)
)
select *
from nodes
order by id desc;

 id | children |  path   |  content   
----+----------+---------+------------
  6 | {4,5}    | a       | 
  5 | {2,3}    | a->b    | 
  4 | {1}      | a->c    | 
  3 |          | a->b->d | data for d
  2 |          | a->b->e | data for e
  1 |          | a->c->f | data for f
(6 rows)

Answer 1: (score: 2)

@klin's excellent answer inspired me to experiment with PostgreSQL, trees (paths), and recursive CTE! :-D

Preamble: my motivation is storing data in PostgreSQL, but visualizing those data in a graph. While the approach here has limitations (e.g. undirected edges; ...), it may otherwise be useful in other contexts.

Here, I adapted @klin's code to enable the CTE without depending on the table id for the joins, though I do use the ids to deal with the issue of loops in the data, e.g.

a,b
b,a

which throw the CTE into a non-terminating loop.

To solve that, I employed the rather brilliant approach suggested by @a-horse-with-no-name in SO 31739150 -- see my comments in the script, below.

PSQL script ("tree_with_paths.sql"):

--         File: /mnt/Vancouver/Programming/data/metabolism/practice/sql/tree_with_paths.sql
-- Adapted from: https://stackoverflow.com/questions/44620695/recursive-path-aggregation-and-cte-query-for-top-down-tree-postgres
--     See also: /mnt/Vancouver/FC/RDB - PostgreSQL/Recursive CTE - Graph Algorithms in a Database Recursive CTEs and Topological Sort with Postgres.pdf
--               https://www.fusionbox.com/blog/detail/graph-algorithms-in-a-database-recursive-ctes-and-topological-sort-with-postgres/620/

-- Run this script in psql, at the psql# prompt:
--    \! cd /mnt/Vancouver/Programming/data/metabolism/practice/sql/
--    \i /mnt/Vancouver/Programming/data/metabolism/practice/sql/tree_with_paths.sql

\c practice

DROP TABLE IF EXISTS tree;

CREATE TABLE tree (
  -- id int primary key
  id SERIAL PRIMARY KEY
  ,s TEXT    -- s: source node
  ,t TEXT    -- t: target node
  ,UNIQUE(s, t)
);

INSERT INTO tree(s, t) VALUES
  ('a','b')
  ,('b','a')     -- << adding this 'back relation' breaks CTE_1 below, as it enters a loop and cannot terminate
  ,('b','c')
  ,('b','d')
  ,('c','e')
  ,('d','e')
  ,('e','f')
  ,('f','g')
  ,('g','h')
  ,('c','h');

SELECT * FROM tree;
-- SELECT s,t FROM tree WHERE s='b';

-- RECURSIVE QUERY 1 (CTE_1):
-- WITH RECURSIVE nodes(src, path, tgt) AS (
--     SELECT s, concat(s, '->', t), t FROM tree WHERE s = 'a'
--     -- SELECT s, concat(s, '->', t), t FROM tree WHERE s = 'c'
-- UNION ALL
--     SELECT t.s, concat(path, '->', t), t.t FROM tree t
--     JOIN nodes n ON n.tgt = t.s
-- )
-- -- SELECT * FROM nodes;
-- SELECT path FROM nodes;


-- RECURSIVE QUERY 2 (CTE_2):
-- Deals with "loops" in Postgres data, per
-- https://stackoverflow.com/questions/31739150/to-find-infinite-recursive-loop-in-cte
-- "With Postgres it's quite easy to prevent this by collecting all visited nodes in an array."
WITH RECURSIVE nodes(id, src, path, tgt) AS (
    SELECT id, s, concat(s, '->', t), t
    ,array[id] as all_parent_ids
     FROM tree WHERE s = 'a'
UNION ALL
    SELECT t.id, t.s, concat(path, '->', t.t), t.t, all_parent_ids||t.id FROM tree t
    JOIN nodes n ON n.tgt = t.s
    AND t.id <> ALL (all_parent_ids)    -- this is the trick to exclude the endless loops
)
-- SELECT * FROM nodes;
SELECT path FROM nodes;

Script execution / output (PSQL):

# \i tree_with_paths.sql
You are now connected to database "practice" as user "victoria".
DROP TABLE
CREATE TABLE
INSERT 0 10
 id | s | t 
----+---+---
  1 | a | b
  2 | b | a
  3 | b | c
  4 | b | d
  5 | c | e
  6 | d | e
  7 | e | f
  8 | f | g
  9 | g | h
 10 | c | h

        path         
---------------------
 a->b
 a->b->a
 a->b->c
 a->b->d
 a->b->c->e
 a->b->d->e
 a->b->c->h
 a->b->d->e->f
 a->b->c->e->f
 a->b->c->e->f->g
 a->b->d->e->f->g
 a->b->d->e->f->g->h
 a->b->c->e->f->g->h
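
For readers who want to check the result outside the database, here is a rough Python sketch of the same idea. EDGES and all_paths are hypothetical names; the rows mirror the tree table above. It is not a translation of how Postgres evaluates the CTE, only the same rule: extend each path by one edge per round and never reuse an edge id already on that path.

```python
# Hypothetical copy of the tree table: (id, s, t).
EDGES = [
    (1, "a", "b"), (2, "b", "a"), (3, "b", "c"), (4, "b", "d"), (5, "c", "e"),
    (6, "d", "e"), (7, "e", "f"), (8, "f", "g"), (9, "g", "h"), (10, "c", "h"),
]


def all_paths(edges, start):
    """List every path from start, round by round, never reusing an edge id."""
    # Non-recursive term: edges leaving the start node.
    level = [(f"{s}->{t}", t, (eid,)) for eid, s, t in edges if s == start]
    paths = []
    while level:
        paths.extend(path for path, _tgt, _used in level)
        # Recursive term: join the previous round on tgt = s, skipping any
        # edge whose id is already recorded for that path.
        level = [
            (f"{path}->{t}", t, used + (eid,))
            for path, tgt, used in level
            for eid, s, t in edges
            if s == tgt and eid not in used
        ]
    return paths


paths_from_a = all_paths(EDGES, "a")
print(len(paths_from_a))
print("\n".join(paths_from_a))
```

Note that the check is per edge, not per node, which is why a->b->a is still listed once. Row order within a round may differ from the Postgres output.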

[image: NetworkX visualization]

You can change the starting node (e.g. start at node "d") in the SQL script -- giving, e.g.:

# \i tree_with_paths.sql
...
     path      
---------------
 d->e
 d->e->f
 d->e->f->g
 d->e->f->g->h

Network visualization:

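I exported those data (at the PSQL prompt) to a CSV. The export below has 9 rows, so it appears to have been taken with the ('b','a') back relation removed: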

# \copy (SELECT s, t FROM tree) TO '/tmp/tree.csv' WITH CSV
  COPY 9

# \! cat /tmp/tree.csv
  a,b
  b,c
  b,d
  c,e
  d,e
  e,f
  f,g
  g,h
  c,h

... which I visualized (image above) in a Python 3.5 venv:

>>> import networkx as nx
>>> import matplotlib.pyplot as plt
>>> G = nx.read_edgelist("/tmp/tree.csv", delimiter=",")
>>> G.nodes()
['b', 'a', 'd', 'f', 'c', 'h', 'g', 'e']
>>> G.edges()
[('b', 'a'), ('b', 'd'), ('b', 'c'), ('d', 'e'), ('f', 'g'), ('f', 'e'), ('c', 'e'), ('c', 'h'), ('h', 'g')]
>>> G.number_of_nodes()
8
>>> G.number_of_edges()
9
>>> from networkx.drawing.nx_agraph import graphviz_layout
## There is a bug in Python or NetworkX: you may need to run this
## command 2x, as you may get an error the first time:
>>> nx.draw(G, pos=graphviz_layout(G), node_size=1200, node_color='lightblue', linewidths=0.25, font_size=10, font_weight='bold', with_labels=True)
>>> plt.show()
>>> nx.dijkstra_path(G, 'a', 'h')
  ['a', 'b', 'c', 'h']
>>> nx.dijkstra_path(G, 'a', 'f')
  ['a', 'b', 'd', 'e', 'f']

Note that nx.dijkstra_path returns a single shortest path, one of several possible paths between the two nodes, whereas the Postgres CTE returns all paths from the start node in a visually appealing manner.