Question

I'm running Postgres 9.6.1 and PostGIS 2.3.0 r15146 and have two tables. geographies may have 150,000,000 rows, paths may have 10,000,000 rows:

CREATE TABLE paths (id uuid NOT NULL, path path NOT NULL, PRIMARY KEY (id))
CREATE TABLE geographies (id uuid NOT NULL, geography geography NOT NULL, PRIMARY KEY (id))

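Given an array/set of ids for table geographies, what is the "best" way of finding all intersecting paths and geographies?

In other words, if an initial geography has a corresponding intersecting path, we also need to find all other geographies that this path intersects. From there, we need to find all other paths that those newly found geographies intersect, and so on until we've found all possible intersections.

The number of initial geography ids (our input) may be anywhere from 0 to 700, with an average around 40. The minimum number of intersections will be 0, the maximum about 1000. The average will likely be around 20, typically fewer than 100 connections.

I've created a function that does this, but I'm new to GIS in PostGIS, and to Postgres in general. I have posted my solution as an answer to this question.

I feel like there should be a more eloquent and faster way of doing this than what I've come up with.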

Answer 1

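Your function can be radically simplified.

Setup

I suggest you convert the column paths.path to data type geography (or at least geometry). path is a native Postgres type and does not play well with PostGIS functions and spatial indexes. You would have to cast path::geometry or path::geometry::geography (resulting in a LINESTRING internally) to make it work with PostGIS functions like ST_Intersects().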
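A one-time conversion could look roughly like the sketch below. It is not from the question; it assumes the existing path values hold longitude/latitude coordinates that are valid for geography, and it reuses the cast chain mentioned above. Expect a full table rewrite under an exclusive lock, which matters with 10,000,000 rows.

```sql
-- Sketch only: rewrites the whole table while holding an ACCESS EXCLUSIVE lock.
BEGIN;

ALTER TABLE paths
   ALTER COLUMN path TYPE geography
   USING path::geometry::geography;  -- native path -> geometry -> geography

COMMIT;
```

Test it on a copy of the table first; coordinates outside the valid longitude/latitude range may make the cast to geography fail.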

My answer is based on these adapted tables:

CREATE TABLE paths (
   id uuid PRIMARY KEY
 , path geography NOT NULL
);

CREATE TABLE geographies (
   id uuid PRIMARY KEY
 , geography geography NOT NULL
 , fk_id text NOT NULL
);

Everything works with data type geometry for both columns as well. geography is generally more exact but also more expensive. Which to use? Read the PostGIS FAQ here.
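To illustrate the difference, here is a minimal sketch with two made-up points on the equator. The geometry version does planar math on the raw coordinates, while the geography version measures on the spheroid and returns meters.

```sql
-- geometry: planar math on raw coordinates, so the result is in degrees here
SELECT ST_Distance('POINT(0 0)'::geometry, 'POINT(1 0)'::geometry);    -- 1

-- geography: measured on the spheroid, result in meters
SELECT ST_Distance('POINT(0 0)'::geography, 'POINT(1 0)'::geography);  -- roughly 111 km
```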

Solution 1: Your function optimized

CREATE OR REPLACE FUNCTION public.function_name(_fk_ids text[])
  RETURNS TABLE(id uuid, type text) AS
$func$
DECLARE
   _row_ct int;
   _loop_ct int := 0;

BEGIN
   CREATE TEMP TABLE _geo ON COMMIT DROP AS  -- dropped at end of transaction
   SELECT DISTINCT ON (g.id) g.id, g.geography, _loop_ct AS loop_ct -- dupes possible?
   FROM   geographies g
   WHERE  g.fk_id = ANY(_fk_ids);

   GET DIAGNOSTICS _row_ct = ROW_COUNT;

   IF _row_ct = 0 THEN  -- no rows found, return empty result immediately
      RETURN;           -- exit function
   END IF;

   CREATE TEMP TABLE _path ON COMMIT DROP AS
   SELECT DISTINCT ON (p.id) p.id, p.path, _loop_ct AS loop_ct
   FROM   _geo  g
   JOIN   paths p ON ST_Intersects(g.geography, p.path);  -- no dupes yet

   GET DIAGNOSTICS _row_ct = ROW_COUNT;

   IF _row_ct = 0 THEN  -- no rows found, return _geo immediately
      RETURN QUERY SELECT g.id, text 'geo' FROM _geo g;
      RETURN;   
   END IF;

   ALTER TABLE _geo  ADD CONSTRAINT g_uni UNIQUE (id);  -- required for UPSERT
   ALTER TABLE _path ADD CONSTRAINT p_uni UNIQUE (id);

   LOOP
      _loop_ct := _loop_ct + 1;

      INSERT INTO _geo(id, geography, loop_ct)
      SELECT DISTINCT ON (g.id) g.id, g.geography, _loop_ct
      FROM   _path       p
      JOIN   geographies g ON ST_Intersects(g.geography, p.path)
      WHERE  p.loop_ct = _loop_ct - 1   -- only use last round!
      ON     CONFLICT ON CONSTRAINT g_uni DO NOTHING;  -- eliminate new dupes

      EXIT WHEN NOT FOUND;

      INSERT INTO _path(id, path, loop_ct)
      SELECT DISTINCT ON (p.id) p.id, p.path, _loop_ct
      FROM   _geo  g
      JOIN   paths p ON ST_Intersects(g.geography, p.path)
      WHERE  g.loop_ct = _loop_ct - 1
      ON     CONFLICT ON CONSTRAINT p_uni DO NOTHING;

      EXIT WHEN NOT FOUND;
   END LOOP;

   RETURN QUERY
   SELECT g.id, text 'geo'  FROM _geo g
   UNION ALL
   SELECT p.id, text 'path' FROM _path p;

END
$func$  LANGUAGE plpgsql;

Call:

SELECT * FROM public.function_name('{foo,bar}');

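Much faster than what you have.

Major points

You based your queries on the whole set, instead of just the latest additions to the set. That gets slower with every loop, needlessly. I added a loop counter (loop_ct) to avoid redundant work.
Be sure to have spatial GiST indexes on geographies.geography and paths.path: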
```
CREATE INDEX geo_geo_gix ON geographies USING GIST (geography);
CREATE INDEX paths_path_gix ON paths USING GIST (path);
```
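Since Postgres 9.5, index-only scans are an option for GiST indexes. You might add id as a second index column. The benefit depends on many factors; you would have to test. However, there is no fitting GiST operator class for the uuid type. It would work with bigint after installing the extension btree_gist:
- Postgres multi-column index (integer, boolean, and array)
- Multicolumn index on 3 fields with heterogenous data types
Have a fitting index on g.fk_id, too. Again, a multicolumn index on (fk_id, id, geography) might pay if you can get index-only scans out of it. That is a default btree index, and fk_id must be the first index column. It pays especially if you run the query often, rarely update the table, and table rows are much wider than the index.
You can initialize variables at declaration time. After the rewrite this is only needed once.
ON COMMIT DROP drops the temp tables automatically at the end of the transaction. So I removed the explicit DROP TABLE statements. But you get an exception if you call the function twice within the same transaction. In the function, I would check for the existence of the temp table and use TRUNCATE in that case. Related:
- How to check if a table exists in a given schema
Use GET DIAGNOSTICS to get the row count instead of running another query for the count.
- Count rows affected by DELETE
~~You don't need to count at all after the rewrite. A cheap check of FOUND is enough.~~
Actually, you do need GET DIAGNOSTICS. CREATE TABLE does not set FOUND (as listed in the manual). My original (tested) function had an INSERT, which sets FOUND, hence the oversight. Fixed now.
It is faster to add an index or PK / UNIQUE constraint after filling the table, and not before we actually need it.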
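The existence check for the temp table mentioned above could look roughly like this. It is a sketch, not part of the tested function: it assumes the temp table is named _geo as in the function, and it relies on to_regclass(), which returns NULL when the name does not resolve.

```sql
DO
$do$
BEGIN
   IF to_regclass('pg_temp._geo') IS NULL THEN
      -- first call in this transaction: create an empty table with the right columns
      CREATE TEMP TABLE _geo ON COMMIT DROP AS
      SELECT g.id, g.geography, 0 AS loop_ct
      FROM   geographies g
      LIMIT  0;
   ELSE
      -- table left over from an earlier call in the same transaction: just empty it
      TRUNCATE _geo;
   END IF;
END
$do$;
```

With this pattern the function would fill the table with INSERT instead of CREATE TABLE AS, and the UNIQUE constraint would only have to be added when the table is first created.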

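Since Postgres 9.5, ON CONFLICT ... DO ... is the simpler and cheaper way to UPSERT:

How to UPSERT (MERGE, INSERT ... ON DUPLICATE UPDATE) in PostgreSQL?

For the simple form of the command, you just list index columns or expressions (like ON CONFLICT (id) DO ...) and let Postgres perform unique index inference to determine an arbiter constraint or index. I later optimized by providing the constraint directly. But for that we need an actual constraint; a unique index is not enough. Fixed accordingly. Details in the manual here.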
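The two forms side by side, as a minimal sketch on a throwaway temp table (demo_ids and demo_uni are made-up names for illustration):

```sql
CREATE TEMP TABLE demo_ids (
   id int
 , CONSTRAINT demo_uni UNIQUE (id)
);

-- Form 1: name the column(s) and let Postgres infer the arbiter index.
INSERT INTO demo_ids (id) VALUES (1), (2)
ON CONFLICT (id) DO NOTHING;

-- Form 2: name the constraint directly, as the function above does.
INSERT INTO demo_ids (id) VALUES (2), (3)
ON CONFLICT ON CONSTRAINT demo_uni DO NOTHING;

SELECT id FROM demo_ids ORDER BY id;  -- 1, 2, 3
```

The duplicate value 2 in the second INSERT is silently skipped rather than raising a unique violation.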

It may help to ANALYZE the temp tables manually, so Postgres can find the best query plan. (But I don't think you need it in your case.)

Are regular VACUUM ANALYZE still recommended under 9.1?
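If you want to try it: autovacuum does not process temporary tables, so statistics for them only exist after an explicit ANALYZE. A sketch, assuming the temp table names from the function above; the statements would go inside the function body, right after the two initial CREATE TEMP TABLE ... AS statements:

```sql
-- Collect planner statistics for the freshly filled temp tables.
ANALYZE _geo;
ANALYZE _path;
```

Whether the extra cost pays off depends on the size of the intermediate sets, so measure before keeping it.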

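_geo_ct - _geographyLength > 0 is an awkward and more expensive way of saying _geo_ct > _geographyLength. But that is gone completely now.

Don't quote the language name. Just LANGUAGE plpgsql.

Your function parameter is varchar[] for an array of fk_id, but you later commented:

It is a bigint field that represents a geographic area (it's actually a precomputed s2cell id at level 15).

I don't know s2cell ids at level 15, but ideally you pass an array of the matching data type, or, if that is not an option, default to text[].

Also, since you commented:

There are always exactly 13 fk_ids passed in.

This seems like a perfect use case for a VARIADIC function parameter. So your function definition would be: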

CREATE OR REPLACE FUNCTION public.function_name(VARIADIC _fk_ids text[]) ...

Details:

Pass multiple values in single parameter
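A minimal sketch of the two calling conventions for a VARIADIC parameter. count_args is a made-up toy function, created in the temporary schema so it disappears with the session; it is not part of the solution.

```sql
-- Hypothetical toy function, only to show how VARIADIC parameters are called.
CREATE OR REPLACE FUNCTION pg_temp.count_args(VARIADIC _vals text[])
  RETURNS int
  LANGUAGE sql AS
$func$
SELECT cardinality(_vals);
$func$;

SELECT pg_temp.count_args('foo', 'bar', 'baz');           -- plain list of values: 3
SELECT pg_temp.count_args(VARIADIC '{foo,bar}'::text[]);  -- ready-made array: 2
```

Note that the keyword VARIADIC goes before the parameter name in the definition, and is repeated in the call only when passing an actual array.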

Solution 2: Plain SQL with recursive CTE

It is hard to wrap an rCTE around two alternating loops, but possible with some SQL finesse:

WITH RECURSIVE cte AS (
   SELECT g.id, g.geography::text, NULL::text AS path, text 'geo' AS type
   FROM   geographies g
   WHERE  g.fk_id = ANY($fk_ids)  -- your input array here

   UNION
   SELECT COALESCE(p.id, g.id), g.geography::text, p.path::text  -- id from whichever table matched
        , CASE WHEN p.path IS NULL THEN 'geo' ELSE 'path' END AS type
   FROM   cte              c
   LEFT   JOIN paths       p ON c.type = 'geo'
                            AND ST_Intersects(c.geography::geography, p.path)
   LEFT   JOIN geographies g ON c.type = 'path'
                            AND ST_Intersects(g.geography, c.path::geography)
   WHERE  (p.path IS NOT NULL OR g.geography IS NOT NULL)
   )
SELECT id, type FROM cte;

That's all. You need the same indexes as above. You might wrap it into an SQL or PL/pgSQL function for repeated use.
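A wrapper could look like the sketch below. The function name intersecting_ids is made up, the tables are the adapted ones from the setup, and the COALESCE on id is my assumption so that geographies found via a path carry their own id rather than NULL. Untested against your data.

```sql
CREATE OR REPLACE FUNCTION public.intersecting_ids(_fk_ids text[])  -- hypothetical name
  RETURNS TABLE (id uuid, type text)
  LANGUAGE sql STABLE AS
$func$
WITH RECURSIVE cte AS (
   -- seed: geographies selected by the input array
   SELECT g.id, g.geography::text, NULL::text AS path, text 'geo' AS type
   FROM   geographies g
   WHERE  g.fk_id = ANY(_fk_ids)

   UNION  -- not UNION ALL: duplicates must be dropped for the recursion to stop
   SELECT COALESCE(p.id, g.id), g.geography::text, p.path::text
        , CASE WHEN p.path IS NULL THEN 'geo' ELSE 'path' END
   FROM   cte              c
   LEFT   JOIN paths       p ON c.type = 'geo'
                            AND ST_Intersects(c.geography::geography, p.path)
   LEFT   JOIN geographies g ON c.type = 'path'
                            AND ST_Intersects(g.geography, c.path::geography)
   WHERE  (p.path IS NOT NULL OR g.geography IS NOT NULL)
   )
SELECT c.id, c.type FROM cte c;
$func$;
```

The call is the same as for the PL/pgSQL variant: SELECT * FROM public.intersecting_ids('{foo,bar}');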

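Major additional points

The cast to text is necessary because the geography type is not "hashable" (same for geometry). (See this open PostGIS issue for details.) Work around it by casting to text. Rows are unique by virtue of (id, type) alone, so we can ignore the geography columns for this. Cast back to geography for the join. That shouldn't cost too much extra.

We need two LEFT JOIN so as not to exclude rows, because at each iteration only one of the two tables may contribute more rows.
The final condition makes sure we are not done yet: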

WHERE (p.path IS NOT NULL OR g.geography IS NOT NULL)

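This works because duplicate findings are excluded from the temporary intermediate table. The manual:

For UNION (but not UNION ALL), discard duplicate rows and rows that duplicate any previous result row. Include all remaining rows in the result of the recursive query, and also place them in a temporary intermediate table.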
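The same mechanism can be seen without PostGIS in a toy example. The sketch below walks a made-up three-node graph that contains a cycle; the recursion still ends, because rows that duplicate earlier results never reach the intermediate table.

```sql
WITH RECURSIVE edges(a, b) AS (
   VALUES (1, 2), (2, 3), (3, 1)      -- toy graph with a cycle: 1 -> 2 -> 3 -> 1
   )
, walk AS (
   SELECT 1 AS node                   -- start node
   UNION                              -- with UNION ALL this would never terminate
   SELECT e.b
   FROM   walk  w
   JOIN   edges e ON e.a = w.node
   )
SELECT node FROM walk ORDER BY node;  -- 1, 2, 3
```

The third iteration finds node 1 again, discards it as a duplicate, and the recursion stops on an empty working table.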

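So which is faster?

The rCTE is probably faster than the function for small result sets. The temp tables and indexes in the function mean considerably more overhead. For large result sets the function may be faster, though. Only testing with your actual setup can give you a definitive answer.*
*See the OP's feedback in the comment.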

Answer 2

I figured it would be good to post my own solution here, even if it isn't optimal.

Here is what I came up with (using Steve Chambers' advice):

CREATE OR REPLACE FUNCTION public.function_name(
    _fk_ids character varying[])
    RETURNS TABLE(id uuid, type character varying)
    LANGUAGE 'plpgsql'
    COST 100.0
    VOLATILE
    ROWS 1000.0
AS $function$

    DECLARE
        _pathLength bigint;
        _geographyLength bigint;

        _currentPathLength bigint;
        _currentGeographyLength bigint;
    BEGIN
        DROP TABLE IF EXISTS _pathIds;
        DROP TABLE IF EXISTS _geographyIds;
        CREATE TEMPORARY TABLE _pathIds (id UUID PRIMARY KEY);
        CREATE TEMPORARY TABLE _geographyIds (id UUID PRIMARY KEY);

        -- get all geographies in the specified _fk_ids
        INSERT INTO _geographyIds
            SELECT g.id
            FROM geographies g
            WHERE g.fk_id= ANY(_fk_ids);

        _pathLength := 0;
        _geographyLength := 0;
        _currentPathLength := 0;
        _currentGeographyLength := (SELECT COUNT(_geographyIds.id) FROM _geographyIds);
        -- _pathIds := ARRAY[]::uuid[];

        WHILE (_currentPathLength - _pathLength > 0) OR (_currentGeographyLength - _geographyLength > 0) LOOP
            _pathLength := (SELECT COUNT(_pathIds.id) FROM _pathIds);
            _geographyLength := (SELECT COUNT(_geographyIds.id) FROM _geographyIds);

            -- gets all paths that have paths that intersect the geographies that aren't in the current list of path ids

            INSERT INTO _pathIds 
                SELECT DISTINCT p.id
                    FROM paths p
                    JOIN geographies g ON ST_Intersects(g.geography, p.path)
                    WHERE
                        g.id IN (SELECT _geographyIds.id FROM _geographyIds) AND
                        p.id NOT IN (SELECT _pathIds.id from _pathIds);

            -- gets all geographies that have paths that intersect the paths that aren't in the current list of geography ids
            INSERT INTO _geographyIds 
                SELECT DISTINCT g.id
                    FROM geographies g
                    JOIN paths p ON ST_Intersects(g.geography, p.path)
                    WHERE
                        p.id IN (SELECT _pathIds.id FROM _pathIds) AND
                        g.id NOT IN (SELECT _geographyIds.id FROM _geographyIds);

            _currentPathLength := (SELECT COUNT(_pathIds.id) FROM _pathIds);
            _currentGeographyLength := (SELECT COUNT(_geographyIds.id) FROM _geographyIds);
        END LOOP;

        RETURN QUERY
            SELECT _geographyIds.id, 'geography' AS type FROM _geographyIds
            UNION ALL
            SELECT _pathIds.id, 'path' AS type FROM _pathIds;
    END;

$function$;

Answer 3

Sample plot and data from this script

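It can be done purely relationally with an aggregate function. This implementation uses one path table and one point table. Both are geometries. Points are easier to create test data for and to test with than a generic geography, but it should be simple to adapt.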

create table path (
    path_text text primary key,
    path geometry(linestring) not null
);
create table point (
   point_text text primary key,
   point geometry(point) not null
);

The type that holds the aggregate function's state:

create type mpath_mpoint as (
    mpath geometry(multilinestring),
    mpoint geometry(multipoint)
);

The state building function:

create or replace function path_point_intersect (
    _i mpath_mpoint[], _e mpath_mpoint
) returns mpath_mpoint[] as $$

    with e as (select (e).mpath, (e).mpoint from (values (_e)) e (e)),
    i as (select mpath, mpoint from unnest(_i) i (mpath, mpoint))
    select array_agg((mpath, mpoint)::mpath_mpoint)
    from (
        select
            st_multi(st_union(i.mpoint, e.mpoint)) as mpoint,
            (
                select st_collect(gd)
                from (
                    select gd from st_dump(i.mpath) a (a, gd)
                    union all
                    select gd from st_dump(e.mpath) b (a, gd)
                ) s
            ) as mpath
        from i inner join e on st_intersects(i.mpoint, e.mpoint)

        union all
        select i.mpoint, i.mpath
        from i inner join e on not st_intersects(i.mpoint, e.mpoint)

        union all
        select e.mpoint, e.mpath
        from e
        where not exists (
            select 1 from i
            where st_intersects(i.mpoint, e.mpoint)
        )
    ) s;
$$ language sql;

The aggregate:

create aggregate path_point_agg (mpath_mpoint) (
    sfunc = path_point_intersect,
    stype = mpath_mpoint[]
);

This query returns a set of (multilinestring, multipoint) rows, each containing the matching paths and points:

select st_astext(mpath), st_astext(mpoint)
from unnest((
    select path_point_agg((st_multi(path), st_multi(mpoint))::mpath_mpoint)
    from (
        select path, st_union(point) as mpoint
        from
            path 
            inner join
            point on st_intersects(path, point)
        group by path
    ) s
)) m (mpath, mpoint)
;
                         st_astext                         |          st_astext          
-----------------------------------------------------------+-----------------------------
 MULTILINESTRING((-10 0,10 0,8 3),(0 -10,0 10),(2 1,4 -1)) | MULTIPOINT(0 0,0 5,3 0,5 0)
 MULTILINESTRING((-9 -8,4 -8),(-8 -9,-8 6))                | MULTIPOINT(-8 -8,2 -8)
 MULTILINESTRING((-7 -4,-3 4,-5 6))                        | MULTIPOINT(-6 -2)

How to recursively find intersecting geographies between two tables
