How to recursively find intersecting geographies between two tables

Time: 2017-01-27 17:19:21

Tags: sql postgresql gis postgis sqlgeography

I am running Postgres 9.6.1 and PostGIS 2.3.0 r15146 and have two tables: geographies, which may have 150,000,000 rows, and paths, which may have 10,000,000 rows:

CREATE TABLE paths (id uuid NOT NULL, path path NOT NULL, PRIMARY KEY (id));
CREATE TABLE geographies (id uuid NOT NULL, geography geography NOT NULL, PRIMARY KEY (id));

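Given an array of ids for the geographies table, what is the "best" way to find all intersecting paths and geographies?

In other words, if an initial geography has a corresponding intersecting path, we also need to find all the other geographies that this path intersects. From there, we need to find all the other paths that these newly found geographies intersect, and so on, until we have found all possible intersections.

The initial geography ids (our input) may number anywhere from 0 to 700, with an average of around 40. The minimum number of intersections will be 0 and the maximum about 1,000. The average will likely be around 20, and typically fewer than 100 are connected.

I have created a function that does this, but I am new to GIS in PostGIS, and to Postgres in general. I have posted my solution as an answer to this question.

I feel like there should be a more elegant and faster way than what I came up with.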

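3 Answers:

Answer 0: (score: 9)

Your function can be radically simplified.

Setup

I suggest you convert the column paths.path to data type geography (or at least geometry). path is a native Postgres type and does not play well with PostGIS functions and spatial indexes. You would have to cast path::geometry or path::geometry::geography (resulting in a LINESTRING internally) to make it work with PostGIS functions like ST_Intersects().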
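A minimal sketch of such a conversion, assuming the existing path values hold longitude/latitude coordinates that are valid for SRID 4326 (an assumption, the question does not state the coordinate system). The statement rewrites the whole table under an exclusive lock, so on roughly 10,000,000 rows it is worth trying on a copy first:

```sql
-- Sketch only: assumes paths.path stores lon/lat values (SRID 4326).
BEGIN;

ALTER TABLE paths
   ALTER COLUMN path TYPE geography
   USING ST_SetSRID(path::geometry, 4326)::geography;  -- native path -> LINESTRING -> geography

COMMIT;
```

Spatial indexes on the column have to be created after the type change, since a native path column cannot carry a PostGIS GiST index.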

My answer is based on these adapted tables:

CREATE TABLE paths (
   id uuid PRIMARY KEY
 , path geography NOT NULL
);

CREATE TABLE geographies (
   id uuid PRIMARY KEY
 , geography geography NOT NULL
 , fk_id text NOT NULL
);

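Everything works just as well with data type geometry for both columns. geography is generally more exact, but also more expensive. Which to use? Read the PostGIS FAQ here.

Solution 1: Your function optimized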

CREATE OR REPLACE FUNCTION public.function_name(_fk_ids text[])
  RETURNS TABLE(id uuid, type text) AS
$func$
DECLARE
   _row_ct int;
   _loop_ct int := 0;

BEGIN
   CREATE TEMP TABLE _geo ON COMMIT DROP AS  -- dropped at end of transaction
   SELECT DISTINCT ON (g.id) g.id, g.geography, _loop_ct AS loop_ct -- dupes possible?
   FROM   geographies g
   WHERE  g.fk_id = ANY(_fk_ids);

   GET DIAGNOSTICS _row_ct = ROW_COUNT;

   IF _row_ct = 0 THEN  -- no rows found, return empty result immediately
      RETURN;           -- exit function
   END IF;

   CREATE TEMP TABLE _path ON COMMIT DROP AS
   SELECT DISTINCT ON (p.id) p.id, p.path, _loop_ct AS loop_ct
   FROM   _geo  g
   JOIN   paths p ON ST_Intersects(g.geography, p.path);  -- no dupes yet

   GET DIAGNOSTICS _row_ct = ROW_COUNT;

   IF _row_ct = 0 THEN  -- no rows found, return _geo immediately
      RETURN QUERY SELECT g.id, text 'geo' FROM _geo g;
      RETURN;   
   END IF;

   ALTER TABLE _geo  ADD CONSTRAINT g_uni UNIQUE (id);  -- required for UPSERT
   ALTER TABLE _path ADD CONSTRAINT p_uni UNIQUE (id);

   LOOP
      _loop_ct := _loop_ct + 1;

      INSERT INTO _geo(id, geography, loop_ct)
      SELECT DISTINCT ON (g.id) g.id, g.geography, _loop_ct
      FROM   _path       p
      JOIN   geographies g ON ST_Intersects(g.geography, p.path)
      WHERE  p.loop_ct = _loop_ct - 1   -- only use last round!
      ON     CONFLICT ON CONSTRAINT g_uni DO NOTHING;  -- eliminate new dupes

      EXIT WHEN NOT FOUND;

      INSERT INTO _path(id, path, loop_ct)
      SELECT DISTINCT ON (p.id) p.id, p.path, _loop_ct
      FROM   _geo  g
      JOIN   paths p ON ST_Intersects(g.geography, p.path)
      WHERE  g.loop_ct = _loop_ct - 1
      ON     CONFLICT ON CONSTRAINT p_uni DO NOTHING;

      EXIT WHEN NOT FOUND;
   END LOOP;

   RETURN QUERY
   SELECT g.id, text 'geo'  FROM _geo g
   UNION ALL
   SELECT p.id, text 'path' FROM _path p;

END
$func$  LANGUAGE plpgsql;

Call:

SELECT * FROM public.function_name('{foo,bar}');

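Much faster than what you had.

Major points

  • You based queries on the whole set, instead of only the latest additions to the set. This gets increasingly slower with every loop for no reason. I added a loop counter (loop_ct) to avoid redundant work.

  • Be sure to have spatial GiST indexes on geographies.geography and paths.path: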

    CREATE INDEX geo_geo_gix ON geographies USING GIST (geography);
    CREATE INDEX paths_path_gix ON paths USING GIST (path);
    

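    Since Postgres 9.5, index-only scans are an option for GiST indexes. You might add id as a second index column. The benefit depends on many factors, so you would have to test. However, there is no fitting GiST operator class for the uuid type. It would work with bigint after installing the extension btree_gist.

  • Have a fitting index on g.fk_id, too. Again, a multicolumn index on (fk_id, id, geography) might pay if you can get index-only scans out of it. Use a default btree index, and fk_id must be the first index column. This pays especially if you run the query often, rarely update the table, and table rows are much wider than the index.

  • You can initialize variables at declaration time. It is only needed once after the rewrite.

  • ON COMMIT DROP drops the temp tables automatically at the end of the transaction, so I removed the explicit DROP TABLE statements. But you get an exception if you call the function twice in the same transaction. In the function, I would check for the existence of the temp table and use TRUNCATE in that case.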

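  • Use GET DIAGNOSTICS to get the row count instead of running another count query.

    An earlier version of this answer claimed that you don't need to count at all after the rewrite and that a cheap check of FOUND would be enough. Actually, you do need GET DIAGNOSTICS: CREATE TABLE does not set FOUND (as listed in the manual). My original (tested) function had INSERT, which does set FOUND, hence the oversight. Fixed now.

  • It is faster to add an index or a PK / UNIQUE constraint after populating the table, and not before we actually need it.

  • ON CONFLICT ... DO ... is the simpler and cheaper way to UPSERT since Postgres 9.5.

    For the simple form of the command you just list index columns or expressions (like ON CONFLICT (id) DO ...) and let Postgres perform unique index inference to determine an arbiter constraint or index. I later optimized by providing the constraint directly. But for this we need an actual constraint - a unique index is not enough. Fixed accordingly. Details in the manual here.

  • It may help to ANALYZE the temporary tables manually to help Postgres find the best query plan. (But I don't think you need it in your case.)

  • _geo_ct - _geographyLength > 0 is an awkward and more expensive way of saying _geo_ct > _geographyLength. But that is gone completely now.

  • Don't quote the language name. Just LANGUAGE plpgsql.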

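  • Your function parameter is varchar[] for an array of fk_id, but you later commented:

      "It is a bigint field that represents a geographic area (it's actually a precomputed s2cell id at level 15)."

    I don't know s2cell ids at level 15, but ideally you pass an array of the matching data type, or, if that is not an option, default to text[].

    Also, you commented:

      "There are always exactly 13 fk_ids passed in."

    This seems like a perfect use case for a VARIADIC function parameter. So your function definition would be: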

    CREATE OR REPLACE FUNCTION public.function_name(_fk_ids VARIADIC text[]) ...
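With a VARIADIC parameter the ids can be passed either as individual arguments or as one array. A short sketch of both call styles, reusing the placeholder values 'foo' and 'bar' from the call shown earlier (the values are illustrative, not real fk_ids):

```sql
-- Pass the ids as separate arguments ...
SELECT * FROM public.function_name('foo', 'bar');

-- ... or hand over an existing array with the VARIADIC keyword.
SELECT * FROM public.function_name(VARIADIC '{foo,bar}'::text[]);
```

A call with no arguments at all does not match a VARIADIC parameter, so an empty input has to go through the array form (VARIADIC '{}'::text[]). For inputs of several hundred ids the array form is also the safer choice, because Postgres limits the number of arguments per function call (100 by default).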


Solution 2: Plain SQL with recursive CTE

It is hard to wrap an rCTE around two alternating loops, but it is possible with some SQL finesse:

WITH RECURSIVE cte AS (
   SELECT g.id, g.geography::text, NULL::text AS path, text 'geo' AS type
   FROM   geographies g
   WHERE  g.fk_id = ANY($fk_ids)  -- placeholder: your input array here

   UNION
   SELECT COALESCE(p.id, g.id), g.geography::text, p.path::text  -- id of whichever side matched
        , CASE WHEN p.path IS NULL THEN 'geo' ELSE 'path' END AS type
   FROM   cte              c
   LEFT   JOIN paths       p ON c.type = 'geo'
                            AND ST_Intersects(c.geography::geography, p.path)
   LEFT   JOIN geographies g ON c.type = 'path'
                            AND ST_Intersects(g.geography, c.path::geography)
   WHERE (p.path IS NOT NULL OR g.geography IS NOT NULL)
   )
SELECT id, type FROM cte;

That's all. You need the same indexes as above. You might wrap it into an SQL or PL/pgSQL function for repeated use.
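One way such a wrapper could look, as a sketch under stated assumptions: the function name find_connected is hypothetical, the tables are the adapted ones from the setup above, and the function is declared STABLE on the assumption that it only reads data. The id column uses COALESCE(p.id, g.id) so that geographies found in the recursive step carry their own id.

```sql
-- Hypothetical wrapper around the recursive CTE; not part of the original answer.
CREATE OR REPLACE FUNCTION public.find_connected(_fk_ids text[])
  RETURNS TABLE (id uuid, type text) AS
$func$
WITH RECURSIVE cte AS (
   -- seed: geographies selected by the input array
   SELECT g.id, g.geography::text AS geography, NULL::text AS path, text 'geo' AS type
   FROM   geographies g
   WHERE  g.fk_id = ANY(_fk_ids)

   UNION  -- discards rows that were already found
   SELECT COALESCE(p.id, g.id), g.geography::text, p.path::text
        , CASE WHEN p.path IS NULL THEN 'geo' ELSE 'path' END
   FROM   cte c
   LEFT   JOIN paths       p ON c.type = 'geo'
                            AND ST_Intersects(c.geography::geography, p.path)
   LEFT   JOIN geographies g ON c.type = 'path'
                            AND ST_Intersects(g.geography, c.path::geography)
   WHERE  p.path IS NOT NULL OR g.geography IS NOT NULL
   )
SELECT c.id, c.type FROM cte c;
$func$ LANGUAGE sql STABLE;

-- Usage, with the same placeholder ids as before:
SELECT * FROM public.find_connected('{foo,bar}');
```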

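Major additional points

  • The cast to text is necessary because the geography type is not "hashable" (same for geometry). (See this open PostGIS issue for details.) Work around it by casting to text. Rows are unique by virtue of (id, type) alone, so we can ignore the geography columns for this. Cast back to geography for the join. That shouldn't cost too much extra.

  • We need two LEFT JOINs so as not to exclude rows, because at each iteration only one of the two tables may contribute more rows.
    The final condition makes sure we are not done yet: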

    WHERE (p.path IS NOT NULL OR g.geography IS NOT NULL)
    

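    This works because duplicate findings are excluded from the temporary intermediate table. The manual:

      "For UNION (but not UNION ALL), discard duplicate rows and rows that duplicate any previous result row. Include all remaining rows in the result of the recursive query, and also place them in a temporary intermediate table."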
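The effect of UNION on termination can be seen in isolation with a small sketch that needs no PostGIS. The edges data below is made up for illustration and contains a cycle (1 -> 2 -> 3 -> 1):

```sql
WITH RECURSIVE edges(a, b) AS (
   VALUES (1, 2), (2, 3), (3, 1)        -- illustrative data with a cycle
)
, reach AS (
   SELECT 1 AS node                     -- start node
   UNION                                -- UNION ALL would loop forever on the cycle
   SELECT e.b
   FROM   reach r
   JOIN   edges e ON e.a = r.node
)
SELECT node FROM reach ORDER BY node;   -- returns 1, 2, 3
```

When the walk gets back to node 1, the row duplicates an earlier result, is discarded, and the intermediate table is empty, so the recursion stops. The geography/path query above terminates for the same reason once no new (id, type) rows turn up.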

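So which is faster?

The rCTE is probably faster than the function for small result sets. The temp tables and indexes in the function mean more overhead. For large result sets, however, the function may be faster. Only testing with your actual setup can give you a definitive answer.*
 *See the OP's feedback in the comment.

Answer 1: (score: 3)

I figured it would be good to post my own solution here, even if it is not optimal.

Here is what I came up with (using Steve Chambers' advice):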

CREATE OR REPLACE FUNCTION public.function_name(
    _fk_ids character varying[])
    RETURNS TABLE(id uuid, type character varying)
    LANGUAGE 'plpgsql'
    COST 100.0
    VOLATILE
    ROWS 1000.0
AS $function$

    DECLARE
        _pathLength bigint;
        _geographyLength bigint;

        _currentPathLength bigint;
        _currentGeographyLength bigint;
    BEGIN
        DROP TABLE IF EXISTS _pathIds;
        DROP TABLE IF EXISTS _geographyIds;
        CREATE TEMPORARY TABLE _pathIds (id UUID PRIMARY KEY);
        CREATE TEMPORARY TABLE _geographyIds (id UUID PRIMARY KEY);

        -- get all geographies in the specified _fk_ids
        INSERT INTO _geographyIds
            SELECT g.id
            FROM geographies g
            WHERE g.fk_id= ANY(_fk_ids);

        _pathLength := 0;
        _geographyLength := 0;
        _currentPathLength := 0;
        _currentGeographyLength := (SELECT COUNT(_geographyIds.id) FROM _geographyIds);
        -- _pathIds := ARRAY[]::uuid[];

        WHILE (_currentPathLength - _pathLength > 0) OR (_currentGeographyLength - _geographyLength > 0) LOOP
            _pathLength := (SELECT COUNT(_pathIds.id) FROM _pathIds);
            _geographyLength := (SELECT COUNT(_geographyIds.id) FROM _geographyIds);

            -- get all paths that intersect the known geographies and are not yet in the list of path ids

            INSERT INTO _pathIds 
                SELECT DISTINCT p.id
                    FROM paths p
                    JOIN geographies g ON ST_Intersects(g.geography, p.path)
                    WHERE
                        g.id IN (SELECT _geographyIds.id FROM _geographyIds) AND
                        p.id NOT IN (SELECT _pathIds.id from _pathIds);

            -- get all geographies that intersect the known paths and are not yet in the list of geography ids
            INSERT INTO _geographyIds 
                SELECT DISTINCT g.id
                    FROM geographies g
                    JOIN paths p ON ST_Intersects(g.geography, p.path)
                    WHERE
                        p.id IN (SELECT _pathIds.id FROM _pathIds) AND
                        g.id NOT IN (SELECT _geographyIds.id FROM _geographyIds);

            _currentPathLength := (SELECT COUNT(_pathIds.id) FROM _pathIds);
            _currentGeographyLength := (SELECT COUNT(_geographyIds.id) FROM _geographyIds);
        END LOOP;

        RETURN QUERY
            SELECT _geographyIds.id, 'geography' AS type FROM _geographyIds
            UNION ALL
            SELECT _pathIds.id, 'path' AS type FROM _pathIds;
    END;

$function$;

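Answer 2: (score: 1)

Sample plot and data from this script.

It can be purely relational with an aggregate function. This implementation uses one path table and one point table. Both are geometries. Points are easier to create test data for and to test with than generic geographies, but it should be simple to adapt.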

create table path (
    path_text text primary key,
    path geometry(linestring) not null
);
create table point (
   point_text text primary key,
   point geometry(point) not null
);

A type to hold the aggregate function state:

create type mpath_mpoint as (
    mpath geometry(multilinestring),
    mpoint geometry(multipoint)
);

The state building function:

create or replace function path_point_intersect (
    _i mpath_mpoint[], _e mpath_mpoint
) returns mpath_mpoint[] as $$

    with e as (select (e).mpath, (e).mpoint from (values (_e)) e (e)),
    i as (select mpath, mpoint from unnest(_i) i (mpath, mpoint))
    select array_agg((mpath, mpoint)::mpath_mpoint)
    from (
        select
            st_multi(st_union(i.mpoint, e.mpoint)) as mpoint,
            (
                select st_collect(gd)
                from (
                    select gd from st_dump(i.mpath) a (a, gd)
                    union all
                    select gd from st_dump(e.mpath) b (a, gd)
                ) s
            ) as mpath
        from i inner join e on st_intersects(i.mpoint, e.mpoint)

        union all
        select i.mpoint, i.mpath
        from i inner join e on not st_intersects(i.mpoint, e.mpoint)

        union all
        select e.mpoint, e.mpath
        from e
        where not exists (
            select 1 from i
            where st_intersects(i.mpoint, e.mpoint)
        )
    ) s;
$$ language sql;

The aggregate:

create aggregate path_point_agg (mpath_mpoint) (
    sfunc = path_point_intersect,
    stype = mpath_mpoint[]
);

This query returns a set of (multilinestring, multipoint) rows, each containing the matched paths and points:

select st_astext(mpath), st_astext(mpoint)
from unnest((
    select path_point_agg((st_multi(path), st_multi(mpoint))::mpath_mpoint)
    from (
        select path, st_union(point) as mpoint
        from
            path 
            inner join
            point on st_intersects(path, point)
        group by path
    ) s
)) m (mpath, mpoint)
;
                         st_astext                         |          st_astext          
-----------------------------------------------------------+-----------------------------
 MULTILINESTRING((-10 0,10 0,8 3),(0 -10,0 10),(2 1,4 -1)) | MULTIPOINT(0 0,0 5,3 0,5 0)
 MULTILINESTRING((-9 -8,4 -8),(-8 -9,-8 6))                | MULTIPOINT(-8 -8,2 -8)
 MULTILINESTRING((-7 -4,-3 4,-5 6))                        | MULTIPOINT(-6 -2)