I'm running Postgres 9.6.1 and PostGIS 2.3.0 r15146 and have two tables. geographies may have 150,000,000 rows; paths may have 10,000,000 rows:
CREATE TABLE paths (id uuid NOT NULL, path path NOT NULL, PRIMARY KEY (id))
CREATE TABLE geographies (id uuid NOT NULL, geography geography NOT NULL, PRIMARY KEY (id))
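Given an array of ids from the geographies table, what is the "best" way to find all intersecting paths and geographies?
In other words, if an initial geography has a corresponding intersecting path, we also need to find all other geographies that this path intersects. From there, we need to find all other paths that these newly found geographies intersect, and so on until we have found all possible intersections.
The number of initial geography ids (our input) may be anywhere from 0 to 700, with an average of around 40. The minimum number of intersections will be 0 and the maximum about 1000. The average will likely be around 20, typically fewer than 100 connected.
I've created a function that does this, but I'm new to GIS in PostGIS and to Postgres in general. I have posted my solution as an answer to this question.
I feel like there should be a more elegant and faster way than what I came up with.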
Answer 0 (score: 9)
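Your function can be radically simplified.
I suggest you convert the column paths.path to data type geography (or at least geometry). path is a native Postgres type and does not play well with PostGIS functions and spatial indexes. You would have to cast path::geometry or path::geometry::geography (resulting in a LINESTRING internally) to make it work with PostGIS functions like ST_Intersects().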
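If converting the column is an option, a one-time migration might look like the sketch below. It assumes the coordinates stored in path are longitude/latitude degrees (to my knowledge, the cast to geography treats an unknown SRID as 4326). The statement rewrites the whole table under an exclusive lock, so try it on a copy first.

```sql
-- Sketch: convert the native Postgres path column to PostGIS geography.
-- Assumption: the stored coordinates are lon/lat degrees.
ALTER TABLE paths
   ALTER COLUMN path TYPE geography
   USING path::geometry::geography;
```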
My answer is based on these adapted tables:
CREATE TABLE paths (
id uuid PRIMARY KEY
, path geography NOT NULL
);
CREATE TABLE geographies (
id uuid PRIMARY KEY
, geography geography NOT NULL
, fk_id text NOT NULL
);
Everything works just as well with data type geometry for both columns. geography is generally more exact but also more expensive. Which to use? Read the PostGIS FAQ here.
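To get a feel for the difference, one could run the same measurement on both types. This is only an illustration with made-up points: geometry computes in planar units of the coordinate system, while geography computes on the spheroid and returns meters.

```sql
-- Same two points, two data types (illustrative values only)
SELECT ST_Distance('POINT(0 0)'::geometry,  'POINT(1 0)'::geometry)  AS planar_units  -- degrees here
     , ST_Distance('POINT(0 0)'::geography, 'POINT(1 0)'::geography) AS meters;
```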
CREATE OR REPLACE FUNCTION public.function_name(_fk_ids text[])
RETURNS TABLE(id uuid, type text) AS
$func$
DECLARE
_row_ct int;
_loop_ct int := 0;
BEGIN
CREATE TEMP TABLE _geo ON COMMIT DROP AS -- dropped at end of transaction
SELECT DISTINCT ON (g.id) g.id, g.geography, _loop_ct AS loop_ct -- dupes possible?
FROM geographies g
WHERE g.fk_id = ANY(_fk_ids);
GET DIAGNOSTICS _row_ct = ROW_COUNT;
IF _row_ct = 0 THEN -- no rows found, return empty result immediately
RETURN; -- exit function
END IF;
CREATE TEMP TABLE _path ON COMMIT DROP AS
SELECT DISTINCT ON (p.id) p.id, p.path, _loop_ct AS loop_ct
FROM _geo g
JOIN paths p ON ST_Intersects(g.geography, p.path); -- no dupes yet
GET DIAGNOSTICS _row_ct = ROW_COUNT;
IF _row_ct = 0 THEN -- no rows found, return _geo immediately
RETURN QUERY SELECT g.id, text 'geo' FROM _geo g;
RETURN;
END IF;
ALTER TABLE _geo ADD CONSTRAINT g_uni UNIQUE (id); -- required for UPSERT
ALTER TABLE _path ADD CONSTRAINT p_uni UNIQUE (id);
LOOP
_loop_ct := _loop_ct + 1;
INSERT INTO _geo(id, geography, loop_ct)
SELECT DISTINCT ON (g.id) g.id, g.geography, _loop_ct
FROM _path p -- temp table created above is named _path
JOIN geographies g ON ST_Intersects(g.geography, p.path)
WHERE p.loop_ct = _loop_ct - 1 -- only use last round!
ON CONFLICT ON CONSTRAINT g_uni DO NOTHING; -- eliminate new dupes
EXIT WHEN NOT FOUND;
INSERT INTO _path(id, path, loop_ct)
SELECT DISTINCT ON (p.id) p.id, p.path, _loop_ct
FROM _geo g
JOIN paths p ON ST_Intersects(g.geography, p.path)
WHERE g.loop_ct = _loop_ct - 1
ON CONFLICT ON CONSTRAINT p_uni DO NOTHING;
EXIT WHEN NOT FOUND;
END LOOP;
RETURN QUERY
SELECT g.id, text 'geo' FROM _geo g
UNION ALL
SELECT p.id, text 'path' FROM _path p;
END
$func$ LANGUAGE plpgsql;
Call:
SELECT * FROM public.function_name('{foo,bar}');
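This should be a lot faster than what you had.
You based queries on the whole set instead of only the latest additions to the set. That gets slower with every loop for no reason. I added a loop counter (loop_ct) to avoid redundant work.
Be sure to have spatial GiST indexes on geographies.geography and paths.path: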
CREATE INDEX geo_geo_gix ON geographies USING GIST (geography);
CREATE INDEX paths_path_gix ON paths USING GIST (path);
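Since Postgres 9.5, index-only scans are an option for GiST indexes. You might add id as a second index column. The benefit depends on many factors; you would have to test. However, there is no fitting GiST operator class for the uuid type. It would work with bigint after installing the extension btree_gist.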
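As a sketch of what that could look like, assuming a hypothetical variant of the table whose id column is bigint rather than uuid (the index name is made up, too):

```sql
CREATE EXTENSION IF NOT EXISTS btree_gist;  -- GiST operator classes for scalar types like bigint

-- Hypothetical: only applicable if geographies.id were bigint
CREATE INDEX geo_geo_id_gix ON geographies USING GIST (geography, id);
```

Whether the planner then actually picks an index-only scan is something to verify with EXPLAIN on real data.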
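Also have a fitting index on g.fk_id. Again, a multicolumn index on (fk_id, id, geography) might pay off if you can get index-only scans out of it. That would be a default btree index, and fk_id must be the first index column. It helps most if you run the query often, rarely update the table, and table rows are much wider than the index.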
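The two variants could be created as sketched here. The index names are my own, and the multicolumn variant is left commented out because very large geography values may not fit into a btree index entry.

```sql
-- Plain btree index to support: WHERE g.fk_id = ANY(_fk_ids)
CREATE INDEX geographies_fk_id_idx ON geographies (fk_id);

-- Candidate for index-only scans; fk_id leads. Test on real data first.
-- CREATE INDEX geographies_fk_id_id_geo_idx ON geographies (fk_id, id, geography);
```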
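You can initialize variables at declaration time. That's only needed once after the rewrite.
ON COMMIT DROP drops the temp tables automatically at the end of the transaction, so I removed the explicit DROP TABLE statements. But you get an exception if you call the function twice in the same transaction. In the function, I would check for existence of the temp table and use TRUNCATE in that case.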
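A minimal sketch of that existence check, written as a standalone DO block for illustration. Inside the function, the same IF would take the place of the CREATE TEMP TABLE ... AS statement. The column list is my assumption, derived from the function above.

```sql
DO
$do$
BEGIN
   IF to_regclass('pg_temp._geo') IS NULL THEN   -- no temp table of that name yet
      CREATE TEMP TABLE _geo (
         id        uuid
       , geography geography
       , loop_ct   int
      ) ON COMMIT DROP;
   ELSE
      TRUNCATE _geo;                              -- reuse the table left by an earlier call
   END IF;
END
$do$;
```

On reuse, the UNIQUE constraint from the first call would already exist, so the ALTER TABLE ... ADD CONSTRAINT step would have to be skipped as well.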
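Use GET DIAGNOSTICS to get the row count instead of running another count query.
After the rewrite, the INSERT steps in the loop need no counting at all; cheaply checking FOUND is enough there.
The CREATE TABLE AS steps do need GET DIAGNOSTICS, though: CREATE TABLE does not set FOUND (see the list in the manual). My original (tested) function had INSERT, which sets FOUND, hence the oversight. Fixed now.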
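The difference can be observed with a throwaway DO block like this one (table name and values are made up for the demonstration):

```sql
DO
$do$
DECLARE
   _row_ct int;
BEGIN
   CREATE TEMP TABLE _demo ON COMMIT DROP AS
   SELECT n FROM generate_series(1, 3) n;

   GET DIAGNOSTICS _row_ct = ROW_COUNT;          -- row count of the CREATE TABLE AS
   RAISE NOTICE 'CREATE TABLE AS: row_ct = %, FOUND = %', _row_ct, FOUND;

   INSERT INTO _demo VALUES (4);                 -- INSERT does set FOUND
   RAISE NOTICE 'INSERT: FOUND = %', FOUND;
END
$do$;
```

The first notice should report the row count with FOUND still false, the second should report FOUND as true.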
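It's faster to add an index or a PK / UNIQUE constraint after populating the table, and not before we actually need it.
ON CONFLICT ... DO ... is the simpler and cheaper way to UPSERT.
For the simple form of the command you just list index columns or expressions (like ON CONFLICT (id) DO ...) and let Postgres perform unique index inference to determine an arbiter constraint or index. I later optimized by providing the constraint directly. But for that we need an actual constraint - a unique index is not enough. Fixed accordingly. Details in the manual here.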
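Both forms side by side, on a made-up temp table (names are for illustration only):

```sql
CREATE TEMP TABLE _demo_upsert (id int, CONSTRAINT demo_uni UNIQUE (id));

-- Form 1: name the column(s), let Postgres infer the arbiter index
INSERT INTO _demo_upsert VALUES (1), (2)
ON CONFLICT (id) DO NOTHING;

-- Form 2: name the constraint directly (needs a real constraint, not just a unique index)
INSERT INTO _demo_upsert VALUES (2), (3)
ON CONFLICT ON CONSTRAINT demo_uni DO NOTHING;

SELECT id FROM _demo_upsert ORDER BY id;  -- 1, 2, 3
```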
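It may help to ANALYZE temporary tables manually to help Postgres find the best query plan. (But I don't think you need it in your case.)
_geo_ct - _geographyLength > 0 is an awkward and more expensive way of saying _geo_ct > _geographyLength. But that's gone completely now.
Don't quote the language name. Just LANGUAGE plpgsql.
Your function parameter is varchar[] for an array of fk_id, but you later commented:
"It is a bigint field that represents a geographic area (it's actually a precomputed s2cell id at level 15)."
I don't know about s2cell ids at level 15, but ideally you pass an array of the matching data type or, if that's not an option, default to text[].
Also, you commented:
"There are always exactly 13 fk_ids passed in."
This seems like a perfect use case for a VARIADIC function parameter. So your function definition would be: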
CREATE OR REPLACE FUNCTION public.function_name(_fk_ids VARIADIC text[]) ...
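With a VARIADIC parameter, the function could then be called in either of these ways (the values are placeholders):

```sql
-- Pass the ids as separate arguments ...
SELECT * FROM public.function_name('foo', 'bar');

-- ... or still hand over an actual array with the keyword VARIADIC
SELECT * FROM public.function_name(VARIADIC '{foo,bar}'::text[]);
```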
It's hard to wrap an rCTE around two alternating loops, but it's possible with some SQL finesse:
WITH RECURSIVE cte AS (
SELECT g.id, g.geography::text, NULL::text AS path, text 'geo' AS type
FROM geographies g
WHERE g.fk_id = ANY($fk_ids) -- your input array here
UNION
SELECT COALESCE(p.id, g.id), g.geography::text, p.path::text -- id comes from whichever side joined
, CASE WHEN p.path IS NULL THEN 'geo' ELSE 'path' END AS type
FROM cte c
LEFT JOIN paths p ON c.type = 'geo'
AND ST_Intersects(c.geography::geography, p.path)
LEFT JOIN geographies g ON c.type = 'path'
AND ST_Intersects(g.geography, c.path::geography)
WHERE (p.path IS NOT NULL OR g.geography IS NOT NULL)
)
SELECT id, type FROM cte;
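That's all. You need the same indexes as above. You might wrap it into an SQL or PL/pgSQL function for repeated use.
The cast to text is necessary because the geography type is not "hashable" (same for geometry). (See this open PostGIS issue for details.) Work around it by casting to text. Rows are unique by (id, type) alone, so we can ignore the geography columns for that purpose. Cast back to geography for the join. This shouldn't cost much extra.
We need two LEFT JOINs so as not to exclude rows, because at each iteration only one of the two tables can contribute more rows.
The final condition makes sure we are not done yet: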
WHERE (p.path IS NOT NULL OR g.geography IS NOT NULL)
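This works because duplicate findings are excluded from the temporary intermediate table. The manual:
"For UNION (but not UNION ALL), discard duplicate rows and rows that duplicate any previous result row. Include all remaining rows in the result of the recursive query, and also place them in a temporary intermediate table."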
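A toy example, unrelated to the tables above, shows the effect. The made-up edges form a cycle 1 -> 2 -> 3 -> 1, yet the recursion stops because UNION discards the rediscovered node 1:

```sql
WITH RECURSIVE edges(a, b) AS (
   VALUES (1, 2), (2, 3), (3, 1)       -- made-up cyclic graph
   )
, walk AS (
   SELECT 1 AS node                    -- start node
   UNION                               -- UNION ALL would loop forever here
   SELECT e.b
   FROM   walk  w
   JOIN   edges e ON e.a = w.node
   )
SELECT node FROM walk ORDER BY node;   -- 1, 2, 3
```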
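The rCTE is probably faster than the function for small result sets. The temp tables and indexes in the function mean considerably more overhead. For large result sets the function may be faster, though. Only testing with your actual setup can give you a definitive answer.*
*See the OP's feedback in the comment.
Answer 1 (score: 3)
I thought it would be good to post my own solution here, even if it isn't optimal.
Here is what I came up with (using Steve Chambers' advice):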
CREATE OR REPLACE FUNCTION public.function_name(
_fk_ids character varying[])
RETURNS TABLE(id uuid, type character varying)
LANGUAGE 'plpgsql'
COST 100.0
VOLATILE
ROWS 1000.0
AS $function$
DECLARE
_pathLength bigint;
_geographyLength bigint;
_currentPathLength bigint;
_currentGeographyLength bigint;
BEGIN
DROP TABLE IF EXISTS _pathIds;
DROP TABLE IF EXISTS _geographyIds;
CREATE TEMPORARY TABLE _pathIds (id UUID PRIMARY KEY);
CREATE TEMPORARY TABLE _geographyIds (id UUID PRIMARY KEY);
-- get all geographies in the specified _fk_ids
INSERT INTO _geographyIds
SELECT g.id
FROM geographies g
WHERE g.fk_id= ANY(_fk_ids);
_pathLength := 0;
_geographyLength := 0;
_currentPathLength := 0;
_currentGeographyLength := (SELECT COUNT(_geographyIds.id) FROM _geographyIds);
-- _pathIds := ARRAY[]::uuid[];
WHILE (_currentPathLength - _pathLength > 0) OR (_currentGeographyLength - _geographyLength > 0) LOOP
_pathLength := (SELECT COUNT(_pathIds.id) FROM _pathIds);
_geographyLength := (SELECT COUNT(_geographyIds.id) FROM _geographyIds);
-- gets all paths that have paths that intersect the geographies that aren't in the current list of path ids
INSERT INTO _pathIds
SELECT DISTINCT p.id
FROM paths p
JOIN geographies g ON ST_Intersects(g.geography, p.path)
WHERE
g.id IN (SELECT _geographyIds.id FROM _geographyIds) AND
p.id NOT IN (SELECT _pathIds.id from _pathIds);
-- gets all geographies that have paths that intersect the paths that aren't in the current list of geography ids
INSERT INTO _geographyIds
SELECT DISTINCT g.id
FROM geographies g
JOIN paths p ON ST_Intersects(g.geography, p.path)
WHERE
p.id IN (SELECT _pathIds.id FROM _pathIds) AND
g.id NOT IN (SELECT _geographyIds.id FROM _geographyIds);
_currentPathLength := (SELECT COUNT(_pathIds.id) FROM _pathIds);
_currentGeographyLength := (SELECT COUNT(_geographyIds.id) FROM _geographyIds);
END LOOP;
RETURN QUERY
SELECT _geographyIds.id, 'geography' AS type FROM _geographyIds
UNION ALL
SELECT _pathIds.id, 'path' AS type FROM _pathIds;
END;
$function$;
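Answer 2 (score: 1)
Sample plot and data from this script
It can be purely relational with an aggregate function. This implementation uses one path table and one point table. Both are geometries. Points are easier to create test data for and to test with than generic geographies, but it should be simple to adapt.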
create table path (
path_text text primary key,
path geometry(linestring) not null
);
create table point (
point_text text primary key,
point geometry(point) not null
);
A type to hold the state of the aggregate function:
create type mpath_mpoint as (
mpath geometry(multilinestring),
mpoint geometry(multipoint)
);
The state building function:
create or replace function path_point_intersect (
_i mpath_mpoint[], _e mpath_mpoint
) returns mpath_mpoint[] as $$
with e as (select (e).mpath, (e).mpoint from (values (_e)) e (e)),
i as (select mpath, mpoint from unnest(_i) i (mpath, mpoint))
select array_agg((mpath, mpoint)::mpath_mpoint)
from (
select
st_multi(st_union(i.mpoint, e.mpoint)) as mpoint,
(
select st_collect(gd)
from (
select gd from st_dump(i.mpath) a (a, gd)
union all
select gd from st_dump(e.mpath) b (a, gd)
) s
) as mpath
from i inner join e on st_intersects(i.mpoint, e.mpoint)
union all
select i.mpoint, i.mpath
from i inner join e on not st_intersects(i.mpoint, e.mpoint)
union all
select e.mpoint, e.mpath
from e
where not exists (
select 1 from i
where st_intersects(i.mpoint, e.mpoint)
)
) s;
$$ language sql;
The aggregate:
create aggregate path_point_agg (mpath_mpoint) (
sfunc = path_point_intersect,
stype = mpath_mpoint[]
);
This query returns a set of (multilinestring, multipoint) rows, each holding a group of matched paths and points:
select st_astext(mpath), st_astext(mpoint)
from unnest((
select path_point_agg((st_multi(path), st_multi(mpoint))::mpath_mpoint)
from (
select path, st_union(point) as mpoint
from
path
inner join
point on st_intersects(path, point)
group by path
) s
)) m (mpath, mpoint)
;
st_astext | st_astext
-----------------------------------------------------------+-----------------------------
MULTILINESTRING((-10 0,10 0,8 3),(0 -10,0 10),(2 1,4 -1)) | MULTIPOINT(0 0,0 5,3 0,5 0)
MULTILINESTRING((-9 -8,4 -8),(-8 -9,-8 6)) | MULTIPOINT(-8 -8,2 -8)
MULTILINESTRING((-7 -4,-3 4,-5 6)) | MULTIPOINT(-6 -2)