I want to write a query using Postgres and PostGIS. I am also using Rails with rgeo, rgeo-activerecord and activerecord-postgis-adapter, but the Rails part is fairly unimportant here.
Table structure:
measurement
- int id
- int anchor_id
- Point groundtruth
- data (not important for the query)
Example data:
id | anchor_id | groundtruth | data
-----------------------------------
1 | 1 | POINT(1 4) | ...
2 | 3 | POINT(1 4) | ...
3 | 2 | POINT(1 4) | ...
4 | 3 | POINT(1 4) | ...
-----------------------------------
5 | 2 | POINT(3 2) | ...
6 | 4 | POINT(3 2) | ...
-----------------------------------
7 | 1 | POINT(4 3) | ...
8 | 1 | POINT(4 3) | ...
9 | 1 | POINT(4 3) | ...
10 | 5 | POINT(4 3) | ...
11 | 3 | POINT(4 3) | ...
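This table is a kind of manually created view for faster lookups (it has millions of rows). Otherwise we would have to join 8 tables and it would be even slower. But that is not part of the problem.
Parameters:
- a point p
- a distance d
What the query should do:
1. The query finds all groundtruth points whose distance to the point p is less than d.
The SQL for that is simple: WHERE st_distance(groundtruth, p) < d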
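2. Now we have a list of groundtruth points with their anchor_ids. As you can see in the table above, there can be several identical groundtruth-anchor_id tuples, for example anchor_id=3 and groundtruth=POINT(1 4).
3. Next, I would like to eliminate the identical tuples by picking one of them at random (!). Why not simply take the first one? Because the data column differs.
Picking a random row in SQL: SELECT ... ORDER BY RANDOM() LIMIT 1
My problem with all of this: I can imagine a solution using SQL LOOPs and lots of subqueries, but there is surely a solution using GROUP BY or some other method that would be faster.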
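The full problem is basically the same as above, with one difference: the input parameters change:
- many points p1 ... p312456345
- still a single d
If the simple query works, this could be done with a LOOP in SQL. But maybe there is a better (and faster) solution, because the database is really huge!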
WITH ps AS (SELECT unnest(p_array) AS p)
SELECT DISTINCT ON (anchor_id, groundtruth)
*
FROM measurement m
WHERE EXISTS (
SELECT 1
FROM ps
WHERE st_distance(m.groundtruth, ps.p) < d
)
ORDER BY anchor_id, groundtruth, random();
Thanks to Erwin Brandstetter!
Answer 0 (score: 1)
To eliminate duplicates, this is probably the most efficient query in PostgreSQL:
SELECT DISTINCT ON (anchor_id, groundtruth) *
FROM measurement
WHERE st_distance(p, groundtruth) < d
As mentioned in the comments, this gives you an arbitrary pick. If you need a random one, it is somewhat more expensive:
SELECT DISTINCT ON (anchor_id, groundtruth) *
FROM measurement
WHERE st_distance(p, groundtruth) < d
ORDER BY anchor_id, groundtruth, random()
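For comparison, the same "one random row per (anchor_id, groundtruth)" idea can also be written with a window function. This is only a sketch under the same assumptions as the queries above (p and d are placeholders for the input point and the distance); the alias ranked and the column rn are names introduced here for illustration, and I have not measured it against DISTINCT ON:

```sql
-- Sketch: number the rows of each (anchor_id, groundtruth) group in random
-- order, then keep only the first row of every group.
SELECT *
FROM (
    SELECT m.*,
           row_number() OVER (PARTITION BY anchor_id, groundtruth
                              ORDER BY random()) AS rn
    FROM measurement m
    WHERE st_distance(p, groundtruth) < d   -- p and d are placeholders
) ranked
WHERE rn = 1;
```

Unlike DISTINCT ON, this form is not Postgres-specific, but the result carries the extra rn column.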
The second part is harder to optimize. An EXISTS semi-join is probably the fastest choice. For a given table ps (p point):
SELECT DISTINCT ON (anchor_id, groundtruth) *
FROM measurement m
WHERE EXISTS (
SELECT 1
FROM ps
WHERE st_distance(ps.p, m.groundtruth) < d
)
ORDER BY anchor_id, groundtruth, random();
This can stop evaluating as soon as one p is close enough, and it keeps the rest of the query simple.
Be sure to back this up with a matching GiST index.
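A minimal sketch of such an index, assuming groundtruth is a PostGIS geometry column. The index name is made up for illustration. Note that, as far as I know, a plain st_distance(...) < d comparison cannot use this index, while ST_DWithin() can:

```sql
-- Hypothetical index name; assumes groundtruth is a geometry column.
CREATE INDEX measurement_groundtruth_gist
    ON measurement USING gist (groundtruth);
```

Running EXPLAIN on the query afterwards shows whether the planner actually picks the index.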
If you have an array as input, create a CTE with unnest() on the fly:
WITH ps AS (SELECT unnest(p_array) AS p)
SELECT ...
If you only need a single row as the answer, you can simplify:
WITH ps AS (SELECT unnest(p_array) AS p)
SELECT *
FROM measurement m
WHERE EXISTS (
SELECT 1
FROM ps
WHERE st_distance(ps.p, m.groundtruth) < d
)
LIMIT 1;
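ST_DWithin()
Using the function ST_DWithin() (together with a matching GiST index!) is probably more efficient.
To get a single row (using a subselect instead of a CTE here):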
SELECT *
FROM measurement m
JOIN (SELECT unnest(p_array) AS p) ps ON ST_DWithin(ps.p, m.groundtruth, d)
LIMIT 1;
To get one row for every point p within distance d:
SELECT DISTINCT ON (ps.p) *
FROM measurement m
JOIN (SELECT unnest(p_array) AS p) ps ON ST_DWithin(ps.p, m.groundtruth, d)
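Adding ORDER BY random() makes this query more expensive. Without random(), Postgres can just pick the first matching row from the GiST index. Otherwise, all possible matches have to be retrieved and sorted randomly.
LIMIT 1 inside EXISTS is pointless. Read the manual at the link I provided or this related question.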
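If a random row per input point is really needed, one possible way to write it is a LATERAL subquery (available since PostgreSQL 9.3). This is a sketch, not a measured recommendation: p_array and d are the same placeholders as above, and every match for a point still has to be fetched before one can be picked at random:

```sql
-- Sketch: for each input point, fetch all measurements within distance d
-- and keep one of them at random. Points without any match drop out.
SELECT ps.p, m.*
FROM unnest(p_array) AS ps(p)            -- p_array is a placeholder
CROSS JOIN LATERAL (
    SELECT *
    FROM measurement
    WHERE ST_DWithin(ps.p, groundtruth, d)   -- d is a placeholder
    ORDER BY random()
    LIMIT 1
) m;
```

Using LEFT JOIN LATERAL (...) m ON true instead of CROSS JOIN LATERAL would keep the input points that have no match.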
Answer 1 (score: 0)
I have cracked it now, but the query is slow...
WITH
ps AS (
SELECT unnest(p_array) AS p
),
gtps AS (
SELECT DISTINCT ON(ps.p)
ps.p, m.groundtruth
FROM measurement m, ps
WHERE st_distance(m.groundtruth, ps.p) < d
ORDER BY ps.p, RANDOM()
)
SELECT DISTINCT ON(gtps.p, gtps.groundtruth, m.anchor_id)
m.id, m.anchor_id, gtps.groundtruth, gtps.p
FROM measurement m, gtps
ORDER BY gtps.p, gtps.groundtruth, m.anchor_id, RANDOM()
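My test database contains 22000 rows; with two input values the query takes about 700 ms. In the end there can be hundreds of input values :-/
The result now looks like this: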
id | anchor_id | groundtruth | p
-----------------------------------------
20 | 1 | POINT(0 2) | POINT(1 0)
14 | 3 | POINT(0 2) | POINT(1 0)
5 | 8 | POINT(0 2) | POINT(1 0)
42 | 2 | POINT(4 1) | POINT(2 2)
11 | 3 | POINT(4 8) | POINT(4 8)
4 | 6 | POINT(4 8) | POINT(4 8)
1 | 1 | POINT(6 2) | POINT(7 3)
9 | 5 | POINT(6 2) | POINT(7 3)
25 | 3 | POINT(6 2) | POINT(9 1)
13 | 6 | POINT(6 2) | POINT(9 1)
18 | 7 | POINT(6 2) | POINT(9 1)
SELECT
ps.p, m.groundtruth, ARRAY_AGG(m.anchor_id) AS anchor_arr, ARRAY_AGG(m.id) AS id_arr
FROM
measurement m
JOIN
(SELECT unnest(point_array) AS p) AS ps
ON ST_DWithin(ps.p, m.groundtruth, 0.5)
GROUP BY groundtruth, ps.p
Actual result:
p | groundtruth | anchor_arr | id_arr
--------------------------------------------------
P1 | G1 | {1,3,2,..} | {9,8,11,..}
P1 | G2 | {4,3,5,..} | {1,8,23,..}
P1 | G3 | {6,8,9,..} | {12,7,6,..}
P2 | G1 | {6,6,2,..} | {15,2,10,..}
P2 | G4 | {7,9,1,..} | {5,4,3,..}
... | ... | ... | ...
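So this is what I get now, per row:
- an array with the anchor_id part of the matching groundtruth-anchor_id tuples
- an array with the ids of those groundtruth-anchor_id relations
Keep in mind:
- the same groundtruth can show up for more than one input point
- a single groundtruth value can have several identical anchor_ids
- every groundtruth-anchor_id tuple has a distinct id
So what is still missing to finish this, for each ps.p?
- for every anchor_id that occurs more than once in the array: keep a random one and delete all the others. This also means removing the corresponding id from the id array for every deleted anchor_id.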
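One way to get there might be to remove the duplicates before aggregating, instead of filtering the arrays afterwards. The following is only a sketch built from the pieces already shown in this thread: point_array and 0.5 are taken from the query above, the alias picked is introduced here for illustration, and I have not run it against the real data:

```sql
-- Sketch: first keep one random row per (p, groundtruth, anchor_id),
-- then aggregate the surviving rows into the two arrays.
SELECT p,
       groundtruth,
       array_agg(anchor_id) AS anchor_arr,
       array_agg(id)        AS id_arr
FROM (
    SELECT DISTINCT ON (ps.p, m.groundtruth, m.anchor_id)
           ps.p, m.groundtruth, m.anchor_id, m.id
    FROM measurement m
    JOIN (SELECT unnest(point_array) AS p) AS ps
      ON ST_DWithin(ps.p, m.groundtruth, 0.5)
    ORDER BY ps.p, m.groundtruth, m.anchor_id, random()
) picked
GROUP BY p, groundtruth;
```

Because both arrays are built from the same already-deduplicated rows, each anchor_id appears once per row and the id at the same array position belongs to it.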