Question

I want to write a query using Postgres and PostGIS. I'm also using Rails with rgeo, rgeo-activerecord and activerecord-postgis-adapter, but the Rails part is rather unimportant.

The table structure:

measurement
 - int id
 - int anchor_id
 - Point groundtruth
 - data (not important for the query)

Example data:

id | anchor_id | groundtruth | data
-----------------------------------
1  | 1         | POINT(1 4)  | ...
2  | 3         | POINT(1 4)  | ...
3  | 2         | POINT(1 4)  | ...
4  | 3         | POINT(1 4)  | ...
-----------------------------------
5  | 2         | POINT(3 2)  | ...
6  | 4         | POINT(3 2)  | ...
-----------------------------------
7  | 1         | POINT(4 3)  | ...
8  | 1         | POINT(4 3)  | ...
9  | 1         | POINT(4 3)  | ...
10 | 5         | POINT(4 3)  | ...
11 | 3         | POINT(4 3)  | ...

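This table is a kind of manually created view for faster lookups (it has millions of rows). Otherwise we would have to join 8 tables and it would get even slower. But that is not part of the problem.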

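Simple version:

Parameters:

Point p
int d

What the query should do:

1. The query looks for all groundtruth points whose distance from point p is less than d.

The SQL for that is pretty easy: WHERE st_distance(groundtruth, p) < d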

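2. Now we have a list of groundtruth points with their anchor_ids. As you can see in the table above, there can be multiple identical groundtruth-anchor_id tuples. For example: anchor_id=3 and groundtruth=POINT(1 4).

3. Next, I'd like to eliminate the identical tuples by choosing one of them randomly (!). Why not simply take the first? Because the data column differs.

Choosing a random row in SQL: SELECT ... ORDER BY RANDOM() LIMIT 1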
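The three steps above can be sketched outside the database to pin down the intended result. This is only a Python illustration of the semantics, not the SQL being asked for. The helper name pick_random_per_group and the tuple layout are made up for the example, and plain Euclidean distance stands in for st_distance. The rows come from the example data.

```python
import math
import random
from collections import defaultdict

# (id, anchor_id, groundtruth) rows from the example data; the data column is omitted.
ROWS = [
    (1, 1, (1, 4)), (2, 3, (1, 4)), (3, 2, (1, 4)), (4, 3, (1, 4)),
    (5, 2, (3, 2)), (6, 4, (3, 2)),
    (7, 1, (4, 3)), (8, 1, (4, 3)), (9, 1, (4, 3)), (10, 5, (4, 3)), (11, 3, (4, 3)),
]

def pick_random_per_group(rows, p, d, rng=random):
    """Hypothetical helper: one random row per (anchor_id, groundtruth) within distance d of p."""
    groups = defaultdict(list)
    for row in rows:
        _id, anchor_id, gt = row
        # Step 1: keep rows whose groundtruth is closer than d (planar distance assumed).
        if math.dist(gt, p) < d:
            # Step 2: identical (anchor_id, groundtruth) tuples land in the same group.
            groups[(anchor_id, gt)].append(row)
    # Step 3: choose one row at random from every group.
    return [rng.choice(members) for members in groups.values()]

result = pick_random_per_group(ROWS, p=(1, 4), d=1)
```

With p = (1, 4) and d = 1, only the POINT(1 4) rows qualify, so the result has one row each for anchor_id 1, 2 and 3. The row for anchor_id 3 is either id 2 or id 4, depending on the random pick.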

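My problem with all of this: I can imagine a solution using SQL LOOPs and lots of subqueries, but there is surely a solution using GROUP BY or some other method that would make it faster.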

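Full version:

Basically the same as above, with one difference: the input parameters change:

Lots of points p1 ... p312456345
Still a single d

If the simple query works, this could be done with a LOOP in SQL. But maybe there is a better (and faster) solution, because the database is really huge!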

Solution

WITH ps AS (SELECT unnest(p_array) AS p)
SELECT DISTINCT ON (anchor_id, groundtruth)
    *
FROM measurement m
WHERE EXISTS (
    SELECT 1
    FROM ps
    WHERE st_distance(m.groundtruth, ps.p) < d
)
ORDER BY anchor_id, groundtruth, random();

Thanks to Erwin Brandstetter!

Answer 1

To eliminate duplicates, this might be the most efficient query in PostgreSQL:

SELECT DISTINCT ON (anchor_id, groundtruth) *
FROM   measurement
WHERE  st_distance(p, groundtruth) < d

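More about this query style:

Select first row in each GROUP BY group?

As mentioned in the comments, this gives you an arbitrary pick. If you need a random pick, it is somewhat more expensive: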

SELECT DISTINCT ON (anchor_id, groundtruth) *
FROM   measurement
WHERE  st_distance(p, groundtruth) < d
ORDER  BY anchor_id, groundtruth, random()

The second part is harder to optimize. An EXISTS semi-join will probably be the fastest choice. For a given table ps (p point):

SELECT DISTINCT ON (anchor_id, groundtruth) *
FROM   measurement m
WHERE  EXISTS (
   SELECT 1
   FROM   ps
   WHERE  st_distance(ps.p, m.groundtruth) < d
   )
ORDER  BY anchor_id, groundtruth, random();

This can stop evaluating as soon as one p is close enough, and it keeps the rest of the query simple.
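The early-exit behaviour of the semi-join can be pictured with a small Python sketch. It is an analogy only, not how Postgres is implemented: near_any and logged_points are hypothetical names, and planar distance again stands in for st_distance.

```python
import math

def near_any(groundtruth, points, d):
    """True as soon as one point in `points` is closer than d to groundtruth."""
    # any() consumes the generator lazily and stops at the first True,
    # which mirrors how an EXISTS subquery can stop at the first matching row.
    return any(math.dist(groundtruth, p) < d for p in points)

checked = []

def logged_points(points):
    """Yield points while recording which ones were actually inspected."""
    for p in points:
        checked.append(p)
        yield p

ps = [(1, 4), (3, 2), (4, 3)]
hit = near_any((1, 4), logged_points(ps), d=1)
```

Here the first point already matches, so checked ends up holding only (1, 4) and the remaining two points are never compared.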

Be sure to back this up with a matching GiST index.

If you have an array as input, create a CTE with unnest() on the fly:

WITH ps AS (SELECT unnest(p_array) AS p)
SELECT ...

Update according to comment

If you only need a single row as the answer, you can simplify:

WITH ps AS (SELECT unnest(p_array) AS p)
SELECT *
FROM   measurement m
WHERE  EXISTS (
   SELECT 1
   FROM   ps
   WHERE  st_distance(ps.p, m.groundtruth) < d
   )
LIMIT  1;

Faster with `ST_DWithin()`

The function ST_DWithin() (with a matching GiST index!) is probably more efficient. To get one row (using a subselect instead of a CTE here):

SELECT *
FROM   measurement m
JOIN  (SELECT unnest(p_array) AS p) ps ON ST_DWithin(ps.p, m.groundtruth, d)
LIMIT  1;

To get one row for every point p within distance d:

SELECT DISTINCT ON (ps.p) *
FROM   measurement m
JOIN  (SELECT unnest(p_array) AS p) ps ON ST_DWithin(ps.p, m.groundtruth, d);

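Adding ORDER BY random() makes this query more expensive. Without random(), Postgres can just pick the first matching row from the GiST index. Otherwise all possible matches have to be retrieved and ordered randomly.

BTW, LIMIT 1 inside EXISTS is pointless. Read the manual at the link I provided or this related question.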

Answer 2

I have cracked it now, but the query is pretty slow...

WITH
  ps AS (
    SELECT unnest(p_array) AS p
  ),

  gtps AS (
    SELECT DISTINCT ON(ps.p)
      ps.p, m.groundtruth
    FROM measurement m, ps
    WHERE st_distance(m.groundtruth, ps.p) < d
    ORDER BY ps.p, RANDOM()
  )

SELECT DISTINCT ON(gtps.p, gtps.groundtruth, m.anchor_id)
  m.id, m.anchor_id, gtps.groundtruth, gtps.p
FROM measurement m, gtps  -- no join condition between m and gtps, so this is a cross join as posted
ORDER BY gtps.p, gtps.groundtruth, m.anchor_id, RANDOM()

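My test database contains 22000 rows; with two input values the query takes about 700 ms. In the end there can be hundreds of input values :-/

The result now looks like this: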

id  | anchor_id | groundtruth | p
-----------------------------------------
20  | 1         | POINT(0 2)  | POINT(1 0)
14  | 3         | POINT(0 2)  | POINT(1 0)
5   | 8         | POINT(0 2)  | POINT(1 0)
42  | 2         | POINT(4 1)  | POINT(2 2)
11  | 3         | POINT(4 8)  | POINT(4 8)
4   | 6         | POINT(4 8)  | POINT(4 8)
1   | 1         | POINT(6 2)  | POINT(7 3)
9   | 5         | POINT(6 2)  | POINT(7 3)
25  | 3         | POINT(6 2)  | POINT(9 1)
13  | 6         | POINT(6 2)  | POINT(9 1)
18  | 7         | POINT(6 2)  | POINT(9 1)

NEW:

SELECT
  m.groundtruth, ps.p, ARRAY_AGG(m.anchor_id), ARRAY_AGG(m.id)
FROM
  measurement m
JOIN
  (SELECT unnest(point_array) AS p) AS ps
  ON ST_DWithin(ps.p, m.groundtruth, 0.5)
GROUP BY groundtruth, ps.p

Actual result:

p           | groundtruth | anchor_arr | id_arr
--------------------------------------------------
P1          | G1          | {1,3,2,..} | {9,8,11,..}
P1          | G2          | {4,3,5,..} | {1,8,23,..}
P1          | G3          | {6,8,9,..} | {12,7,6,..}
P2          | G1          | {6,6,2,..} | {15,2,10,..}
P2          | G4          | {7,9,1,..} | {5,4,3,..}
...         | ...         | ...        | ...

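So now I get:

every distinct inputValue-groundtruth tuple
for every tuple, an array with all anchor_ids corresponding to the groundtruth part of the tuple
and an array of all ids corresponding to the groundtruth-anchor_id relation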

Remember:

two input values can "select" the same groundtruth
a single groundtruth value can have multiple identical anchor_ids
every groundtruth-anchor_id tuple has a distinct id

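So what is still missing?

I only need one row for each ps.p
The two arrays belong together. That means: the order of the items inside them matters!
The two arrays need to be filtered (randomly):
- For each anchor_id that appears more than once in the array: keep a random one and delete all the others. This also means removing the corresponding id from the id array for every removed anchor_id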
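The array filtering described above could look like the following if done in application code. This is a rough sketch only: dedupe_parallel is a hypothetical helper, and the input is just the visible prefix of the P2/G1 row from the result table (the elided values are left out).

```python
import random
from collections import defaultdict

def dedupe_parallel(anchor_arr, id_arr, rng=random):
    """Hypothetical helper: keep one random id per distinct anchor_id."""
    if len(anchor_arr) != len(id_arr):
        raise ValueError("arrays must have the same length")
    ids_by_anchor = defaultdict(list)
    # zip() pairs the arrays position by position, which is why their order matters.
    for anchor_id, row_id in zip(anchor_arr, id_arr):
        ids_by_anchor[anchor_id].append(row_id)
    # dicts keep insertion order, so anchors stay in first-seen order.
    anchors = list(ids_by_anchor)
    ids = [rng.choice(ids_by_anchor[a]) for a in anchors]
    return anchors, ids

anchors, ids = dedupe_parallel([6, 6, 2], [15, 2, 10])
```

For this input the anchors collapse to [6, 2]. The id kept for anchor 6 is randomly 15 or 2, and the id for anchor 2 is 10.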

Select a random entry from a group after grouping by a value (not a column)?
