Question

I want to use Postgres to do some basic geocoding of addresses. I have an addresses table with around 1 million raw address strings:

=> \d addresses
  Table "public.addresses"
 Column  | Type | Modifiers
---------+------+-----------
 address | text |

I also have a table of location data:

=> \d locations
   Table "public.locations"
   Column   | Type | Modifiers
------------+------+-----------
 id         | text |
 country    | text |
 postalcode | text |
 latitude   | text |
 longitude  | text |

Most of the address strings contain a postal code, so my first attempt was a LIKE match inside a lateral join:

EXPLAIN SELECT * FROM addresses a
JOIN LATERAL (
    SELECT * FROM locations
    WHERE address ilike '%' || postalcode || '%'
    ORDER BY LENGTH(postalcode) DESC
    LIMIT 1
) AS l ON true;

This gives the expected result, but it is slow. Here is the query plan:

                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Nested Loop  (cost=18383.07..18540688323.77 rows=1008572 width=91)
   ->  Seq Scan on addresses a  (cost=0.00..20997.72 rows=1008572 width=56)
   ->  Limit  (cost=18383.07..18383.07 rows=1 width=35)
         ->  Sort  (cost=18383.07..18391.93 rows=3547 width=35)
               Sort Key: (length(locations.postalcode))
               ->  Seq Scan on locations  (cost=0.00..18365.33 rows=3547 width=35)
                     Filter: (a.address ~~* (('%'::text || postalcode) || '%'::text))

I tried adding a trigram index on the address column, as described at https://stackoverflow.com/a/13452528/36191, but the query plan for the query above does not use it, and the plan is unchanged.

CREATE INDEX idx_address ON addresses USING gin (address gin_trgm_ops);

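I have to remove the ORDER BY and LIMIT from the lateral join query for the index to be used, and that does not give me the results I want. Here is the query plan for the query without ORDER BY or LIMIT: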

                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Nested Loop  (cost=39.35..129156073.06 rows=3577682241 width=86)
   ->  Seq Scan on locations  (cost=0.00..12498.55 rows=709455 width=28)
   ->  Bitmap Heap Scan on addresses a  (cost=39.35..131.60 rows=5043 width=58)
         Recheck Cond: (address ~~* (('%'::text || locations.postalcode) || '%'::text))
         ->  Bitmap Index Scan on idx_address  (cost=0.00..38.09 rows=5043 width=0)
               Index Cond: (address ~~* (('%'::text || locations.postalcode) || '%'::text))

Is there something I can do to get the query to use the index, or is there a better way to rewrite this query?

Answer 1

Why?

The query cannot use the index on principle. You would need an index on the table locations, but the one you have is on the table addresses.

You can verify my claim by setting:

SET enable_seqscan = off;

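(In your session only, and for debugging only. Never use it in production.) It is not that the index would be more expensive than a sequential scan; there is simply no way for Postgres to use it for your query at all.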
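A debugging session along these lines might look like the sketch below. It reuses the query from the question and restores the default setting afterwards. If the planner still reports sequential scans even though they are heavily penalized, that suggests it has no applicable index to fall back on.

```sql
-- Debugging only: discourage sequential scans for the current session.
SET enable_seqscan = off;

EXPLAIN
SELECT *
FROM   addresses a
JOIN   LATERAL (
   SELECT *
   FROM   locations
   WHERE  a.address ILIKE '%' || postalcode || '%'
   ORDER  BY length(postalcode) DESC
   LIMIT  1
   ) l ON true;

-- Restore the default planner setting.
RESET enable_seqscan;
```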

Aside: [INNER] JOIN ... ON true is just an awkward way of saying CROSS JOIN ...

Why is the index used after removing `ORDER` and `LIMIT`?

Because Postgres can rewrite this simple form to:

SELECT *
FROM   addresses a
JOIN   locations l ON a.address ILIKE '%' || l.postalcode || '%';

You will see the exact same query plan. (At least I did in my tests on Postgres 9.5.)

Solution

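You need an index on locations.postalcode. With LIKE or ILIKE you would also need to bring the indexed expression (postalcode) to the left side of the operator. ILIKE is implemented with the operator ~~*, and this operator has no COMMUTATOR (a logical necessity, since pattern and string are not interchangeable), so the operands cannot be flipped.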
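One way to inspect the commutator claim is to look at the system catalog. This is a minimal sketch, assuming the pg_trgm extension is installed so that its % and <-> operators for text are present; the query itself is an illustration and not part of the original answer.

```sql
-- List the registered commutator for each operator on (text, text).
-- oprcom is 0 when no commutator is registered for the operator.
SELECT oprname,
       oprcom::regoperator AS commutator
FROM   pg_operator
WHERE  oprname IN ('~~*', '%', '<->')
AND    oprleft  = 'text'::regtype
AND    oprright = 'text'::regtype;
```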

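A solution is to use the trigram similarity operator % or its inverse, the distance operator <->, in a nearest-neighbour query instead (each is its own commutator, so the operands can switch places freely):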

SELECT *
FROM   addresses a
JOIN   LATERAL (
   SELECT *
   FROM   locations
   ORDER  BY postalcode <-> a.address
   LIMIT  1
   ) l ON address ILIKE '%' || postalcode || '%';

This finds the most similar postalcode for each address, and then checks whether that postalcode actually matches in full.

This way, a longer postalcode is preferred automatically, since it is more similar (smaller distance) than a shorter postalcode that also matches.
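To see how the distance ranks candidates, one can compare two postal codes against a single address. This is a sketch with made-up values, assuming pg_trgm is installed; one would expect the longer, fully contained code to sort first, but it is worth checking against real data.

```sql
-- Hypothetical sample values, for illustration only.
SELECT pc,
       similarity(pc, addr) AS sim,
       pc <-> addr          AS dist   -- distance is one minus similarity
FROM  (VALUES ('123'), ('12345')) AS p(pc)
CROSS  JOIN (VALUES ('10 Main Street, Springfield 12345')) AS a(addr)
ORDER  BY dist;
```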

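Some uncertainty remains. Depending on the possible postal codes, there could be false positives caused by matching trigrams in other parts of the string. There is not enough information in the question to say more.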

Here, [INNER] JOIN instead of CROSS JOIN makes sense, since we add an actual join condition.

The manual:

This can be implemented quite efficiently by GiST indexes, but not by GIN indexes.

So:

CREATE INDEX locations_postalcode_trgm_gist_idx ON locations
USING gist (postalcode gist_trgm_ops);
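The gist_trgm_ops operator class comes from the pg_trgm extension, so the extension has to exist in the database before the index can be created. A possible way to check that the new index is picked up is to run EXPLAIN on a small sample first; the sample size of 100 below is an arbitrary choice for illustration.

```sql
-- Make sure the extension that provides gist_trgm_ops is available.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Refresh planner statistics after building the index.
ANALYZE locations;

-- Inspect the plan on a small sample of addresses before the full run.
EXPLAIN
SELECT *
FROM  (SELECT address FROM addresses LIMIT 100) a
JOIN   LATERAL (
   SELECT *
   FROM   locations
   ORDER  BY postalcode <-> a.address
   LIMIT  1
   ) l ON a.address ILIKE '%' || l.postalcode || '%';
```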

Answer 2

This is a long shot, but how does the following alternative perform?

In [6]: [d for d in range(1,31) if 
         is_business_day(datetime.date(2017,5,d))][9-1]
Out[6]: 15

In [7]: x=9

In [8]: [d for d in range(1,31) if 
         is_business_day(datetime.date(2017,5,d))][x-1]
Out[8]: 15

Answer 3

It can work if you turn the lateral join inside out. But even then it might still be slow:

SELECT DISTINCT ON (address) *
FROM (
    SELECT * 
    FROM locations
       ,LATERAL(
           SELECT * FROM addresses
           WHERE address ilike '%' || postalcode || '%'
           OFFSET 0 -- force fencing, might be redundant
        ) a
) q
ORDER BY address, LENGTH(postalcode) DESC

The downside is that you can implement paging only on postal codes, not on addresses.

LATERAL JOIN not using trigram index
