Question

I have a MySQL InnoDB database that I access from a Django server. I have this table:

+--------------+--------------+------+-----+---------+----------------+
| Field        | Type         | Null | Key | Default | Extra          |
+--------------+--------------+------+-----+---------+----------------+
| id           | int(11)      | NO   | PRI | NULL    | auto_increment |
| areasymbol   | varchar(255) | NO   |     | NULL    |                |
| spatialver   | int(11)      | YES  |     | NULL    |                |
| lkey         | int(11)      | YES  |     | NULL    |                |
| musym        | varchar(255) | NO   |     | NULL    |                |
| mukey        | int(11)      | YES  |     | NULL    |                |
| featsym      | varchar(255) | NO   |     | NULL    |                |
| featkey      | int(11)      | YES  |     | NULL    |                |
| north        | double       | YES  | MUL | NULL    |                |
| south        | double       | YES  | MUL | NULL    |                |
| east         | double       | YES  | MUL | NULL    |                |
| west         | double       | YES  | MUL | NULL    |                |
| soil_type_id | int(11)      | YES  | MUL | NULL    |                |
+--------------+--------------+------+-----+---------+----------------+

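The table currently contains about 7-8 million rows, and I expect it to be at least 3 times that size by the time I'm done. It is a static table: we run an import every once in a while to add content, but nothing is ever modified or deleted. These are its indexes: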

+-----------------+------------+----------------------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table           | Non_unique | Key_name                         | Seq_in_index | Column_name  | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-----------------+------------+----------------------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| soil_soilregion |          0 | PRIMARY                          |            1 | id           | A         |     7657769 |     NULL | NULL   |      | BTREE      |         |               |
| soil_soilregion |          1 | soil_soilregion_e733fdfc         |            1 | soil_type_id | A         |          15 |     NULL | NULL   | YES  | BTREE      |         |               |
| soil_soilregion |          1 | north_soilregion                 |            1 | north        | A         |     7657769 |     NULL | NULL   | YES  | BTREE      |         |               |
| soil_soilregion |          1 | south_soilregion                 |            1 | south        | A         |     7657769 |     NULL | NULL   | YES  | BTREE      |         |               |
| soil_soilregion |          1 | east_soilregion                  |            1 | east         | A         |     7657769 |     NULL | NULL   | YES  | BTREE      |         |               |
| soil_soilregion |          1 | west_soilregion                  |            1 | west         | A         |     7657769 |     NULL | NULL   | YES  | BTREE      |         |               |
| soil_soilregion |          1 | north_south_east_west_soilregion |            1 | north        | A         |     7657769 |     NULL | NULL   | YES  | BTREE      |         |               |
| soil_soilregion |          1 | north_south_east_west_soilregion |            2 | south        | A         |     7657769 |     NULL | NULL   | YES  | BTREE      |         |               |
| soil_soilregion |          1 | north_south_east_west_soilregion |            3 | east         | A         |     7657769 |     NULL | NULL   | YES  | BTREE      |         |               |
| soil_soilregion |          1 | north_south_east_west_soilregion |            4 | west         | A         |     7657769 |     NULL | NULL   | YES  | BTREE      |         |               |
+-----------------+------------+----------------------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+

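I have a bounding box given by north/south/east/west coordinates, and I'm looking for any regions that might overlap that box.

When I run this query against the database: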

select *
from soil_soilregion
where east > -86.8379775155 AND north > 40.3782334957 AND
      south < 40.3817576747 AND west < -86.8240119179;

It takes about 10 seconds, which is unacceptable. When I run EXPLAIN on it, this is what it tells me:

+----+-------------+-----------------+------+----------------------------------------------------------------------------------------------------+------+---------+------+---------+-------------+
| id | select_type | table           | type | possible_keys                                                                                      | key  | key_len | ref  | rows    | Extra       |
+----+-------------+-----------------+------+----------------------------------------------------------------------------------------------------+------+---------+------+---------+-------------+
|  1 | SIMPLE      | soil_soilregion | ALL  | north_soilregion,south_soilregion,east_soilregion,west_soilregion,north_south_east_west_soilregion | NULL | NULL    | NULL | 7657769 | Using where |
+----+-------------+-----------------+------+----------------------------------------------------------------------------------------------------+------+---------+------+---------+-------------+

When I run this query against the database:

select *
from soil_soilregion
where east > -86.8379775155 AND east < -85.8379775155 AND
      north > 40.3782334957 AND north < 41.3782334957 AND
      south < 40.3817576747 AND south > 39.3817576747 AND
      west < -86.8240119179 AND west > -87.8040119189;

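It takes 6-7 seconds. That's better, but still not optimal. This query still works because no region is more than 1 high or wide (so I give it a maximum distance of 1 in each direction).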
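The bounding idea can be sketched in a few lines of Python. This is only an illustration under the stated assumption that no stored region exceeds `max_extent` in either direction; `Box`, `overlap_ranges` and `overlaps` are hypothetical names, not part of the schema above. The bound that comes from the overlap test itself is exact, and the opposite bound is the query box's far edge shifted by `max_extent`.

```python
from typing import Dict, NamedTuple, Tuple


class Box(NamedTuple):
    north: float
    south: float
    east: float
    west: float


def overlap_ranges(query: Box, max_extent: float) -> Dict[str, Tuple[float, float]]:
    """Open (low, high) interval for each edge column of a stored region.

    One end of each interval is the plain overlap inequality; the other end
    is only valid if no region is wider or taller than max_extent.
    """
    return {
        "east": (query.west, query.east + max_extent),
        "west": (query.west - max_extent, query.east),
        "north": (query.south, query.north + max_extent),
        "south": (query.south - max_extent, query.north),
    }


def overlaps(region: Box, query: Box) -> bool:
    """Plain four-inequality overlap test, as in the first query."""
    return (region.east > query.west and region.west < query.east
            and region.north > query.south and region.south < query.north)
```

Any region that passes `overlaps` and respects the size limit falls inside all four intervals, so the extra bounds narrow the search without dropping matches.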

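I have a few questions:

Why doesn't the first query use an index? (I assume it is because there are too many potential rows in that range.)
Why does it never use my composite index? Isn't that the most optimal one?
Is there anything I can do to improve this query or the indexes?

Note: Using FORCE INDEX has only had a negative effect.

Thanks!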

Edit 1: As suggested, I changed the query to use the same column order as the composite index, and this is what I got:

explain select * from soil_soilregion  where north > 40.3782334957 AND south < 40.3817576747 AND east > -86.8379775155 AND west < -86.8240119179; 
+----+-------------+-----------------+------+----------------------------------------------------------------------------------------------------+------+---------+------+---------+-------------+
| id | select_type | table           | type | possible_keys                                                                                      | key  | key_len | ref  | rows    | Extra       |
+----+-------------+-----------------+------+----------------------------------------------------------------------------------------------------+------+---------+------+---------+-------------+
|  1 | SIMPLE      | soil_soilregion | ALL  | north_soilregion,south_soilregion,east_soilregion,west_soilregion,north_south_east_west_soilregion | NULL | NULL    | NULL | 7657769 | Using where |
+----+-------------+-----------------+------+----------------------------------------------------------------------------------------------------+------+---------+------+---------+-------------+

Answer 1

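The problem with your query is the inequalities. Alas, these limit the use of an index: each index lookup can use at most one inequality.

The data structure you need to solve this problem is a multi-dimensional index. In SQL databases, this is usually provided by GIS extensions, which are documented here.

In the absence of such extensions, you can try to be arcanely clever. I can think of a way to solve this, but it makes the table and the query more complicated. Add a new integer column for each of east and north: easti and northi. Then build an index on (easti, northi), and write the query as: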

select *
from ((select sr.*
       from soil_soilregion sr
       where easti = -86 and northi in (40, 41)
      ) union all
      (select sr.*
       from soil_soilregion sr
       where easti = -85 and northi in (40, 41)
      ) 
     ) sr
where east > -86.8379775155 AND north > 40.3782334957 AND
      south < 40.3817576747 AND west < -86.8240119179;

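The subqueries pull in everything within a relatively small box, which the outer query then filters. The subqueries should use the index, so this should be quite fast.

Given the size of the boxes you are looking for, using fractions of a degree would work better than the whole degrees you get from an integer conversion.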
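Here is a minimal sketch of the bucketing idea with fractional cells, using Python's built-in sqlite3 as a stand-in for MySQL so that it runs anywhere. Everything named here is an assumption for illustration: the `region` table, the 0.25-degree `CELL`, the use of `math.floor` to assign buckets, and the helper names. `MAX_EXTENT` encodes the question's statement that no region is more than 1 high or wide.

```python
import math
import sqlite3

CELL = 0.25        # bucket size in degrees (illustrative choice)
MAX_EXTENT = 1.0   # no region is wider or taller than this (from the question)


def bucket(value, cell=CELL):
    """Integer cell number for a coordinate (floor, so negatives are handled)."""
    return math.floor(value / cell)


def buckets_between(low, high, cell=CELL):
    """All cell numbers that a coordinate in [low, high] can fall into."""
    return list(range(bucket(low, cell), bucket(high, cell) + 1))


def build_demo():
    """In-memory table of made-up 0.5-degree squares on a regular grid."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE region (id INTEGER PRIMARY KEY, north REAL, south REAL,"
        " east REAL, west REAL, easti INTEGER, northi INTEGER)"
    )
    conn.execute("CREATE INDEX region_cells ON region (easti, northi)")
    rows = []
    for i in range(40):
        for j in range(40):
            west, south = -90.0 + i * 0.3, 38.0 + j * 0.3
            east, north = west + 0.5, south + 0.5
            rows.append((north, south, east, west, bucket(east), bucket(north)))
    conn.executemany(
        "INSERT INTO region (north, south, east, west, easti, northi)"
        " VALUES (?, ?, ?, ?, ?, ?)",
        rows,
    )
    return conn


def find_overlaps(conn, north, south, east, west):
    """Equality lookups on the cell columns first, exact overlap test second."""
    # A matching region has its east edge in (west, east + MAX_EXTENT) and
    # its north edge in (south, north + MAX_EXTENT); probe only those cells.
    east_cells = buckets_between(west, east + MAX_EXTENT)
    north_cells = buckets_between(south, north + MAX_EXTENT)
    sql = (
        "SELECT id FROM region "
        f"WHERE easti IN ({','.join('?' * len(east_cells))}) "
        f"AND northi IN ({','.join('?' * len(north_cells))}) "
        "AND east > ? AND north > ? AND south < ? AND west < ?"
    )
    params = east_cells + north_cells + [west, south, north, east]
    return sorted(row[0] for row in conn.execute(sql, params))
```

Smaller cells mean fewer candidate rows per cell but more cells to list in the IN clauses, so the cell size is a trade-off to measure against real data.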

Answer 2

A short-term, but partial, fix is to have a "covering index". That is, make an index that has the bounding box, plus the id (and maybe the soil type?). Then do this:

SELECT b.*
    FROM (
        SELECT id FROM soil_soilregion
            WHERE east... AND west ... AND ...
         ) AS a
    JOIN soil_soilregion AS b ON b.id = a.id;

This is likely to speed up the query because:

The index is all that is needed in the subquery.
The index is smaller than the data.
When the subquery is finished, it has a short list of ids, which are easily and quickly looked up in the real table (via the JOIN).
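The shape of that query can be tried out with Python's built-in sqlite3. This is a sketch only: SQLite's optimizer is not MySQL's, and the `region` table, the `region_bbox` index, the sample rows and the helper names are all made up for illustration. In SQLite the integer primary key is stored in every index entry, and InnoDB secondary indexes likewise carry the primary key, so an index on the four edge columns is enough to answer the inner `SELECT id`.

```python
import sqlite3

BBOX_WHERE = "east > ? AND north > ? AND south < ? AND west < ?"

# Inner query touches only indexed columns plus id; outer join fetches full rows.
DEFERRED_JOIN = f"""
    SELECT b.*
    FROM (SELECT id FROM region WHERE {BBOX_WHERE}) AS a
    JOIN region AS b ON b.id = a.id
    ORDER BY b.id
"""


def make_demo_db():
    """Tiny in-memory table with hand-written sample regions."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE region (id INTEGER PRIMARY KEY, musym TEXT,"
        " north REAL, south REAL, east REAL, west REAL)"
    )
    conn.execute("CREATE INDEX region_bbox ON region (east, north, south, west)")
    rows = [
        ("A", 40.9, 40.1, -86.5, -87.2),       # overlaps the sample box
        ("B", 42.0, 41.5, -86.5, -87.2),       # too far north
        ("C", 40.9, 40.1, -85.0, -85.9),       # too far east
        ("D", 40.39, 40.37, -86.82, -86.83),   # overlaps the sample box
    ]
    conn.executemany(
        "INSERT INTO region (musym, north, south, east, west) VALUES (?, ?, ?, ?, ?)",
        rows,
    )
    return conn


def deferred_lookup(conn, north, south, east, west):
    """Run the two-step query for a bounding box and return full rows."""
    return conn.execute(DEFERRED_JOIN, (west, south, north, east)).fetchall()


def query_plan(conn):
    """Plan text as reported by SQLite; wording differs between versions."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + DEFERRED_JOIN, (0, 0, 0, 0))
    return [row[-1] for row in rows]
```

Printing `query_plan(make_demo_db())` shows whether the index is reported as covering on a given SQLite build; on MySQL the equivalent check is "Using index" in the Extra column of EXPLAIN.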

Some of your 'why' questions:

The individual indexes merely eliminate some fraction of the 7M rows (as in "everything east of here"). That does not help much. Furthermore, when an index is that 'useless', it is not used -- it is faster to simply scan the table.
The compound index (north-south...) does not do any better. This is because it starts with a range test on north and can't get past that.
The second attempt 'seemed' to be faster -- this could be because of caching, not because of it being any better.
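The first point can be illustrated with made-up data. The sketch below scatters synthetic 0.5-degree squares over an arbitrary area (nothing here comes from the real table) and measures what fraction of rows each inequality keeps on its own, compared with all four together.

```python
import random


def synthetic_regions(n, seed=0):
    """Made-up 0.5-degree squares scattered over a 15 x 10 degree area."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        west = rng.uniform(-95.0, -80.0)
        south = rng.uniform(35.0, 45.0)
        rows.append({"west": west, "east": west + 0.5,
                     "south": south, "north": south + 0.5})
    return rows


def selectivity(rows, q):
    """Fraction of rows kept by each inequality alone and by all four combined."""
    tests = {
        "east": lambda r: r["east"] > q["west"],
        "north": lambda r: r["north"] > q["south"],
        "south": lambda r: r["south"] < q["north"],
        "west": lambda r: r["west"] < q["east"],
    }
    result = {name: sum(map(test, rows)) / len(rows)
              for name, test in tests.items()}
    result["all"] = sum(all(t(r) for t in tests.values()) for r in rows) / len(rows)
    return result
```

With a query box near the middle of the area, each single inequality keeps a large share of the rows while the combination keeps very few, which is the situation in which a single-column B-tree range scan costs more than reading the table.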

Solutions?...

Plan A: Spatial index as Gordon mentioned.

Plan B: Restructure the data to work with a pseudo-2D indexing method described in my "find the nearest pizza parlors" blog. A problem: I have not thought through how to adapt for "overlapping" instead of "nearest".

Why is my query suboptimal?
