Improve MySQL query speed - 150,000+ rows returned makes the query slow

Time: 2019-03-22 14:34:52

Tags: mysql performance query-optimization

Hi, I currently have a query that takes 11 (sec) to run. I have a report that is displayed on a website which runs 4 different queries, each of which takes 11 (sec) to run. I really don't want the customer to have to wait a minute for all of these queries to run and display the data.

I am using 4 different AJAX requests to call an API to get the data I need; these all fire off at once, but the queries still run one after another. If there were a way to run all the queries at once (in parallel), so the total load time is only 11 (sec), that would also solve my problem, but I don't think that is possible.

This is the query I am running:

SELECT device_uuid,
     day_epoch,
     is_repeat
FROM tracking_daily_stats_zone_unique_device_uuids_per_hour
WHERE day_epoch >= 1552435200
AND day_epoch < 1553040000
AND venue_id = 46
AND zone_id IN (102,105,108,110,111,113,116,117,118,121,287)

I cannot think of anything at all to speed this query up; below are pictures of the table's indexes and the explain statement for this query.

[image: indexes]

[image: explain statement]

I believe the query above is using the relevant indexes for the WHERE conditions.

If there is anything you can think of to speed this query up, please let me know; I have been working on it for 3 days and can't seem to figure out the problem. It would be great to get the query time down to 5 (sec) at most. If I am wrong about the AJAX issue, please let me know, as that would also solve my problem.

EDIT

I have come across something quite strange that may be causing the problem. When I change the day_epoch range to something smaller (5th - 9th), which returns 130,000 rows, the query takes 0.7 (sec), but when I add one more day onto that range (5th - 10th) and over 150,000 rows are returned, the query takes 13 (sec). I have run loads of different ranges and have concluded that if the amount of rows returned is over 150,000 it has a huge effect on the query time.

Table definition -

CREATE TABLE `tracking_daily_stats_zone_unique_device_uuids_per_hour` (
 `id` int(11) NOT NULL AUTO_INCREMENT,
 `day_epoch` int(10) NOT NULL,
 `day_of_week` tinyint(1) NOT NULL COMMENT 'day of week, monday = 1',
 `hour` int(2) NOT NULL,
 `venue_id` int(5) NOT NULL,
 `zone_id` int(5) NOT NULL,
 `device_uuid` binary(16) NOT NULL COMMENT 'binary representation of the device_uuid, unique for a single day',
 `device_vendor_id` int(5) unsigned NOT NULL DEFAULT '0' COMMENT 'id of the device vendor',
 `first_seen` int(10) unsigned NOT NULL DEFAULT '0',
 `last_seen` int(10) unsigned NOT NULL DEFAULT '0',
 `is_repeat` tinyint(1) NOT NULL COMMENT 'is the device a repeat for this day?',
 `prev_last_seen` int(10) NOT NULL DEFAULT '0' COMMENT 'previous last seen ts',
 PRIMARY KEY (`id`,`venue_id`) USING BTREE,
 KEY `venue_id` (`venue_id`),
 KEY `zone_id` (`zone_id`),
 KEY `day_of_week` (`day_of_week`),
 KEY `day_epoch` (`day_epoch`),
 KEY `hour` (`hour`),
 KEY `device_uuid` (`device_uuid`),
 KEY `is_repeat` (`is_repeat`),
 KEY `device_vendor_id` (`device_vendor_id`)
) ENGINE=InnoDB AUTO_INCREMENT=450967720 DEFAULT CHARSET=utf8
/*!50100 PARTITION BY HASH (venue_id)
PARTITIONS 100 */

3 Answers:

Answer 0 (score: 1)

The straightforward solution is to add this query-specific index to the table:

ALTER TABLE tracking_daily_stats_zone_unique_device_uuids_per_hour 
ADD INDEX complex_idx (`venue_id`, `day_epoch`, `zone_id`)

Warning: this schema change may take a while on a large DB.

Then force it when you run the query:

SELECT device_uuid,
     day_epoch,
     is_repeat
FROM tracking_daily_stats_zone_unique_device_uuids_per_hour
USE INDEX (complex_idx)
WHERE day_epoch >= 1552435200
AND day_epoch < 1553040000
AND venue_id = 46
AND zone_id IN (102,105,108,110,111,113,116,117,118,121,287)

It is definitely not universal, but should work for this particular query.

UPDATE: Since the table is partitioned, you can also profit from forcing a specific PARTITION. In our case, since it is partitioned by venue_id, just force it:

SELECT device_uuid,
     day_epoch,
     is_repeat
FROM tracking_daily_stats_zone_unique_device_uuids_per_hour
PARTITION (`p46`)
WHERE day_epoch >= 1552435200
AND day_epoch < 1553040000
AND zone_id IN (102,105,108,110,111,113,116,117,118,121,287)

where p46 is the string p concatenated with the venue_id = 46 value (with PARTITION BY HASH (venue_id) and 100 partitions, venue 46 lands in partition MOD(46, 100) = 46, hence p46).

And here is another trick: you can remove AND venue_id = 46 from the WHERE clause, because there is no other data in that partition anyway.

Answer 1 (score: 0)

What happens if you change the order of the conditions? Put venue_id = ? first. The order matters.

Right now it first checks all rows for:
- day_epoch >= 1552435200
- then, on what remains, day_epoch < 1553040000
- then, on what remains, venue_id = 46
- then, on what remains, zone_id IN (102,105,108,110,111,113,116,117,118,121,287)

When working on heavy queries, you should always try to make the first "selector" the most effective. You can do that by using a proper index for 1 (or a combination of) column(s), and by making sure the first selector narrows down the result the most (at least for integers; in case of strings you need another tactic).


Sometimes a query is just slow. When you have a lot of data (and/or not enough resources) you really can't do anything about that. That is where you need another solution: make a summary table. I doubt you are showing 150,000 rows x4 to your visitor. You can aggregate it e.g. per hour or every few minutes and select from that smaller table.
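A minimal sketch of such a summary table (the table and column names here are hypothetical, not part of the original schema), refreshed e.g. one day at a time by a scheduled job:

```sql
CREATE TABLE summary_zone_counts (
  day_epoch INT UNSIGNED NOT NULL,
  venue_id  INT UNSIGNED NOT NULL,
  zone_id   INT UNSIGNED NOT NULL,
  devices   INT UNSIGNED NOT NULL,  -- device rows seen that day
  repeats   INT UNSIGNED NOT NULL,  -- of which repeats
  PRIMARY KEY (venue_id, day_epoch, zone_id)
);

-- Re-run per day; the upsert makes it safe to repeat.
INSERT INTO summary_zone_counts (day_epoch, venue_id, zone_id, devices, repeats)
SELECT day_epoch, venue_id, zone_id, COUNT(*), SUM(is_repeat)
FROM tracking_daily_stats_zone_unique_device_uuids_per_hour
WHERE day_epoch >= 1552435200 AND day_epoch < 1552521600
GROUP BY day_epoch, venue_id, zone_id
ON DUPLICATE KEY UPDATE devices = VALUES(devices), repeats = VALUES(repeats);
```

The report queries then hit a table that is orders of magnitude smaller than the 450M-row original.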


Offtopic: putting an index on everything only slows you down on insert/update/delete. Index the minimum number of columns, and only the columns you actually filter on (e.g. used in a WHERE or GROUP BY).

Answer 2 (score: 0)

450M rows is rather large. So, I will discuss a variety of issues that can help.

Shrink data A big table leads to more I/O, which is the main performance killer. ('Small' tables tend to stay cached, and not have an I/O burden.)

  • Any kind of INT, even INT(2), takes 4 bytes. An "hour" can easily fit in a 1-byte TINYINT. That saves over 1GB in the data, plus a similar amount in INDEX(hour).
  • If hour and day_of_week can be derived, don't bother having them as separate columns. This will save more space.
  • Some reason to use a 4-byte day_epoch instead of a 3-byte DATE? Or perhaps you do need a 5-byte DATETIME or TIMESTAMP.
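As a sketch of that shrinking (assuming hour fits 0-23, day_of_week fits 1-7, and venue/zone ids stay below 65,535 — verify against your actual data before running this):

```sql
ALTER TABLE tracking_daily_stats_zone_unique_device_uuids_per_hour
  MODIFY `hour` TINYINT UNSIGNED NOT NULL,         -- 1 byte instead of 4
  MODIFY `day_of_week` TINYINT UNSIGNED NOT NULL
    COMMENT 'day of week, monday = 1',
  MODIFY `venue_id` SMALLINT UNSIGNED NOT NULL,    -- 2 bytes, if ids fit
  MODIFY `zone_id` SMALLINT UNSIGNED NOT NULL;
```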

Optimal INDEX (take #1)

If it is always a single venue_id, then this is a good first cut at the optimal index:

INDEX(venue_id, zone_id, day_epoch)

First is the constant, then the IN, then a range. The Optimizer does well with this in many cases. (It is unclear whether the number of items in an IN clause can lead to inefficiencies.)

Better Primary Key (better index)

With AUTO_INCREMENT, there is probably no good reason to include columns after the auto_inc column in the PK. That is, PRIMARY KEY(id, venue_id) is no better than PRIMARY KEY(id).

InnoDB orders the data's BTree according to the PRIMARY KEY. So, if you are fetching several rows and can arrange for them to be adjacent to each other based on the PK, you get extra performance. (cf "Clustered".) So:

PRIMARY KEY(venue_id, zone_id, day_epoch,  -- this order, as discussed above;
            id)    -- to make sure that the entire PK is unique.
INDEX(id)      -- to keep AUTO_INCREMENT happy

And, I agree with DROPping any indexes that are not in use, including the one I recommended above. It is rarely useful to index flags (is_repeat).
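Put together, the restructuring might look like this sketch (it assumes the dropped indexes really are unused — verify against your workload first, and expect rebuilding the clustered index of a 450M-row table to take a long time; try it on a copy):

```sql
ALTER TABLE tracking_daily_stats_zone_unique_device_uuids_per_hour
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (venue_id, zone_id, day_epoch, id),
  ADD INDEX (id),              -- keeps AUTO_INCREMENT working
  DROP INDEX `is_repeat`,      -- flags are rarely worth indexing
  DROP INDEX `hour`,
  DROP INDEX `day_of_week`;
```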

UUID

Indexing a UUID can be deadly for performance once the table is really big. This is because of the randomness of UUIDs/GUIDs, leading to ever-increasing I/O burden to insert new entries in the index.

Multi-dimensional

Assuming day_epoch is sometimes multiple days, you seem to have 2 or 3 "dimensions":

  • A date range
  • A list of zones
  • A venue.

INDEXes are 1-dimensional. Therein lies the problem. However, PARTITIONing can sometimes help. I discuss this briefly as "case 2" in http://mysql.rjweb.org/doc.php/partitionmaint .

There is no good way to get 3 dimensions, so let's focus on 2.

  • You should partition on something that is a "range", such as day_epoch or zone_id.
  • After that, you should decide what to put in the PRIMARY KEY so that you can further take advantage of "clustering".

Plan A: This assumes you are searching for only one venue_id at a time:

PARTITION BY RANGE(day_epoch)  -- see note below

PRIMARY KEY(venue_id, zone_id, id)
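As DDL, Plan A could look like the sketch below (the table name and partition boundaries are invented, and only a few of the original columns are shown). One caveat: MySQL requires the partitioning column to be part of every unique key, so day_epoch has to be included in the PRIMARY KEY here:

```sql
CREATE TABLE tracking_stats_by_day (          -- hypothetical name
  id          INT UNSIGNED NOT NULL AUTO_INCREMENT,
  day_epoch   INT UNSIGNED NOT NULL,
  venue_id    INT UNSIGNED NOT NULL,
  zone_id     INT UNSIGNED NOT NULL,
  device_uuid BINARY(16) NOT NULL,
  is_repeat   TINYINT NOT NULL,
  PRIMARY KEY (venue_id, zone_id, day_epoch, id),
  KEY (id)                                    -- for AUTO_INCREMENT
)
PARTITION BY RANGE (day_epoch) (
  PARTITION p20190310 VALUES LESS THAN (1552435200),
  PARTITION p20190317 VALUES LESS THAN (1553040000),
  PARTITION pmax      VALUES LESS THAN MAXVALUE
);
```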

Plan B: This assumes you sometimes search for venue_id IN (.., .., ...), hence it does not make a good first column for the PK:

Well, I don't have good advice here; so let's go with Plan A.

The RANGE expression must be numeric. Your day_epoch works fine as is. Changing to a DATE, would necessitate BY RANGE(TO_DAYS(...)), which works fine.

You should limit the number of partitions to 50. (The 81 mentioned above is not bad.) The problem is that "lots" of partitions introduces different inefficiencies; "too few" partitions leads to "why bother".

Note that almost always the optimal PK is different for a partitioned table than the equivalent non-partitioned table.

Note that I disagree with partitioning on venue_id since it is so easy to put that column at the start of the PK instead.

Analysis

Assuming you search for a single venue_id and use my suggested partitioning & PK, here's how the SELECT performs:

  1. Filter on the date range. This is likely to limit the activity to a single partition.
  2. Drill into the data's BTree for that one partition to find the one venue_id.
  3. Hopscotch through the data from there, landing on the desired zone_ids.
  4. For each, further filter based on the date.