
Time: 2019-03-22 14:34:52

Tags: mysql performance query-optimization




SELECT device_uuid
FROM tracking_daily_stats_zone_unique_device_uuids_per_hour
WHERE day_epoch >= 1552435200
AND day_epoch < 1553040000
AND venue_id = 46
AND zone_id IN (102,105,108,110,111,113,116,117,118,121,287)



explain statement






CREATE TABLE `tracking_daily_stats_zone_unique_device_uuids_per_hour` (
 `day_epoch` int(10) NOT NULL,
 `day_of_week` tinyint(1) NOT NULL COMMENT 'day of week, monday = 1',
 `hour` int(2) NOT NULL,
 `venue_id` int(5) NOT NULL,
 `zone_id` int(5) NOT NULL,
 `device_uuid` binary(16) NOT NULL COMMENT 'binary representation of the device_uuid, unique for a single day',
 `device_vendor_id` int(5) unsigned NOT NULL DEFAULT '0' COMMENT 'id of the device vendor',
 `first_seen` int(10) unsigned NOT NULL DEFAULT '0',
 `last_seen` int(10) unsigned NOT NULL DEFAULT '0',
 `is_repeat` tinyint(1) NOT NULL COMMENT 'is the device a repeat for this day?',
 `prev_last_seen` int(10) NOT NULL DEFAULT '0' COMMENT 'previous last seen ts',
 PRIMARY KEY (`id`,`venue_id`) USING BTREE,
 KEY `venue_id` (`venue_id`),
 KEY `zone_id` (`zone_id`),
 KEY `day_of_week` (`day_of_week`),
 KEY `day_epoch` (`day_epoch`),
 KEY `hour` (`hour`),
 KEY `device_uuid` (`device_uuid`),
 KEY `is_repeat` (`is_repeat`),
 KEY `device_vendor_id` (`device_vendor_id`)
/*!50100 PARTITION BY HASH (venue_id)

3 answers:

Answer 0 (score: 1)


ALTER TABLE tracking_daily_stats_zone_unique_device_uuids_per_hour 
ADD INDEX complex_idx (`venue_id`, `day_epoch`, `zone_id`)



SELECT device_uuid
FROM tracking_daily_stats_zone_unique_device_uuids_per_hour
USE INDEX (complex_idx)
WHERE day_epoch >= 1552435200
AND day_epoch < 1553040000
AND venue_id = 46
AND zone_id IN (102,105,108,110,111,113,116,117,118,121,287)



SELECT device_uuid
FROM tracking_daily_stats_zone_unique_device_uuids_per_hour PARTITION (p46)
WHERE day_epoch >= 1552435200
AND day_epoch < 1553040000
AND zone_id IN (102,105,108,110,111,113,116,117,118,121,287)

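where p46 is the string p concatenated with the venue_id, 46.

This is another trick: you can drop AND venue_id = 46 from the WHERE clause, because that partition holds no other data (assuming no other venue_id hashes into the same partition).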
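
Before dropping the venue_id filter, it may be worth checking what p46 really contains. The sketch below assumes MySQL 5.6 or later (explicit partition selection) and that the partitions carry the default HASH names p0, p1, and so on. With PARTITION BY HASH (venue_id) and N partitions, a row lands in partition MOD(venue_id, N), so several venues can share one partition.

```sql
-- List the partitions of the table with their approximate row counts.
SELECT PARTITION_NAME, PARTITION_METHOD, PARTITION_EXPRESSION, TABLE_ROWS
FROM information_schema.PARTITIONS
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'tracking_daily_stats_zone_unique_device_uuids_per_hour';

-- Which venues are actually stored in p46?
-- (p46 is an assumed partition name; take the real one from the query above.)
SELECT DISTINCT venue_id
FROM tracking_daily_stats_zone_unique_device_uuids_per_hour PARTITION (p46);
```

If the second query returns anything other than 46, the venue_id condition has to stay in the WHERE clause.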

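Answer 1 (score: 0)

What happens if you change the order of the conditions? Put venue_id = ? first. The order matters.

- day_epoch >= 1552435200
- then, on what remains, day_epoch < 1553040000
- then, on what remains, venue_id = 46
- then, on what remains, zone_id IN (102,105,108,110,111,113,116,117,118,121,287)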



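Off-topic: putting an index on every column only slows you down on INSERT/UPDATE/DELETE. Index as few columns as possible, and only the columns you actually filter on (for example, those used in WHERE or GROUP BY).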
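
One possible way to find candidates for removal is the sys schema, sketched below. It assumes MySQL 5.7.7 or later with the Performance Schema enabled; the view only reflects usage since the last server restart, so the list is a hint rather than proof. The two indexes dropped at the end are picked purely for illustration.

```sql
-- Indexes on this table that have not been used since the server started.
SELECT index_name
FROM sys.schema_unused_indexes
WHERE object_schema = DATABASE()
  AND object_name = 'tracking_daily_stats_zone_unique_device_uuids_per_hour';

-- Illustrative only: drop two single-column indexes in one table rebuild.
ALTER TABLE tracking_daily_stats_zone_unique_device_uuids_per_hour
  DROP INDEX `hour`,
  DROP INDEX `is_repeat`;
```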

Answer 2 (score: 0)

450M rows is rather large. So, I will discuss a variety of issues that can help.

Shrink data

A big table leads to more I/O, which is the main performance killer. ('Small' tables tend to stay cached, and not have an I/O burden.)

  • Any kind of INT, even INT(2), takes 4 bytes. An "hour" can easily fit in a 1-byte TINYINT. At 450M rows, that saves over 1GB in the data, plus a similar amount in INDEX(hour).
  • If hour and day_of_week can be derived, don't bother having them as separate columns. This will save more space.
  • Is there some reason to use a 4-byte day_epoch instead of a 3-byte DATE? Or perhaps you do need a 5-byte DATETIME or TIMESTAMP.
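
The points above could be tried roughly as follows. This is a sketch, not a tested migration: on a table this size the ALTER rebuilds the whole table, and deriving day_of_week from day_epoch assumes day_epoch is a Unix timestamp at the start of the day in the session time zone.

```sql
-- Current on-disk size, for a before/after comparison.
SELECT ROUND(data_length  / 1024 / 1024 / 1024, 2) AS data_gb,
       ROUND(index_length / 1024 / 1024 / 1024, 2) AS index_gb
FROM information_schema.TABLES
WHERE table_schema = DATABASE()
  AND table_name = 'tracking_daily_stats_zone_unique_device_uuids_per_hour';

-- Shrink `hour` from a 4-byte INT to a 1-byte TINYINT (0..23 fits easily).
ALTER TABLE tracking_daily_stats_zone_unique_device_uuids_per_hour
  MODIFY `hour` TINYINT UNSIGNED NOT NULL;

-- Derive day_of_week on the fly instead of storing it.
-- WEEKDAY() returns 0 for Monday, so + 1 matches the "monday = 1" convention.
SELECT WEEKDAY(FROM_UNIXTIME(1552435200)) + 1 AS day_of_week;
```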

Optimal INDEX (take #1)

If it is always a single venue_id, then this is a good first cut at the optimal index:

INDEX(venue_id, zone_id, day_epoch)

First is the constant, then the IN, then a range. The Optimizer does well with this in many cases. (It is unclear whether the number of items in an IN clause can lead to inefficiencies.)
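
A minimal sketch of adding that index and checking whether the optimizer uses all three parts. The index name venue_zone_day is made up for this example.

```sql
ALTER TABLE tracking_daily_stats_zone_unique_device_uuids_per_hour
  ADD INDEX venue_zone_day (venue_id, zone_id, day_epoch);

EXPLAIN
SELECT device_uuid
FROM tracking_daily_stats_zone_unique_device_uuids_per_hour
WHERE venue_id = 46
  AND zone_id IN (102,105,108,110,111,113,116,117,118,121,287)
  AND day_epoch >= 1552435200
  AND day_epoch < 1553040000;
```

In the EXPLAIN output, look at the key and key_len columns. All three columns are NOT NULL 4-byte integers, so a key_len of 12 would suggest the constant, the IN list and the range are all being used; 8 or 4 would suggest only a prefix of the index is.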

Better Primary Key (better index)

With AUTO_INCREMENT, there is probably no good reason to include columns after the auto_inc column in the PK. That is, PRIMARY KEY(id, venue_id) is no better than PRIMARY KEY(id).

InnoDB orders the data's BTree according to the PRIMARY KEY. So, if you are fetching several rows and can arrange for them to be adjacent to each other based on the PK, you get extra performance. (cf "Clustered".) So:

PRIMARY KEY(venue_id, zone_id, day_epoch,  -- this order, as discussed above;
            id),   -- to make sure that the entire PK is unique.
INDEX(id)      -- to keep AUTO_INCREMENT happy

And, I agree with DROPping any indexes that are not in use, including the one I recommended above. It is rarely useful to index flags (is_repeat).
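
As a single statement, the primary key change might look like the sketch below. It assumes id is the AUTO_INCREMENT column referenced by the existing PRIMARY KEY (its definition is not visible in the posted CREATE TABLE) and that the table keeps its current partitioning on venue_id. Expect a full table rebuild.

```sql
ALTER TABLE tracking_daily_stats_zone_unique_device_uuids_per_hour
  DROP PRIMARY KEY,
  -- Cluster rows by venue, then zone, then day; id keeps the key unique.
  ADD PRIMARY KEY (venue_id, zone_id, day_epoch, id),
  -- An AUTO_INCREMENT column must be the first column of some index.
  ADD INDEX (id);
```

Doing all three changes in one ALTER avoids a moment where the AUTO_INCREMENT column has no index, which MySQL would reject.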


Indexing a UUID can be deadly for performance once the table is really big. This is because of the randomness of UUIDs/GUIDs, leading to ever-increasing I/O burden to insert new entries in the index.


Assuming day_epoch is sometimes multiple days, you seem to have 2 or 3 "dimensions":

  • A date range
  • A list of zones
  • A venue.

INDEXes are 1-dimensional. Therein lies the problem. However, PARTITIONing can sometimes help. I discuss this briefly as "case 2" in http://mysql.rjweb.org/doc.php/partitionmaint .

There is no good way to get 3 dimensions, so let's focus on 2.

  • You should partition on something that is a "range", such as day_epoch or zone_id.
  • After that, you should decide what to put in the PRIMARY KEY so that you can further take advantage of "clustering".

Plan A: This assumes you are searching for only one venue_id at a time:

PARTITION BY RANGE(day_epoch)  -- see note below

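PRIMARY KEY(venue_id, zone_id, id, day_epoch)  -- day_epoch appended because MySQL requires the partitioning column in every unique key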

Plan B: This assumes you sometimes search for venue_id IN (.., .., ...), hence it does not make a good first column for the PK:

Well, I don't have good advice here; so let's go with Plan A.

The RANGE expression must be numeric. Your day_epoch works fine as is. Changing to a DATE would necessitate BY RANGE(TO_DAYS(...)), which works fine.

You should limit the number of partitions to 50. (The 81 mentioned above is not bad.) The problem is that "lots" of partitions introduce different inefficiencies; "too few" partitions lead to "why bother".
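
To make Plan A concrete, here is a sketch on a cut-down demo table. The table name stats_plan_a, the partition names and the weekly boundaries are all invented for illustration; only the column names come from the question. day_epoch is appended to the primary key because MySQL requires every unique key on a partitioned table to include the partitioning column.

```sql
CREATE TABLE stats_plan_a (
  id          INT UNSIGNED NOT NULL AUTO_INCREMENT,
  day_epoch   INT NOT NULL,
  venue_id    INT NOT NULL,
  zone_id     INT NOT NULL,
  device_uuid BINARY(16) NOT NULL,
  -- Cluster by venue, then zone; id and day_epoch make the key unique.
  PRIMARY KEY (venue_id, zone_id, id, day_epoch),
  -- Keeps AUTO_INCREMENT happy.
  KEY (id)
) ENGINE=InnoDB
PARTITION BY RANGE (day_epoch) (
  PARTITION p20190306 VALUES LESS THAN (1551830400),  -- before 2019-03-06 UTC
  PARTITION p20190313 VALUES LESS THAN (1552435200),  -- week ending 2019-03-13
  PARTITION p20190320 VALUES LESS THAN (1553040000),  -- week ending 2019-03-20
  PARTITION p20190327 VALUES LESS THAN (1553644800),  -- week ending 2019-03-27
  PARTITION pfuture   VALUES LESS THAN MAXVALUE       -- catch-all for newer rows
);
```

With weekly partitions like these, the date range in the question (1552435200 up to but not including 1553040000) falls entirely inside p20190320.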

Note that almost always the optimal PK is different for a partitioned table than the equivalent non-partitioned table.

Note that I disagree with partitioning on venue_id since it is so easy to put that column at the start of the PK instead.


Assuming you search for a single venue_id and use my suggested partitioning & PK, here's how the SELECT performs:

  1. Filter on the date range. This is likely to limit the activity to a single partition.
  2. Drill into the data's BTree for that one partition to find the one venue_id.
  3. Hopscotch through the data from there, landing on the desired zone_ids.
  4. For each, further filter based on the date.
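
Whether step 1 really prunes down to one partition can be checked with EXPLAIN once the table has been repartitioned by day_epoch. This is a sketch that assumes MySQL 5.7 or later, where EXPLAIN includes a partitions column by default; on 5.6 and earlier, EXPLAIN PARTITIONS gives the same information.

```sql
EXPLAIN
SELECT device_uuid
FROM tracking_daily_stats_zone_unique_device_uuids_per_hour
WHERE venue_id = 46
  AND zone_id IN (102,105,108,110,111,113,116,117,118,121,287)
  AND day_epoch >= 1552435200
  AND day_epoch < 1553040000;
```

The partitions column should list only the partition(s) covering the date range, and the key column should show PRIMARY if the lookup is going through the clustered index as described in steps 2 and 3.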