How to optimize a query that uses a dependent subquery

Time: 2019-07-16 14:14:47

Tags: mysql

I am using this NOT IN query to return inactive users from a single table.

SELECT * 
  FROM 
     ( SELECT DISTINCT name
                  FROM userlog 
                 WHERE created >= '2019-07-07 00:00:00' - INTERVAL 30 DAY 
                   AND created <= '2019-07-13 23:59:59' - INTERVAL 30 DAY 
                   AND isSample = 0
     ) inactive 
 WHERE inactive.name NOT IN 
        ( 
     SELECT name AS name 
       FROM userlog 
      WHERE created >= '2019-07-13 23:59:59' - INTERVAL 30 DAY 
        AND created <= '2019-07-13 23:59:59' AND isSample = 0
        )

EXPLAIN output for this query:

+----+-------------+------------+------------+-------+------------------+-----------+---------+------+---------+----------+----------------------------------------+
| id | select_type | table      | partitions | type  | possible_keys    | key       | key_len | ref  | rows    | filtered | Extra                                  |
+----+-------------+------------+------------+-------+------------------+-----------+---------+------+---------+----------+----------------------------------------+
|  1 | primary     | <derived2> | NULL       | ALL   | NULL             | NULL      | NULL    | NULL | 50000   | 100.00   | using where                            |
|  3 | subquery    | userlog    | NULL       | range | *list of indexes | nameindex | 774     | NULL | 1000000 | 10.00    | using index condition                  |
|  2 | derived     | userlog    | NULL       | range | *list of indexes | nameindex | 774     | NULL | 500000  | 10.00    | using index condition; using temporary |
+----+-------------+------------+------------+-------+------------------+-----------+---------+------+---------+----------+----------------------------------------+

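I don't want to query by name, because names can change whereas IDs never do, so I switched to querying by ID. I use the same query with only the column changed: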

SELECT * 
  FROM 
     (SELECT DISTINCT(id) AS id
                 FROM userlog 
                 WHERE created >= '2019-07-07 00:00:00' - INTERVAL 30 DAY 
                 AND created <= '2019-07-13 23:59:59' - INTERVAL 30 DAY 
                 AND isSample = '0'
     ) inactive 
  WHERE inactive.id NOT IN 
    (SELECT id AS id
       FROM userlog 
       WHERE created >= '2019-07-13 23:59:59' - INTERVAL 30 DAY 
       AND created <= '2019-07-13 23:59:59' 
       AND isSample = '0')

The EXPLAIN output for this query now differs from the one above:

+----+--------------------+------------+------------+----------------+------------------+------------+---------+------+--------+----------+----------------------------------------+
| id | select_type        | table      | partitions | type           | possible_keys    | key        | key_len | ref  | rows   | filtered | Extra                                  |
+----+--------------------+------------+------------+----------------+------------------+------------+---------+------+--------+----------+----------------------------------------+
|  1 | primary            | <derived2> | NULL       | ALL            | NULL             | NULL       | NULL    | NULL | 50000  | 100.00   | using where                            |
|  3 | dependent subquery | userlog    | NULL       | index_subquery | *list of indexes | countindex | 768     | func | 892    | 0.61     | using where; full scan on null key     |
|  2 | derived            | userlog    | NULL       | range          | *list of indexes | idindex    | 774     | NULL | 500000 | 10.00    | using index condition; using temporary |
+----+--------------------+------------+------------+----------------+------------------+------------+---------+------+--------+----------+----------------------------------------+

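The query now uses a dependent subquery and performs a full table scan, which is very slow on my table (20+ million records). I noticed that the ID query does not use idindex but uses my countindex instead. If I run each of the two queries on its own, both use idindex, but when they are combined with NOT IN, countindex is used.

Here are my indexes: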

+--------------------------------------------------------------------------------------------------------------------------------+
|  TABLE  | NON_UNIQUE | KEY NAME | SEQ_IN_INDEX | COLUMN_NAME | COLLATION | CARDINALITY | SUB_PART | PACKED | NULL | INDEX_TYPE |
+--------------------------------------------------------------------------------------------------------------------------------+
| userlog |      1     |countindex|       1      |      id     |     A     |    75000    |   255    |  NULL  |  YES |   BTREE    |
+--------------------------------------------------------------------------------------------------------------------------------+
| userlog |      1     |countindex|       2      |      pk     |     A     |  11500000   |   null   |  NULL  |  YES |   BTREE    |
+--------------------------------------------------------------------------------------------------------------------------------+
| userlog |      1     |nameindex |       1      |   created   |     A     |   6800000   |   null   |  NULL  |  YES |   BTREE    |
+--------------------------------------------------------------------------------------------------------------------------------+
| userlog |      1     |nameindex |       2      |    sample   |     A     |  13500000   |   null   |  NULL  |  YES |   BTREE    |
+--------------------------------------------------------------------------------------------------------------------------------+
| userlog |      1     |nameindex |       3      |    name     |     A     |   24000000  |   null   |  NULL  |  YES |   BTREE    |
+--------------------------------------------------------------------------------------------------------------------------------+
| userlog |      1     | idindex  |       1      |      id     |     A     |    75000    |    512   |  NULL  |  YES |   BTREE    |
+--------------------------------------------------------------------------------------------------------------------------------+
| userlog |      1     | idindex  |       2      |   created   |     A     |   22000000  |   null   |  NULL  |  YES |   BTREE    |
+--------------------------------------------------------------------------------------------------------------------------------+
| userlog |      1     | idindex  |       3      |   sample    |     A     |   20500000  |   null   |  NULL  |  YES |   BTREE    |
+--------------------------------------------------------------------------------------------------------------------------------+

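Does anyone know why a different index is being used?

Also, is there a way to optimize the ID query so that this is not a problem?

If I am missing any information, I can update the question.

Edit:

Here is the updated EXPLAIN output for the answer below: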

+----+--------------------+-------+------------+-------+------------------+-------------+---------+----------+--------+----------+-----------------------------------------------------+
| id | select_type        | table | partitions | type  | possible_keys    | key         | key_len | ref      | rows   | filtered | Extra                                               |
+----+--------------------+-------+------------+-------+------------------+-------------+---------+----------+--------+----------+-----------------------------------------------------+
|  1 | primary            | t1    | NULL       | range | *list of indexes | nameindex   | 774     | NULL     | 500000 | 10.00    | using index condition; using where; using temporary |
|  2 | dependent subquery | t2    | NULL       | ref   | *list of indexes | idonlyindex | 768     | db.t1.id | 892    | 0.61     | using where                                         |
+----+--------------------+-------+------------+-------+------------------+-------------+---------+----------+--------+----------+-----------------------------------------------------+

Note: idonlyindex is an index on the id column only.

2 answers:

Answer 0: (score: 0)

Instead of a subquery, you can solve this with GROUP BY and conditional filtering in the HAVING clause:

SELECT id 
FROM userlog 
WHERE isSample = '0' 
GROUP BY id 
HAVING 
  /* No activity in last 30 days */
  NOT SUM(created >= '2019-07-13 23:59:59' - INTERVAL 30 DAY 
          AND created <= '2019-07-13 23:59:59') 
  AND 
  /* Activity in 7 days prior to last 30 days */
  SUM(created >= '2019-07-07 00:00:00' - INTERVAL 30 DAY
      AND created <= '2019-07-13 23:59:59' - INTERVAL 30 DAY)

Another approach uses a correlated subquery:

SELECT 
  DISTINCT t1.id
FROM userlog AS t1
WHERE t1.isSample = '0' 
  AND t1.created >= '2019-07-07 00:00:00' - INTERVAL 30 DAY
  AND t1.created <= '2019-07-13 23:59:59' - INTERVAL 30 DAY
  AND NOT EXISTS (SELECT 1 
                  FROM userlog AS t2 
                  WHERE t2.id = t1.id 
                    AND t2.isSample = '0' 
                    AND t2.created >= '2019-07-13 23:59:59' - INTERVAL 30 DAY 
                    AND t2.created <= '2019-07-13 23:59:59')

Try both queries and check which one is more efficient. You may also want to define a composite index on (isSample, id, created).
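
A minimal sketch of that index definition follows. The index name idx_sample_id_created is hypothetical, and the column is assumed to be called isSample as in the queries (the index listing in the question shows it as sample). The existing indexes in the question use a prefix on id (SUB_PART 255 and 512), which suggests id is a long string column, so a prefix length is assumed here as well.

```sql
-- Hypothetical index name; column order follows the answer's suggestion.
-- id(255) is an assumption based on the prefix lengths shown in the
-- question's index listing; drop the prefix if id is short enough.
ALTER TABLE userlog
  ADD INDEX idx_sample_id_created (isSample, id(255), created);

-- Re-check the plan of the correlated-subquery version afterwards.
EXPLAIN
SELECT DISTINCT t1.id
FROM userlog AS t1
WHERE t1.isSample = '0'
  AND t1.created >= '2019-07-07 00:00:00' - INTERVAL 30 DAY
  AND t1.created <= '2019-07-13 23:59:59' - INTERVAL 30 DAY
  AND NOT EXISTS (SELECT 1
                  FROM userlog AS t2
                  WHERE t2.id = t1.id
                    AND t2.isSample = '0'
                    AND t2.created >= '2019-07-13 23:59:59' - INTERVAL 30 DAY
                    AND t2.created <= '2019-07-13 23:59:59');
```

Building an index on a table of 20+ million rows can take a while, so it is probably best tried on a copy first.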

Answer 1: (score: 0)

Maybe something like this?

SELECT DISTINCT  id
 FROM userlog 
    WHERE 
        (  created >= '2019-07-07 00:00:00' - INTERVAL 30 DAY 
                 AND created <= '2019-07-13 23:59:59' - INTERVAL 30 DAY 
                 AND isSample = 0
         )

    AND    name NOT IN 
         ( 
             SELECT u1.name  
              FROM userlog as u1
             WHERE u1.created >= '2019-07-13 23:59:59' - INTERVAL 30 DAY 
                AND u1.created <= '2019-07-13 23:59:59' AND u1.isSample = 0
         )


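If you filter on the name column, adding an index on it would help. The parentheses are added so that the first set of conditions is evaluated independently of the second.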
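
One thing worth checking with any NOT IN form: the index listing in the question reports the indexed columns as nullable, and the "full scan on null key" note in the second EXPLAIN is a fallback MySQL reports when it cannot rely on an index lookup for the subquery, which can happen when the value being compared may be NULL. The MySQL manual suggests declaring such columns NOT NULL where that is true, or guarding the comparison with IS NOT NULL. A sketch of the guard applied to the id-based query from the question, assuming rows with a NULL id can simply be ignored:

```sql
SELECT inactive.id
FROM (SELECT DISTINCT id
      FROM userlog
      WHERE created >= '2019-07-07 00:00:00' - INTERVAL 30 DAY
        AND created <= '2019-07-13 23:59:59' - INTERVAL 30 DAY
        AND isSample = 0) AS inactive
WHERE inactive.id IS NOT NULL      -- guard: NULL NOT IN (...) is never evaluated
  AND inactive.id NOT IN
      (SELECT id
       FROM userlog
       WHERE created >= '2019-07-13 23:59:59' - INTERVAL 30 DAY
         AND created <= '2019-07-13 23:59:59'
         AND isSample = 0
         AND id IS NOT NULL);      -- a NULL in this list would make NOT IN match nothing
```

Whether this changes the chosen plan depends on the MySQL version and table statistics, so compare EXPLAIN before and after.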