Question

我有一个带有2个整数字段的mysql表“items”：seid和tiid
该表有大约35000000条记录，所以它非常大。

seid  tiid  
-----------
1     1  
2     2  
2     3  
2     4  
3     4  
4     1  
4     2

该表在两个字段上都有一个主键，seid上的索引和tiid上的索引。

有人输入一个或多个tiid值，现在我想得到最多结果的seid。

例如，当有人输入1,2,3时，我希望得到2和4的结果。他们在tiid值上都有2个匹配。

到目前为止我的查询：

SELECT COUNT(*) as c, seid
  FROM items
 WHERE tiid IN (1,2,3) 
GROUP BY seid
HAVING c = (SELECT COUNT(*) as c, seid
              FROM items
             WHERE tiid IN (1,2,3) 
          GROUP BY seid
          ORDER BY c DESC 
             LIMIT 1)

但是这个查询极其缓慢，因为有大表。

有没有人知道如何为此目的构建更好的查询？

Answer 1

这需要你遍历大表两次。也许缓存结果将有助于将所花费的时间减半，但看起来似乎没有更多的选择。

DROP temporary table if exists TMP_COUNTED;

create temporary table TMP_COUNTED
select seid, COUNT(*) as C
from items
where tiid in (1,2,3)
group by seid;

CREATE INDEX IX_TMP_COUNTED on TMP_COUNTED(C);

SELECT *
FROM TMP_COUNTED
WHERE C = (SELECT MAX(C) FROM seid)

Answer 2

所以我找到了2个解决方案，第1个：

SELECT c,GROUP_CONCAT(CAST(seid AS CHAR)) as seid_list 
FROM (
    SELECT COUNT(*) as c, seid FROM items 
    WHERE tiid IN (1,2,3) 
    GROUP BY seid ORDER BY c DESC
) T1 
GROUP BY c 
ORDER BY c DESC
LIMIT 1;
+---+-----------+
| c | seid_list |
+---+-----------+
| 2 | 2,4       | 
+---+-----------+

修改

Answer 3

预先计算所有唯一tiid值的计数并存储它们。

每小时，每天或每周刷新此计数。或者尝试通过更新来保持计数正确。这将消除进行计数的需要。计数总是很慢。

Answer 4

我有一个名为product_category的表，它有一个复合主键，由2个无符号整数字段组成，没有其他二级索引：

create table product_category
(
prod_id int unsigned not null,
cat_id mediumint unsigned not null,
primary key (cat_id, prod_id) -- note the clustered composite index !!
)
engine = innodb;

该表目前有1.25亿行

select count(*) as c from product_category;
c
=
125,524,947

具有以下索引/基数：

show indexes from product_category;

Table              Non_unique   Key_name    Seq_in_index    Column_name Collation   Cardinality
=====              ==========   ========    ============    =========== =========   ===========
product_category    0            PRIMARY                1    cat_id      A           1162276
product_category    0            PRIMARY                2    prod_id     A           125525826

如果我运行类似于你的查询（第一次运行没有缓存，使用冷/空缓冲区）：

select 
 prod_id, count(*) as c
from
 product_category 
where 
  cat_id between 1600 and 2000 -- using between to include a wider range of data
group by
 prod_id 
having c = (
  select count(*) as c from product_category 
  where cat_id between 1600 and 2000
  group by  prod_id order by c desc limit 1
)
order by prod_id;

我得到以下结果：

(cold run)
+---------+---+
| prod_id | c |
+---------+---+
|   34957 | 4 |
|  717812 | 4 |
|  816612 | 4 |
|  931111 | 4 |
+---------+---+
4 rows in set (0.18 sec)

(2nd run)
+---------+---+
| prod_id | c |
+---------+---+
|   34957 | 4 |
|  717812 | 4 |
|  816612 | 4 |
|  931111 | 4 |
+---------+---+
4 rows in set (0.14 sec)

解释计划如下：

+----+-------------+------------------+-------+---------------+---------+---------+------+--------+-----------------------------------------------------------+
| id | select_type | table            | type  | possible_keys | key     | key_len | ref  | rows   | Extra                                                     |
+----+-------------+------------------+-------+---------------+---------+---------+------+--------+-----------------------------------------------------------+
|  1 | PRIMARY     | product_category | range | PRIMARY       | PRIMARY | 3       | NULL | 194622 | Using where; Using index; Using temporary; Using filesort |
|  2 | SUBQUERY    | product_category | range | PRIMARY       | PRIMARY | 3       | NULL | 194622 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+------------------+-------+---------------+---------+---------+------+--------+-----------------------------------------------------------+

如果我运行regilero的查询：

SELECT c,prod_id
  FROM (
   SELECT c,prod_id,CASE WHEN @mmax<=c THEN @mmax:=c ELSE 0 END 'mymax'
     FROM (
       SELECT COUNT(*) as c, prod_id FROM product_category WHERE
        cat_id between 1600 and 2000
        GROUP BY prod_id
        ORDER BY c DESC
    ) res1
   ,(SELECT @mmax:=0) initmax
   ORDER BY c DESC
 ) res2 WHERE mymax>0;

我得到以下结果：

(cold) 
+---+---------+
| c | prod_id |
+---+---------+
| 4 |  931111 |
| 4 |   34957 |
| 4 |  717812 |
| 4 |  816612 |
+---+---------+
4 rows in set (0.17 sec)

(2nd run)
+---+---------+
| c | prod_id |
+---+---------+
| 4 |   34957 |
| 4 |  717812 |
| 4 |  816612 |
| 4 |  931111 |
+---+---------+
4 rows in set (0.13 sec)

解释计划如下：

+----+-------------+------------------+--------+---------------+---------+---------+------+--------+-----------------------------------------------------------+
| id | select_type | table            | type   | possible_keys | key     | key_len | ref  | rows   | Extra                                                     |
+----+-------------+------------------+--------+---------------+---------+---------+------+--------+-----------------------------------------------------------+
|  1 | PRIMARY     | <derived2>       | ALL    | NULL          | NULL    | NULL   | NULL |  92760 | Using where                                               |
|  2 | DERIVED     | <derived4>       | system | NULL          | NULL    | NULL   | NULL |      1 | Using filesort                                            |
|  2 | DERIVED     | <derived3>       | ALL    | NULL          | NULL    | NULL   | NULL |  92760 |                                                           |
|  4 | DERIVED     | NULL             | NULL   | NULL          | NULL    | NULL   | NULL |   NULL | No tables used                                            |
|  3 | DERIVED     | product_category | range  | PRIMARY       | PRIMARY | 3      | NULL | 194622 | Using where; Using index; Using temporary; Using filesort   |
+----+-------------+------------------+--------+---------------+---------+---------+------+--------+-----------------------------------------------------------+

最后尝试使用cyberwiki的方法：

drop procedure if exists cyberkiwi_variant;

delimiter #

create procedure cyberkiwi_variant()
begin

create temporary table tmp engine=memory
 select prod_id, count(*) as c from
 product_category where cat_id between 1600 and 2000
 group by prod_id order by c desc; 

select max(c) into @max from tmp;

select * from tmp where c = @max;

drop temporary table if exists tmp;

end#

delimiter ;

call cyberkiwi_variant();

我得到以下结果：

(cold and 2nd run)
+---------+---+
| prod_id | c |
+---------+---+
|  816612 | 4 |
|  931111 | 4 |
|   34957 | 4 |
|  717812 | 4 |
+---------+---+
4 rows in set (0.14 sec)

解释计划如下：

+----+-------------+------------------+-------+---------------+---------+---------+------+--------+-----------------------------------------------------------+
| id | select_type | table            | type  | possible_keys | key     | key_len | ref  | rows   | Extra                                                     |
+----+-------------+------------------+-------+---------------+---------+---------+------+--------+-----------------------------------------------------------+
|  1 | SIMPLE      | product_category | range | PRIMARY       | PRIMARY | 3  | NULL | 194622 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+------------------+-------+---------------+---------+---------+------+--------+-----------------------------------------------------------+

所以测试的所有方法似乎都有。相同的运行时间介于0.14和0.18秒之间，考虑到表的大小和查询的行数，这对我来说似乎非常有效。

希望这会有所帮助 - http://dev.mysql.com/doc/refman/5.0/en/innodb-index-types.html

Answer 5

如果我理解你的要求，你可以尝试这样的事情

select seid, tiid, count(*) from items where tiid in (1,2,3)
group by seid, tiid
order by seid

在大表的查询中获取计数匹配非常慢

5 个答案: