I have a MySQL table "items" with two integer columns: seid and tiid.
The table has about 35,000,000 rows, so it is very large.
seid tiid
-----------
1 1
2 2
2 3
2 4
3 4
4 1
4 2
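The table has a primary key on both columns, an index on seid, and an index on tiid.
A user enters one or more tiid values, and I want to get the seid(s) with the most matches.
For example, when someone enters 1,2,3, I want to get seid 2 and 4 as the result. Both have 2 matching tiid values.
My query so far: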
SELECT COUNT(*) as c, seid
FROM items
WHERE tiid IN (1,2,3)
GROUP BY seid
HAVING c = (SELECT COUNT(*) as c -- a scalar subquery may return only one column
FROM items
WHERE tiid IN (1,2,3)
GROUP BY seid
ORDER BY c DESC
LIMIT 1)
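But this query is extremely slow because the table is so large.
Does anyone know how to build a better query for this?
Answer 0 (score: 2)
The query in the question has to go through the large table twice. Caching the grouped result in a temporary table should help roughly halve the time taken, but there does not seem to be much more room for optimization.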
DROP temporary table if exists TMP_COUNTED;
create temporary table TMP_COUNTED
select seid, COUNT(*) as C
from items
where tiid in (1,2,3)
group by seid;
CREATE INDEX IX_TMP_COUNTED on TMP_COUNTED(C);
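-- MySQL cannot refer to a temporary table twice in the same query,
-- so read the maximum into a user variable first
SELECT MAX(C) INTO @max_c FROM TMP_COUNTED;
SELECT *
FROM TMP_COUNTED
WHERE C = @max_c;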
Answer 1 (score: 2)
I found two solutions. The first:
SELECT c,GROUP_CONCAT(CAST(seid AS CHAR)) as seid_list
FROM (
SELECT COUNT(*) as c, seid FROM items
WHERE tiid IN (1,2,3)
GROUP BY seid ORDER BY c DESC
) T1
GROUP BY c
ORDER BY c DESC
LIMIT 1;
+---+-----------+
| c | seid_list |
+---+-----------+
| 2 | 2,4 |
+---+-----------+
Edit:
EXPLAIN SELECT c,GROUP_CONCAT(CAST(seid AS CHAR)) as seid_list FROM ( SELECT COUNT(*) as c, seid FROM items WHERE tiid IN (1,2,3) GROUP BY seid ORDER BY c DESC ) T1 GROUP BY c ORDER BY c DESC LIMIT 1;
+----+-------------+------------+-------+------------------+---------+---------+------+------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+-------+------------------+---------+---------+------+------+-----------------------------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 3 | Using filesort |
| 2 | DERIVED | items | range | PRIMARY,tiid_idx | PRIMARY | 4 | NULL | 4 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+------------+-------+------------------+---------+---------+------+------+-----------------------------------------------------------+
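Second edit:
The first solution has a problem: with billions of rows the result field could become too large (GROUP_CONCAT output is truncated at group_concat_max_len). So here is another solution, which avoids the double-query effect by applying the classic "remember the maximum and compare" technique with MySQL user variables: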
SELECT c,seid
FROM (
SELECT c,seid,CASE WHEN @mmax<=c THEN @mmax:=c ELSE 0 END 'mymax'
FROM (
SELECT COUNT(*) as c, seid FROM items WHERE tiid IN (1,2,3)
GROUP BY seid
ORDER BY c DESC
) res1
,(SELECT @mmax:=0) initmax
ORDER BY c DESC
) res2 WHERE mymax>0;
+---+------+
| c | seid |
+---+------+
| 2 | 4 |
| 2 | 2 |
+---+------+
EXPLAIN:
+----+-------------+------------+--------+------------------+---------+---------+------+------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+------------------+---------+---------+------+------+-----------------------------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 3 | Using where |
| 2 | DERIVED | <derived4> | system | NULL | NULL | NULL | NULL | 1 | Using filesort |
| 2 | DERIVED | <derived3> | ALL | NULL | NULL | NULL | NULL | 3 | |
| 4 | DERIVED | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used |
| 3 | DERIVED | items | range | PRIMARY,tiid_idx | PRIMARY | 4 | NULL | 4 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+------------+--------+------------------+---------+---------+------+------+-----------------------------------------------------------+
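On MySQL 8.0 or later, the same result can probably be obtained without user variables (assigning them inside expressions is deprecated there). This is a minimal sketch, assuming a server with common table expression support; the CTE name counted is made up for illustration and is not part of the answer above:

```sql
-- Assumes MySQL 8.0+; `counted` is a hypothetical CTE name.
WITH counted AS (
    -- one aggregation over the rows that match the entered tiid values
    SELECT seid, COUNT(*) AS c
    FROM items
    WHERE tiid IN (1,2,3)
    GROUP BY seid
)
SELECT seid, c
FROM counted
WHERE c = (SELECT MAX(c) FROM counted);  -- keep every seid tied for the maximum
```

With the sample rows from the question this should return seid 2 and 4, each with c = 2. It would need to be timed against the real table before preferring it over the variants above.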
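Answer 2 (score: 1)
Precompute the counts for all unique tiid values and store them.
Refresh these counts hourly, daily, or weekly, or try to keep them correct as rows are updated. This removes the need to count at query time; counting is always slow.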
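One possible reading of this suggestion is a small summary table that is rebuilt on a schedule. The sketch below is only an assumption about what was meant; the table name tiid_counts and the column cnt are hypothetical:

```sql
-- Hypothetical summary table: one row per distinct tiid.
CREATE TABLE tiid_counts (
    tiid INT NOT NULL PRIMARY KEY,
    cnt  INT UNSIGNED NOT NULL
) ENGINE = InnoDB;

-- Periodic refresh (hourly, daily or weekly): recount in a single pass.
-- REPLACE overwrites the existing row for each tiid; rows for tiid values
-- that no longer occur in items are not removed by this statement.
REPLACE INTO tiid_counts (tiid, cnt)
SELECT tiid, COUNT(*)
FROM items
GROUP BY tiid;
```

Note that per-tiid totals alone do not give the per-seid match count for an arbitrary combination of entered tiid values, so this would at best be a building block.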
Answer 3 (score: 1)
I have a table named product_category with a composite primary key made up of 2 unsigned integer columns and no other secondary indexes:
create table product_category
(
prod_id int unsigned not null,
cat_id mediumint unsigned not null,
primary key (cat_id, prod_id) -- note the clustered composite index !!
)
engine = innodb;
The table currently has about 125 million rows:
select count(*) as c from product_category;
c
=
125,524,947
with the following indexes/cardinalities:
show indexes from product_category;
Table Non_unique Key_name Seq_in_index Column_name Collation Cardinality
===== ========== ======== ============ =========== ========= ===========
product_category 0 PRIMARY 1 cat_id A 1162276
product_category 0 PRIMARY 2 prod_id A 125525826
If I run a query similar to yours (the first run has nothing cached and uses cold/empty buffers):
select
prod_id, count(*) as c
from
product_category
where
cat_id between 1600 and 2000 -- using between to include a wider range of data
group by
prod_id
having c = (
select count(*) as c from product_category
where cat_id between 1600 and 2000
group by prod_id order by c desc limit 1
)
order by prod_id;
I get the following results:
(cold run)
+---------+---+
| prod_id | c |
+---------+---+
| 34957 | 4 |
| 717812 | 4 |
| 816612 | 4 |
| 931111 | 4 |
+---------+---+
4 rows in set (0.18 sec)
(2nd run)
+---------+---+
| prod_id | c |
+---------+---+
| 34957 | 4 |
| 717812 | 4 |
| 816612 | 4 |
| 931111 | 4 |
+---------+---+
4 rows in set (0.14 sec)
The explain plan is as follows:
+----+-------------+------------------+-------+---------------+---------+---------+------+--------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------------+-------+---------------+---------+---------+------+--------+-----------------------------------------------------------+
| 1 | PRIMARY | product_category | range | PRIMARY | PRIMARY | 3 | NULL | 194622 | Using where; Using index; Using temporary; Using filesort |
| 2 | SUBQUERY | product_category | range | PRIMARY | PRIMARY | 3 | NULL | 194622 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+------------------+-------+---------------+---------+---------+------+--------+-----------------------------------------------------------+
If I run regilero's query:
SELECT c,prod_id
FROM (
SELECT c,prod_id,CASE WHEN @mmax<=c THEN @mmax:=c ELSE 0 END 'mymax'
FROM (
SELECT COUNT(*) as c, prod_id FROM product_category WHERE
cat_id between 1600 and 2000
GROUP BY prod_id
ORDER BY c DESC
) res1
,(SELECT @mmax:=0) initmax
ORDER BY c DESC
) res2 WHERE mymax>0;
I get the following results:
(cold)
+---+---------+
| c | prod_id |
+---+---------+
| 4 | 931111 |
| 4 | 34957 |
| 4 | 717812 |
| 4 | 816612 |
+---+---------+
4 rows in set (0.17 sec)
(2nd run)
+---+---------+
| c | prod_id |
+---+---------+
| 4 | 34957 |
| 4 | 717812 |
| 4 | 816612 |
| 4 | 931111 |
+---+---------+
4 rows in set (0.13 sec)
The explain plan is as follows:
+----+-------------+------------------+--------+---------------+---------+---------+------+--------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------------+--------+---------------+---------+---------+------+--------+-----------------------------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 92760 | Using where |
| 2 | DERIVED | <derived4> | system | NULL | NULL | NULL | NULL | 1 | Using filesort |
| 2 | DERIVED | <derived3> | ALL | NULL | NULL | NULL | NULL | 92760 | |
| 4 | DERIVED | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used |
| 3 | DERIVED | product_category | range | PRIMARY | PRIMARY | 3 | NULL | 194622 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+------------------+--------+---------------+---------+---------+------+--------+-----------------------------------------------------------+
Finally, trying cyberkiwi's approach:
drop procedure if exists cyberkiwi_variant;
delimiter #
create procedure cyberkiwi_variant()
begin
create temporary table tmp engine=memory
select prod_id, count(*) as c from
product_category where cat_id between 1600 and 2000
group by prod_id order by c desc;
select max(c) into @max from tmp;
select * from tmp where c = @max;
drop temporary table if exists tmp;
end#
delimiter ;
call cyberkiwi_variant();
I get the following results:
(cold and 2nd run)
+---------+---+
| prod_id | c |
+---------+---+
| 816612 | 4 |
| 931111 | 4 |
| 34957 | 4 |
| 717812 | 4 |
+---------+---+
4 rows in set (0.14 sec)
The explain plan is as follows:
+----+-------------+------------------+-------+---------------+---------+---------+------+--------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------------+-------+---------------+---------+---------+------+--------+-----------------------------------------------------------+
| 1 | SIMPLE | product_category | range | PRIMARY | PRIMARY | 3 | NULL | 194622 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+------------------+-------+---------------+---------+---------+------+--------+-----------------------------------------------------------+
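So all of the methods tested seem to have approximately the same runtime, between 0.13 and 0.18 seconds, which seems pretty performant to me considering the size of the table and the number of rows queried.
Hope this helps - http://dev.mysql.com/doc/refman/5.0/en/innodb-index-types.html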
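Carried over to the items table from the question, the same clustered layout might look like the sketch below. The column types, the index name seid_idx, and above all the order of the primary key columns are assumptions for illustration; the question does not show the actual table definition:

```sql
-- Hypothetical definition of the question's table, mirroring product_category.
create table items
(
seid int not null,
tiid int not null,
primary key (tiid, seid),   -- tiid first: rows for one tiid are stored together
key seid_idx (seid)         -- hypothetical secondary index for lookups by seid
)
engine = innodb;
```

With InnoDB the primary key is the clustered index, so a filter such as tiid IN (1,2,3) can be answered by a range read on the primary key, which is what the explain plans above show for cat_id.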
Answer 4 (score: 0)
If I understand your requirement correctly, you could try something like this:
select seid, tiid, count(*) from items where tiid in (1,2,3)
group by seid, tiid
order by seid