Optimizing a huge keyword table?

Date: 2013-07-26 09:25:01

Tags: mysql search indexing keyword-search

I have a huge table like this:
CREATE TABLE IF NOT EXISTS `object_search` (
  `keyword` varchar(40) COLLATE latin1_german1_ci NOT NULL,
  `object_id` int(10) unsigned NOT NULL,
  PRIMARY KEY (`keyword`,`object_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_german1_ci;

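It contains about 39 million rows (using more than 1 GB of space) and holds the index data for 1 million records in the object table (which object_id points to).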

Searching it with a query like this:
SELECT object_id, COUNT(object_id) AS hits
FROM object_search
WHERE keyword = 'woman' OR keyword = 'house'
GROUP BY object_id
HAVING hits = 2

is already much faster than searching with LIKE on a composed keywords field in the object table, but it still takes 1 minute.

Its EXPLAIN output looks like this:

+----+-------------+--------+------+---------------+---------+---------+-------+--------+----------+--------------------------+
| id | select_type | table  | type | possible_keys | key     | key_len | ref   | rows   | filtered | Extra                    |
+----+-------------+--------+------+---------------+---------+---------+-------+--------+----------+--------------------------+
|  1 | SIMPLE      | search | ref  | PRIMARY       | PRIMARY | 42      | const | 345180 |   100.00 | Using where; Using index |
+----+-------------+--------+------+---------------+---------+---------+-------+--------+----------+--------------------------+

The full EXPLAIN with the joined object, object_color and object_locale tables, where the query above runs in a subquery to avoid overhead, looks like this:

+----+-------------+-------------------+--------+---------------+-----------+---------+------------------+--------+----------+---------------------------------+
| id | select_type | table             | type   | possible_keys | key       | key_len | ref              | rows   | filtered | Extra                           |
+----+-------------+-------------------+--------+---------------+-----------+---------+------------------+--------+----------+---------------------------------+
|  1 | PRIMARY     | <derived2>        | ALL    | NULL          | NULL      | NULL    | NULL             | 182544 |   100.00 | Using temporary; Using filesort |
|  1 | PRIMARY     | object_color      | eq_ref | object_id     | object_id | 4       | search.object_id |      1 |   100.00 |                                 |
|  1 | PRIMARY     | locale            | eq_ref | object_id     | object_id | 4       | search.object_id |      1 |   100.00 |                                 |
|  1 | PRIMARY     | object            | eq_ref | PRIMARY       | PRIMARY   | 4       | search.object_id |      1 |   100.00 |                                 |
|  2 | DERIVED     | search            | ref    | PRIMARY       | PRIMARY   | 42      |                  | 345180 |   100.00 | Using where; Using index        |
+----+-------------+-------------------+--------+---------------+-----------+---------+------------------+--------+----------+---------------------------------+

My primary goal is to be able to complete a search within 1 or 2 seconds.

So, are there other techniques to improve keyword search speed?

Update 2013-08-06:

After applying most of Neville K's suggestions, I now have the following setup:

CREATE TABLE `object_search_keyword` (
  `keyword_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `keyword` varchar(64) COLLATE latin1_german1_ci NOT NULL,
  PRIMARY KEY (`keyword_id`),
  FULLTEXT KEY `keyword_ft` (`keyword`)
) ENGINE=MyISAM  DEFAULT CHARSET=latin1 COLLATE=latin1_german1_ci;

CREATE TABLE `object_search` (
  `keyword_id` int(10) unsigned NOT NULL,
  `object_id` int(10) unsigned NOT NULL,
  PRIMARY KEY (`keyword_id`,`object_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

The EXPLAIN output of the new query looks like this:

+----+-------------+----------------+----------+--------------------+------------+---------+---------------------------+---------+----------+----------------------------------------------+
| id | select_type | table          | type     | possible_keys      | key        | key_len | ref                       | rows    | filtered | Extra                                        |
+----+-------------+----------------+----------+--------------------+------------+---------+---------------------------+---------+----------+----------------------------------------------+
|  1 | PRIMARY     | <derived2>     | ALL      | NULL               | NULL       | NULL    | NULL                      |   24381 |   100.00 | Using temporary; Using filesort              |
|  1 | PRIMARY     | object_color   | eq_ref   | object_id          | object_id  | 4       | object_search.object_id   |       1 |   100.00 |                                              |
|  1 | PRIMARY     | object         | eq_ref   | PRIMARY            | PRIMARY    | 4       | object_search.object_id   |       1 |   100.00 |                                              |
|  1 | PRIMARY     | locale         | eq_ref   | object_id          | object_id  | 4       | object_search.object_id   |       1 |   100.00 |                                              |
|  2 | DERIVED     | <derived4>     | system   | NULL               | NULL       | NULL    | NULL                      |       1 |   100.00 |                                              |
|  2 | DERIVED     | <derived3>     | ALL      | NULL               | NULL       | NULL    | NULL                      |   24381 |   100.00 |                                              |
|  4 | DERIVED     | NULL           | NULL     | NULL               | NULL       | NULL    | NULL                      |    NULL |     NULL | No tables used                               |
|  3 | DERIVED     | object_keyword | fulltext | PRIMARY,keyword_ft | keyword_ft | 0       |                           |       1 |   100.00 | Using where; Using temporary; Using filesort |
|  3 | DERIVED     | object_search  | ref      | PRIMARY            | PRIMARY    | 4       | object_keyword.keyword_id | 2190225 |   100.00 | Using index                                  |
+----+-------------+----------------+----------+--------------------+------------+---------+---------------------------+---------+----------+----------------------------------------------+

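The many derived tables come from the keyword comparison subquery being nested inside another subquery that does nothing but count the number of rows returned: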

SELECT SQL_NO_CACHE object.object_id, ..., @rn AS numrows
FROM (
    SELECT *, @rn := @rn + 1
    FROM (
        SELECT SQL_NO_CACHE search.object_id, COUNT(search.object_id) AS hits
        FROM object_keyword AS kwd
        INNER JOIN object_search AS search ON (kwd.keyword_id = search.keyword_id)
        WHERE MATCH (kwd.keyword) AGAINST ('+(woman) +(house)')
        GROUP BY search.object_id HAVING hits = 2
    ) AS numrowswrapper
    CROSS JOIN (SELECT @rn := 0) CONST
) AS search
INNER JOIN object AS object ON (search.object_id = object.object_id)
LEFT JOIN object_color AS object_color ON (search.object_id = object_color.object_id)
LEFT JOIN object_locale AS locale ON (search.object_id = locale.object_id)
ORDER BY timestamp_upload DESC

The query above actually runs in ~6 seconds, since it searches for two keywords. The more keywords I search for, the faster the search gets.

Is there any way to optimize this further?

Update 2013-08-07

The blocking part is almost certainly the appended ORDER BY clause. Without it, the query executes in less than a second.

So, is there any way to sort the result faster? Any suggestions are welcome, even hackish ones that require post-processing somewhere else.

Update 2013-08-07, later that day

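Ladies and gentlemen, nesting the WHERE and ORDER BY clauses in another layer of subquery, so that they are not bothered with tables they don't need, roughly doubles performance again: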

SELECT wowrapper.*, locale.title
FROM (
    SELECT SQL_NO_CACHE object.object_id, ..., @rn AS numrows
    FROM (
        SELECT *, @rn := @rn + 1
        FROM (
            SELECT SQL_NO_CACHE search.object_id, COUNT(search.object_id) AS hits
            FROM object_keyword AS kwd
            INNER JOIN object_search AS search ON (kwd.keyword_id = search.keyword_id)
            WHERE MATCH (kwd.keyword) AGAINST ('+(frau)')
            GROUP BY search.object_id HAVING hits = 1
        ) AS numrowswrapper
        CROSS JOIN (SELECT @rn := 0) CONST
    ) AS search
    INNER JOIN object AS object ON (search.object_id = object.object_id)
    LEFT JOIN object_color AS color ON (search.object_id = color.object_id)
    WHERE 1
    ORDER BY object.object_id DESC
) AS wowrapper
LEFT JOIN object_locale AS locale ON (wowrapper.object_id = locale.object_id)
LIMIT 0,48

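A search that took 12 seconds (single keyword, ~200K results) now takes 6, and a search for two keywords that took 6 seconds (60K results) now takes about 3.5 seconds.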

This is already a massive improvement, but is there any chance to push it further?

Update 2013-08-08, early in the day

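I undid that last nested variant of the query, since it actually slowed down other variants of it... I am now trying some other things with a different table layout and FULLTEXT indexing using MyISAM, for a dedicated search table with a combined keywords field (comma-separated in a TEXT field).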

Update 2013-08-08

Okay, plain fulltext indexing doesn't really help.

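Back to the previous setup, the only blocking part is the ORDER BY (which uses a temporary table and filesort). Without it, a search completes in less than a second!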

So basically what's left is this:
How do I optimize the ORDER BY clause to run faster by eliminating the use of the temporary table?

3 Answers:

Answer 0: (score: 1)

Full text search will be much faster than using the standard SQL string comparison features.

Secondly, if there is a high degree of redundancy in the keywords, you could consider a "many to many" implementation:

Keywords
--------
keyword_id
keyword

keyword_object
-------------
keyword_id
object_id

objects
-------
object_id
...

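If this reduces the string comparison from 39 million rows to 100K rows (roughly the size of an English dictionary), you may also see a marked improvement: the query would only have to perform 100K string comparisons, and joining on the integer keyword_id and object_id fields should be much faster than doing 39M string comparisons.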
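The layout above could be tried out roughly as follows. This is only a minimal sketch: an in-memory SQLite database stands in for MySQL, the sample rows are made up, and `objects_with_all` is a hypothetical helper name, not something from the question.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL; tables follow the sketch above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE keywords (
        keyword_id INTEGER PRIMARY KEY,
        keyword    TEXT NOT NULL UNIQUE
    );
    CREATE TABLE keyword_object (
        keyword_id INTEGER NOT NULL,
        object_id  INTEGER NOT NULL,
        PRIMARY KEY (keyword_id, object_id)
    );
""")

# Made-up sample data: every keyword string is stored exactly once.
conn.executemany("INSERT INTO keywords VALUES (?, ?)",
                 [(1, "woman"), (2, "house"), (3, "tree")])
conn.executemany("INSERT INTO keyword_object VALUES (?, ?)",
                 [(1, 10), (2, 10), (1, 11), (3, 11), (2, 12)])

def objects_with_all(conn, words):
    """Return the ids of objects that are tagged with every keyword in `words`."""
    words = sorted(set(words))
    placeholders = ", ".join("?" * len(words))
    sql = f"""
        SELECT ko.object_id
        FROM keywords AS k
        JOIN keyword_object AS ko ON ko.keyword_id = k.keyword_id
        WHERE k.keyword IN ({placeholders})
        GROUP BY ko.object_id
        HAVING COUNT(*) = ?
        ORDER BY ko.object_id
    """
    return [row[0] for row in conn.execute(sql, (*words, len(words)))]

matches = objects_with_all(conn, ["woman", "house"])
print(matches)
```

The string comparison only touches the small keywords table; the large table is joined and grouped on integers.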

Answer 1: (score: 0)

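The best solution for this is a FULLTEXT search, but you will probably need a MyISAM table for that. You could set up a mirror table and keep it updated with some events and triggers, or, if you have a slave replicating from your server, you could change its table to MyISAM and use it for searches.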

For this query, the only thing I can come up with is to rewrite it as:

SELECT s1.object_id
FROM object_search s1
JOIN object_search s2 ON s2.object_id = s1.object_id AND s2.key_word = 'word2'
JOIN object_search s3 ON s3.object_id = s1.object_id AND s3.key_word = 'word3'
....
WHERE s1.key_word = 'word1'

I'm not sure it will be faster this way, though.

Also, you need to have an index on object_id (assuming your PK is (key_word, object_id)).
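To check that the self-join form returns the same objects as the GROUP BY version, something like the following could be used. It is a sketch under assumptions: SQLite replaces MySQL, the rows are invented, and `idx_object_id`, `build_self_join` and `search_all` are hypothetical names. It says nothing about which form is faster on the real table.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL; the data is made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE object_search (
        key_word  TEXT    NOT NULL,
        object_id INTEGER NOT NULL,
        PRIMARY KEY (key_word, object_id)
    );
    -- Secondary index on object_id, as suggested above (hypothetical name).
    CREATE INDEX idx_object_id ON object_search (object_id);
""")
conn.executemany("INSERT INTO object_search VALUES (?, ?)",
                 [("woman", 1), ("house", 1), ("woman", 2),
                  ("house", 3), ("tree", 3)])

def build_self_join(n_words):
    """Build the self-join query text for n_words keywords (n_words >= 1)."""
    joins = "".join(
        f"\nJOIN object_search s{i}"
        f" ON s{i}.object_id = s1.object_id AND s{i}.key_word = ?"
        for i in range(2, n_words + 1)
    )
    return ("SELECT s1.object_id\nFROM object_search s1"
            f"{joins}\nWHERE s1.key_word = ?\nORDER BY s1.object_id")

def search_all(conn, words):
    """Return ids of objects that carry every keyword in `words`."""
    sql = build_self_join(len(words))
    # Placeholders appear in the JOINs first and in the WHERE clause last,
    # so the first word has to be bound last.
    params = list(words[1:]) + [words[0]]
    return [row[0] for row in conn.execute(sql, params)]

result = search_all(conn, ["woman", "house"])
print(result)
```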

Answer 2: (score: 0)

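If you INSERT rarely and SELECT often, you can optimize your data for reading, i.e. recompute the number of object_ids for each keyword and store it directly in the database. Then a SELECT will be very fast, and an INSERT will take a few seconds.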
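One possible reading of this idea is a small summary table that is rebuilt after each batch of inserts. The sketch below is only an illustration: SQLite stands in for MySQL, and the `keyword_count` table and the `rebuild_counts` and `hit_count` helpers are hypothetical names that do not appear in the question.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL; the data is made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE object_search (
        keyword   TEXT    NOT NULL,
        object_id INTEGER NOT NULL,
        PRIMARY KEY (keyword, object_id)
    );
    -- Hypothetical summary table: one row per keyword with its object count.
    CREATE TABLE keyword_count (
        keyword      TEXT PRIMARY KEY,
        object_count INTEGER NOT NULL
    );
""")
conn.executemany("INSERT INTO object_search VALUES (?, ?)",
                 [("woman", 1), ("woman", 2), ("house", 1),
                  ("house", 3), ("tree", 3)])

def rebuild_counts(conn):
    """Recompute every per-keyword count; meant to run after a batch of inserts."""
    with conn:  # both statements commit together or not at all
        conn.execute("DELETE FROM keyword_count")
        conn.execute("""
            INSERT INTO keyword_count (keyword, object_count)
            SELECT keyword, COUNT(*) FROM object_search GROUP BY keyword
        """)

def hit_count(conn, keyword):
    """Read the stored count with one primary-key lookup (0 if unknown)."""
    row = conn.execute(
        "SELECT object_count FROM keyword_count WHERE keyword = ?",
        (keyword,)).fetchone()
    return row[0] if row else 0

rebuild_counts(conn)
print(hit_count(conn, "woman"), hit_count(conn, "tree"))
```

The rebuild is the slow step and runs on the write path; reads only hit the small summary table. The stored counts go stale if rows are inserted without a rebuild afterwards.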