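I have built a search engine that lets users search for words across several hundred thousand documents.
Every document is split into words. All words are stored in a table "word", and a row is added to the table "urlword" for each word found in a document. The "score" column is computed with the tf-idf algorithm.
Searching for a single word is simple and fast: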
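For readers unfamiliar with the scoring, here is a minimal sketch of how a per-(document, word) tf-idf score could be computed. The exact variant used for the "score" column is not stated, so the formula (term frequency times the natural log of inverse document frequency), the function name `tf_idf_scores`, and the sample documents are all assumptions for illustration.

```python
import math
from collections import Counter


def tf_idf_scores(docs):
    """Return {(doc_id, word): score} for a dict mapping doc_id -> text.

    Hypothetical helper: one common tf-idf variant, not necessarily
    the one used to fill the urlword.score column.
    """
    # Split each document into lowercase words and count occurrences.
    counts = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for c in counts.values() for word in c)

    scores = {}
    for doc_id, c in counts.items():
        total = sum(c.values())
        for word, n in c.items():
            tf = n / total                      # share of the document taken by this word
            idf = math.log(n_docs / df[word])   # rarer words weigh more
            scores[(doc_id, word)] = tf * idf
    return scores


# Made-up sample corpus; each key plays the role of a url_id.
docs = {1: "mysql search engine", 2: "mysql index", 3: "search search index"}
scores = tf_idf_scores(docs)
```

Each entry of `scores` corresponds to one row of "urlword": the key gives the document and word, the value would go in "score".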
SELECT url_id FROM urlword WHERE word_id = [word id] ORDER BY score DESC LIMIT 10
However, when we want to find documents that contain several words, it gets dramatically slower. For example:
select url_id from urlword uw
where word_id in (21,28)
group by url_id
having count(*) = 2
order by sum(score) DESC
limit 10
or
select uw1.url_id from urlword uw1
inner join urlword uw2 on uw2.word_id=28 and uw1.url_id=uw2.url_id
where uw1.word_id=21
order by uw1.score+uw2.score DESC
limit 10
Both are very slow, taking tens of seconds, when the words occur in many documents.
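To make the intent of the two multi-word queries concrete, here is a small sketch that runs both against a toy table and checks that they return the same ranking. It uses Python's built-in sqlite3 as a stand-in for MySQL, with a cut-down "urlword" table and made-up rows, so it illustrates the semantics only and says nothing about performance. Note that `HAVING COUNT(*) = 2` is only correct because (url_id, word_id) is unique.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE urlword ("
    " url_id INTEGER, word_id INTEGER, score REAL,"
    " UNIQUE (url_id, word_id))"
)

# Made-up rows: (url_id, word_id, score).
rows = [
    (1, 21, 0.50), (1, 28, 0.25),  # url 1 has both words, total 0.75
    (2, 21, 0.75),                 # url 2 has only word 21
    (3, 21, 0.25), (3, 28, 0.75),  # url 3 has both words, total 1.00
    (4, 28, 0.50),                 # url 4 has only word 28
]
conn.executemany("INSERT INTO urlword VALUES (?, ?, ?)", rows)

# Variant 1: filter on both words, keep urls that matched twice.
group_by_sql = """
    SELECT url_id FROM urlword
    WHERE word_id IN (21, 28)
    GROUP BY url_id
    HAVING COUNT(*) = 2
    ORDER BY SUM(score) DESC
    LIMIT 10
"""

# Variant 2: self-join, one alias per word.
join_sql = """
    SELECT uw1.url_id FROM urlword uw1
    INNER JOIN urlword uw2 ON uw2.word_id = 28 AND uw1.url_id = uw2.url_id
    WHERE uw1.word_id = 21
    ORDER BY uw1.score + uw2.score DESC
    LIMIT 10
"""

group_by_result = [r[0] for r in conn.execute(group_by_sql)]
join_result = [r[0] for r in conn.execute(join_sql)]
print(group_by_result, join_result)
```

In both variants the sort key is the sum of scores over rows that belong to different words, so the final ordering cannot be read directly from an index that covers a single word_id.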
Is there a more optimized way to do this with MySQL?
Is Elasticsearch the only way? Would Elasticsearch even perform well here? Or maybe another tool?
MySQL schema:
CREATE TABLE `url` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`url` text COLLATE utf8_unicode_ci,
`urlhash` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`title` text COLLATE utf8_unicode_ci,
PRIMARY KEY (`id`),
UNIQUE KEY `urlhash` (`urlhash`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
CREATE TABLE `urlword` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`count` int(11) DEFAULT NULL,
`score` decimal(20,6) DEFAULT NULL,
`url_id` int(11) DEFAULT NULL,
`word_id` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `UNIQ_DBA1BC7981CFDAE7E357438D` (`url_id`,`word_id`),
KEY `IDX_DBA1BC7932993751` (`score`),
KEY `word_id_url_id_score` (`word_id`,`url_id`,`score`),
KEY `word_id_score` (`word_id`,`score`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
CREATE TABLE `word` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`word` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `UNIQ_C3F17511C3F17511` (`word`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;