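Getting the most similar rows and ordering them by similarity - performance improvement

Time: 2015-05-01 08:14:30

Tags: mysql sql database database-performance query-performance

My items table has a structure similar to this: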

id
user_id
feature_1 
feature_2
feature_3
...
feature_20

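Most of the feature_... fields are numeric; 3-4 of them contain text.

Now I need to find the most similar items (items whose fields hold exactly the same values, each field carrying a certain weight) and order them by similarity.

I can do it like this: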

select (IF (feature_1 = 'xxx1', 100, 0) +  
        IF (feature_2 = 'xxx2', 100, 0) + 
        IF (feature_3 = 'xxx3', 100, 0) + 
        IF (feature_4 = 'xxx4', 1, 0) + 
        ...  + 
        IF (feature_20 = 'xxx20', 1, 0)) 
        AS score, id from `items` where `id` <> 'yyy' 
        group by `id` having `score` > '0' order by `score` desc;

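Of course, in place of xxx I put the actual value of that field from the item I am comparing against, and in place of yyy I put the id of that item (I don't want it included in the results). For each field I can specify the weight I want to use for similarity (here 100 for the first three and 1 for the rest).

Exactly the same technique is used in "Getting most similar rows in MySQL table and order them by similarity".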
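To make the weighted-match scoring concrete, here is a minimal runnable sketch. It is an illustration only: it uses Python's built-in sqlite3 module as a stand-in for MySQL, writes CASE WHEN instead of MySQL's IF(), and the three-feature table, the sample rows, and the helper name similar_items are all made up for the example.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; table contents and weights are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, user_id INTEGER,"
    " feature_1 TEXT, feature_2 TEXT, feature_3 INTEGER)"
)
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, "red", "steel", 40),   # reference item
        (2, 2, "red", "steel", 42),   # matches feature_1 and feature_2
        (3, 2, "blue", "gold", 40),   # matches feature_3 only
        (4, 3, "blue", "gold", 38),   # matches nothing
    ],
)

# Each matching field adds its weight: 100, 100 and 1 in this sketch.
SCORE_SQL = """
SELECT id, score
FROM (
    SELECT id,
           CASE WHEN feature_1 = :f1 THEN 100 ELSE 0 END
         + CASE WHEN feature_2 = :f2 THEN 100 ELSE 0 END
         + CASE WHEN feature_3 = :f3 THEN 1 ELSE 0 END AS score
    FROM items
    WHERE id <> :ref_id
)
WHERE score > 0
ORDER BY score DESC, id
"""


def similar_items(conn, ref_id):
    """Hypothetical helper: load the reference item's values, then score all other items."""
    ref = conn.execute(
        "SELECT feature_1, feature_2, feature_3 FROM items WHERE id = ?",
        (ref_id,),
    ).fetchone()
    params = {"f1": ref[0], "f2": ref[1], "f3": ref[2], "ref_id": ref_id}
    return conn.execute(SCORE_SQL, params).fetchall()


print(similar_items(conn, 1))
```

With this sample data, item 2 scores 200, item 3 scores 1, and item 4 is filtered out because nothing matches.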

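Now comes the performance. I generated a table with about 100000 items. Finding similar items for one item takes about 0.4 second. Even if I could reduce the number of feature_ fields I need to include in the comparison (and I probably won't be allowed to), such a set would still take about 0.16-0.2 second.

Now it gets worse. I need to find similar items for all the items that belong to one user. Let's assume the user has 100 items. I need to fetch them all from the DB, run the above query 100 times, then sort everything by score and remove duplicates (in PHP, but that is not a problem), and then fetch the full records again for display (of course the final result will be paginated).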

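So:

  • I need to run over 100 queries to achieve this (I don't know whether such a query can be run without explicitly putting the values in place of xxx)
  • It would take 100 x 0.4 second = 40 seconds to achieve this

Questions:

  • Is it possible to improve the above query (by using indexes or by rebuilding it) so that it runs faster?
  • Is it possible to rebuild the query so that it gets similar items not for one item but for many items (all the items of one user)?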
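On the point about placing values explicitly: a self-join lets the database read the reference values from the user's own rows, so one statement can cover all of a user's items. The sketch below only illustrates the shape of such a query. It runs on Python's sqlite3 rather than MySQL, uses two made-up features, and makes two assumptions that the question does not state: duplicates are collapsed by keeping each candidate's best score (MAX), and the user's own items are excluded. It still compares every one of the user's items with every other row, so whether it is faster than 100 separate queries would have to be measured.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, user_id INTEGER,"
    " feature_1 TEXT, feature_2 INTEGER)"
)
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?, ?)",
    [
        (1, 10, "red", 40),    # owned by user 10
        (2, 10, "blue", 38),   # owned by user 10
        (3, 20, "red", 40),    # matches item 1 on both features
        (4, 20, "blue", 40),   # matches item 2 on feature_1, item 1 on feature_2
        (5, 30, "green", 99),  # matches nothing
    ],
)

# "mine" = the user's items, "other" = candidate items from other users.
# Assumption: keep the best score a candidate reaches against any of the user's items.
MANY_SQL = """
SELECT id, score
FROM (
    SELECT other.id AS id,
           MAX(CASE WHEN other.feature_1 = mine.feature_1 THEN 100 ELSE 0 END
             + CASE WHEN other.feature_2 = mine.feature_2 THEN 1 ELSE 0 END) AS score
    FROM items AS mine
    JOIN items AS other ON other.user_id <> mine.user_id
    WHERE mine.user_id = :user_id
    GROUP BY other.id
)
WHERE score > 0
ORDER BY score DESC, id
"""

similar_for_user = conn.execute(MANY_SQL, {"user_id": 10}).fetchall()
print(similar_for_user)
```

Here item 3 gets 101 (both features match item 1), item 4 gets 100 (its best match is item 2), and item 5 drops out.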

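I also need to add that not all items have all the feature_ fields filled in (they are nullable). So if I look for similar items for an item whose feature_15 field, for example, is null, I don't want this null field to add anything to the score at all, because the value is unknown for this item.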
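On the nullable fields: in SQL a comparison with NULL is never true, so an equality test against an unknown value adds nothing to the sum, even when the other row is NULL in the same field. MySQL's IF() returns its third argument when the condition evaluates to NULL, which gives the same effect. A small sketch, using Python's sqlite3 with CASE WHEN as a stand-in and made-up rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY,"
    " feature_14 INTEGER, feature_15 INTEGER)"
)
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [
        (1, 7, None),  # reference item: feature_15 is unknown
        (2, 7, None),  # feature_15 is unknown here as well
        (3, 7, 5),
    ],
)

# Reference values are (7, None); None is bound as SQL NULL.
ref = conn.execute(
    "SELECT feature_14, feature_15 FROM items WHERE id = 1"
).fetchone()

# "feature_15 = NULL" evaluates to NULL, so the CASE falls through to 0.
null_scores = conn.execute(
    """
    SELECT id,
           CASE WHEN feature_14 = ? THEN 1 ELSE 0 END
         + CASE WHEN feature_15 = ? THEN 1 ELSE 0 END AS score
    FROM items
    WHERE id <> 1
    ORDER BY id
    """,
    ref,
).fetchall()
print(null_scores)
```

Both other items score 1, from feature_14 only. Note that MySQL's NULL-safe operator <=> would treat NULL <=> NULL as a match, so it is the plain = comparison that gives the behaviour wanted here.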

Edit

I've created the structure as suggested by @pala (database structure below). Now I have 25 records in the features table and 2138959 (yes, over 2 million) records in the feature_watch table.

When I run the sample query:

select if2.watch_id, sum(f.weight) AS `sum`
  from feature_watch if1
    inner join feature_watch if2
      on if1.feature_id = if2.feature_id
        and if1.feature_value = if2.feature_value
        and if1.watch_id <> if2.watch_id
    inner join features f
      on if2.feature_id = f.id
 where if1.watch_id = 71
 group by if2.watch_id
 ORDER BY sum DESC

it now takes 1-2 seconds to get the same result. Am I missing something here?

Database structure:

CREATE TABLE IF NOT EXISTS `features` (
  `id` int(10) unsigned NOT NULL,
  `name` varchar(100) COLLATE utf8_unicode_ci NOT NULL,
  `weight` tinyint(3) unsigned NOT NULL,
  `created_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  `updated_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00'
) ENGINE=InnoDB AUTO_INCREMENT=26 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

CREATE TABLE IF NOT EXISTS `feature_watch` (
  `id` int(10) unsigned NOT NULL,
  `feature_id` int(10) unsigned NOT NULL,
  `watch_id` int(10) unsigned NOT NULL,
  `user_id` int(10) unsigned NOT NULL,
  `feature_value` varchar(150) COLLATE utf8_unicode_ci DEFAULT NULL
) ENGINE=InnoDB AUTO_INCREMENT=2142999 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

ALTER TABLE `features`
  ADD PRIMARY KEY (`id`), ADD UNIQUE KEY `features_name_unique` (`name`), ADD KEY `weight` (`weight`);

ALTER TABLE `feature_watch`
  ADD PRIMARY KEY (`id`), ADD KEY `feature_watch_user_id_foreign` (`user_id`), ADD KEY `feature_id` (`feature_id`,`feature_value`), ADD KEY `watch_id` (`watch_id`);

ALTER TABLE `features`
  MODIFY `id` int(10) unsigned NOT NULL AUTO_INCREMENT,AUTO_INCREMENT=26;

ALTER TABLE `feature_watch`
  MODIFY `id` int(10) unsigned NOT NULL AUTO_INCREMENT,AUTO_INCREMENT=2142999;

ALTER TABLE `feature_watch`
  ADD CONSTRAINT `feature_watch_feature_id_foreign` FOREIGN KEY (`feature_id`) REFERENCES `features` (`id`),
  ADD CONSTRAINT `feature_watch_user_id_foreign` FOREIGN KEY (`user_id`) REFERENCES `users` (`id`) ON DELETE CASCADE,
  ADD CONSTRAINT `feature_watch_watch_id_foreign` FOREIGN KEY (`watch_id`) REFERENCES `watches` (`id`) ON DELETE CASCADE;

EDIT2

For the following query:

select if2.watch_id, sum(f.weight) AS `sum`
  from feature_watch if1
    inner join feature_watch if2
      on if1.feature_id = if2.feature_id
        and if1.feature_value = if2.feature_value
        and if1.watch_id <> if2.watch_id
    inner join features f
      on if2.feature_id = f.id
 where if1.watch_id = 71
   AND if2.`user_id` in (select `id` from `users` where `is_private` = '0')
   and if2.`user_id` <> '1'
 group by if2.watch_id
 ORDER BY sum DESC

EXPLAIN gives:

id  select_type  table  type    possible_keys                         key       key_len  ref                                                    rows  Extra
1   SIMPLE       if1    ref     watch_id,compound,feature_id          watch_id  4        const                                                  22    Using where; Using temporary; Using filesort
1   SIMPLE       f      eq_ref  PRIMARY                               PRIMARY   4        watches10.if1.feature_id                               1     NULL
1   SIMPLE       if2    ref     watch_id,compound,feature_id,user_id  compound  457      watches10.if1.feature_id,watches10.if1.feature_val...  441   Using where; Using index
1   SIMPLE       users  eq_ref  PRIMARY                               PRIMARY   4        watches10.if2.user_id                                  1     Using where

If I want to run the above query for more than just record ID 71 (for example, for 10 record IDs), it runs about x times slower (about 5 seconds for 10 IDs).

1 Answer:

Answer 0: (score: 2)

I'd suggest you reorganise your table structure to something similar to the following:

create table items (id integer primary key auto_increment);

create table features (
  id integer primary key auto_increment,
  feature_name varchar(25),
  feature_weight integer
);

create table item_features (  
  item_id integer,
  feature_id integer,  
  feature_value varchar(25)
);

This will allow you to run a relatively simple query that calculates similarity based on features by summing their weights.

select if2.item_id, sum(f.feature_weight)
  from item_features if1
    inner join item_features if2
      on if1.feature_id = if2.feature_id
        and if1.feature_value = if2.feature_value
        and if1.item_id <> if2.item_id
    inner join features f
      on if2.feature_id = f.id
   where if1.item_id = 1
   group by if2.item_id

There is a demo here: http://sqlfiddle.com/#!9/613970/4
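For readers who cannot open the fiddle, the same idea can be tried locally. This is a minimal sketch rather than a copy of the linked demo: it uses Python's sqlite3 instead of MySQL, the feature names, weights, and rows are invented, and an ORDER BY is added so the result comes back sorted by similarity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE items (id INTEGER PRIMARY KEY);
    CREATE TABLE features (
      id INTEGER PRIMARY KEY, feature_name TEXT, feature_weight INTEGER);
    CREATE TABLE item_features (
      item_id INTEGER, feature_id INTEGER, feature_value TEXT);

    -- Made-up sample data: two features with weights 100 and 1.
    INSERT INTO items (id) VALUES (1), (2), (3);
    INSERT INTO features VALUES (1, 'colour', 100), (2, 'size', 1);
    INSERT INTO item_features VALUES
      (1, 1, 'red'),  (1, 2, '40'),
      (2, 1, 'red'),  (2, 2, '38'),
      (3, 1, 'blue'), (3, 2, '40');
    """
)

# Pair every feature row of the reference item with rows of other items
# that hold the same feature and value, then total the weights per item.
SIMILAR_SQL = """
SELECT if2.item_id, SUM(f.feature_weight) AS score
  FROM item_features if1
    INNER JOIN item_features if2
      ON if1.feature_id = if2.feature_id
        AND if1.feature_value = if2.feature_value
        AND if1.item_id <> if2.item_id
    INNER JOIN features f
      ON if2.feature_id = f.id
 WHERE if1.item_id = ?
 GROUP BY if2.item_id
 ORDER BY score DESC
"""

normalized_scores = conn.execute(SIMILAR_SQL, (1,)).fetchall()
print(normalized_scores)
```

For item 1, item 2 shares the colour (100) and item 3 shares the size (1). A feature that is unknown for an item can simply have no row in item_features, in which case it never joins and never contributes to the total.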

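I know it doesn't match the table definitions in the question - but repeated columns like that in a table are a path to the dark side. Normalisation really does make life easier.

With indexes on item_features(feature_id, feature_value), as well as on features(feature_name), the query should be quite fast.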
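The indexes could be declared roughly as follows. This is a sketch under assumptions: it runs on Python's sqlite3 rather than MySQL, the index names are invented, and the third index on item_id is not part of the answer (it is added because the query filters on item_id, much as the asker's own schema has a key on watch_id). EXPLAIN QUERY PLAN is SQLite's rough counterpart to MySQL's EXPLAIN; its output depends on the SQLite version, so it is only printed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE features (
      id INTEGER PRIMARY KEY, feature_name TEXT, feature_weight INTEGER);
    CREATE TABLE item_features (
      item_id INTEGER, feature_id INTEGER, feature_value TEXT);

    -- The two indexes named in the answer (index names are made up).
    CREATE INDEX idx_item_features_feature_value
      ON item_features (feature_id, feature_value);
    CREATE INDEX idx_features_name ON features (feature_name);

    -- Assumption, not from the answer: an index for the item_id filter.
    CREATE INDEX idx_item_features_item ON item_features (item_id);
    """
)

QUERY = """
SELECT if2.item_id, SUM(f.feature_weight) AS score
  FROM item_features if1
    INNER JOIN item_features if2
      ON if1.feature_id = if2.feature_id
        AND if1.feature_value = if2.feature_value
        AND if1.item_id <> if2.item_id
    INNER JOIN features f
      ON if2.feature_id = f.id
 WHERE if1.item_id = ?
 GROUP BY if2.item_id
"""

# List the indexes that now exist, then ask the planner how it would run the query.
index_names = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
)
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY, (1,))]

print(index_names)
for step in plan:
    print(step)
```

On MySQL, running EXPLAIN on the real query is the way to confirm which of these indexes the optimizer actually picks.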