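My items table has a structure similar to this:
id
user_id
feature_1
feature_2
feature_3
...
feature_20
Most of the feature_... fields are numeric; 3-4 of them contain text.
Now I need to find the items most similar to a given item (identical fields, each with a certain weight) and order them by similarity.
I can do it like this: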
select (IF (feature_1 = 'xxx1', 100, 0) +
IF (feature_2 = 'xxx2', 100, 0) +
IF (feature_3 = 'xxx3', 100, 0) +
IF (feature_4 = 'xxx4', 1, 0) +
... +
IF (feature_20 = 'xxx20', 1, 0))
AS score, id from `items` where `id` <> 'yyy'
group by `id` having `score` > '0' order by `score` desc;
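Of course, in place of xxx I put the actual value of that field for the item I am comparing against, and in place of yyy I put the id of that item (I don't want it included in the results). For each field I can specify the weight I want to use for similarity (here 100 for the first three and 1 for the rest).
Exactly the same technique is used in "Getting most similar rows in MySQL table and order them by similarity".
Now for the performance. I generated a table with about 100000 items. Finding similar items for one item takes about 0.4 second. Even if I could reduce the number of feature_ fields included in the comparison (and I probably won't be allowed to), such a query would still take about 0.16-0.2 second.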
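Now it gets worse. I need to find similar items for all items that belong to one user. Let's assume the user has 100 items. I need to fetch all of them from the DB, run the above query 100 times, then sort everything by score and remove duplicates (in PHP, but that is not the problem), and then fetch the full records again for display (the final result will of course be paginated).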
So, the question: is it possible to run this kind of query without explicitly putting values in the xxx places?
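The merge step described above (one result set per item, then sort by score and remove duplicates) is done in PHP in the question. As a rough sketch only, here is the same idea in Python. The function name `merge_similar`, the sample rows, and the two policies (keep the highest score when an id appears more than once, and drop the user's own items) are assumptions for illustration, not something the question specifies.

```python
def merge_similar(result_sets, own_ids):
    """Merge several lists of (item_id, score) rows into one ranked list.

    Assumed policy: a candidate that appears in several result sets keeps
    its best score, and the user's own items are dropped.
    """
    best = {}
    for rows in result_sets:
        for item_id, score in rows:
            if item_id in own_ids:
                continue  # skip items that belong to the user
            if score > best.get(item_id, 0):
                best[item_id] = score
    # Highest score first; ties broken by id so the order is stable.
    return sorted(best.items(), key=lambda pair: (-pair[1], pair[0]))


# Hypothetical output of two per-item similarity queries.
results_for_item_1 = [(7, 201), (9, 100), (2, 3)]
results_for_item_2 = [(9, 300), (7, 101), (1, 50)]

ranked = merge_similar([results_for_item_1, results_for_item_2], own_ids={1, 2})
print(ranked)
```

Pagination can then be applied to `ranked` before the full records are loaded by id.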
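I also need to add that not all items have every feature field filled in (the fields are nullable), so if I look for items similar to an item whose feature_15 field is null, I don't want feature_15 to count toward the score at all, because its value is unknown for this item.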
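One way to keep unknown features out of the score is to build the score expression from the reference item and leave out every column whose value is null. The sketch below assumes a trimmed four-column version of the items table and uses Python's sqlite3 module as a stand-in for MySQL; `WEIGHTS`, `build_score_query`, and the sample rows are hypothetical. `CASE WHEN ... END` is used in place of MySQL's `IF()` because it works in both engines.

```python
import sqlite3

# Assumed weights, mirroring the question: 100 for the first three
# features, 1 for the rest (only four columns are shown here).
WEIGHTS = {"feature_1": 100, "feature_2": 100, "feature_3": 100, "feature_4": 1}


def build_score_query(reference):
    """Return (sql, params) scoring every other row against `reference`.

    Features that are None on the reference item are left out entirely.
    """
    terms, params = [], []
    for column, weight in WEIGHTS.items():  # names come from a fixed whitelist
        value = reference.get(column)
        if value is None:
            continue  # unknown for the reference item, so it adds nothing
        terms.append(f"CASE WHEN {column} = ? THEN {weight} ELSE 0 END")
        params.append(value)
    score = " + ".join(terms) or "0"
    sql = (
        f"SELECT id, score FROM (SELECT id, {score} AS score "
        "FROM items WHERE id <> ?) AS scored "
        "WHERE score > 0 ORDER BY score DESC, id"
    )
    params.append(reference["id"])
    return sql, params


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, "
    "feature_1, feature_2, feature_3, feature_4)"
)
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?, ?, ?)",
    [
        (1, "a", "b", None, 5),  # reference item, feature_3 unknown
        (2, "a", "b", None, 5),
        (3, "a", "x", "c", 5),
        (4, "z", "y", "c", 9),
    ],
)

reference = {"id": 1, "feature_1": "a", "feature_2": "b",
             "feature_3": None, "feature_4": 5}
sql, params = build_score_query(reference)
rows = conn.execute(sql, params).fetchall()
print(rows)
```

Because the values are bound as parameters, the same builder can be reused for each of the user's items without pasting values into the SQL text.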
EDIT
I created the structure as suggested by @pala (database structure below). Now I have 25 records in the features table and 2138959 (yes, over 2 million) records in the feature_watch table.
When I run the sample query:
select if2.watch_id, sum(f.weight) AS `sum` from feature_watch if1
inner join feature_watch if2 on if1.feature_id = if2.feature_id
and if1.feature_value = if2.feature_value
and if1.watch_id <> if2.watch_id
inner join features f on if2.feature_id = f.id
where if1.watch_id = 71 group by if2.watch_id ORDER BY sum DESC
it now takes 1-2 seconds to get the same result. Am I missing something here?
EDIT2
Database structure:
CREATE TABLE IF NOT EXISTS `features` (
`id` int(10) unsigned NOT NULL,
`name` varchar(100) COLLATE utf8_unicode_ci NOT NULL,
`weight` tinyint(3) unsigned NOT NULL,
`created_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`updated_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00'
) ENGINE=InnoDB AUTO_INCREMENT=26 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
CREATE TABLE IF NOT EXISTS `feature_watch` (
`id` int(10) unsigned NOT NULL,
`feature_id` int(10) unsigned NOT NULL,
`watch_id` int(10) unsigned NOT NULL,
`user_id` int(10) unsigned NOT NULL,
`feature_value` varchar(150) COLLATE utf8_unicode_ci DEFAULT NULL
) ENGINE=InnoDB AUTO_INCREMENT=2142999 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
ALTER TABLE `features`
ADD PRIMARY KEY (`id`), ADD UNIQUE KEY `features_name_unique` (`name`), ADD KEY `weight` (`weight`);
ALTER TABLE `feature_watch`
ADD PRIMARY KEY (`id`), ADD KEY `feature_watch_user_id_foreign` (`user_id`), ADD KEY `feature_id` (`feature_id`,`feature_value`), ADD KEY `watch_id` (`watch_id`);
ALTER TABLE `features`
MODIFY `id` int(10) unsigned NOT NULL AUTO_INCREMENT,AUTO_INCREMENT=26;
ALTER TABLE `feature_watch`
MODIFY `id` int(10) unsigned NOT NULL AUTO_INCREMENT,AUTO_INCREMENT=2142999;
ALTER TABLE `feature_watch`
ADD CONSTRAINT `feature_watch_feature_id_foreign` FOREIGN KEY (`feature_id`) REFERENCES `features` (`id`),
ADD CONSTRAINT `feature_watch_user_id_foreign` FOREIGN KEY (`user_id`) REFERENCES `users` (`id`) ON DELETE CASCADE,
ADD CONSTRAINT `feature_watch_watch_id_foreign` FOREIGN KEY (`watch_id`) REFERENCES `watches` (`id`) ON DELETE CASCADE;
For the following query:
select if2.watch_id, sum(f.weight) AS `sum`
from feature_watch if1
inner join feature_watch if2 on if1.feature_id = if2.feature_id
and if1.feature_value = if2.feature_value
and if1.watch_id <> if2.watch_id
inner join features f on if2.feature_id = f.id
where if1.watch_id = 71
and if2.`user_id` in (select `id` from `users` where `is_private` = '0')
and if2.`user_id` <> '1'
group by if2.watch_id ORDER BY sum DESC
EXPLAIN gives:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | SIMPLE | if1 | ref | watch_id,compound,feature_id | watch_id | 4 | const | 22 | Using where; Using temporary; Using filesort
1 | SIMPLE | f | eq_ref | PRIMARY | PRIMARY | 4 | watches10.if1.feature_id | 1 | NULL
1 | SIMPLE | if2 | ref | watch_id,compound,feature_id,user_id | compound | 457 | watches10.if1.feature_id,watches10.if1.feature_val... | 441 | Using where; Using index
1 | SIMPLE | users | eq_ref | PRIMARY | PRIMARY | 4 | watches10.if2.user_id | 1 | Using where
If I want to run the above query for more than record ID 71 (for example for 10 record IDs), it executes about x times slower (about 5 seconds for 10 IDs).
Answer 0 (score: 2)
I would suggest you reorganize your table structure along the following lines:
create table items (id integer primary key auto_increment);
create table features (
id integer primary key auto_increment,
feature_name varchar(25),
feature_weight integer
);
create table item_features (
item_id integer,
feature_id integer,
feature_value varchar(25)
);
This allows you to run a relatively simple query that calculates similarity based on the matching features by summing their weights.
select if2.item_id, sum(f.feature_weight)
from item_features if1
inner join item_features if2
on if1.feature_id = if2.feature_id
and if1.feature_value = if2.feature_value
and if1.item_id <> if2.item_id
inner join features f
on if2.feature_id = f.id
where if1.item_id = 1
group by if2.item_id
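There is a demo here: http://sqlfiddle.com/#!9/613970/4
I know it doesn't match the table definition in the question, but repeated values in a table are a path to the dark side. Normalization really does make life easier.
With indexes on item_features(feature_id, feature_value) and on features(feature_name), the query should be very fast.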
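For readers who cannot open the fiddle, here is a small self-contained sketch of the same schema and query, run through Python's sqlite3 module rather than MySQL. The sample features, weights, and rows are made up for illustration, and the extra index on item_features(item_id) is an assumption that is not part of the suggestion above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table features (
  id integer primary key,
  feature_name varchar(25),
  feature_weight integer
);
create table item_features (
  item_id integer,
  feature_id integer,
  feature_value varchar(25)
);
-- Index suggested in the answer.
create index idx_item_features_value on item_features (feature_id, feature_value);
-- Assumed extra index to find the rows of the reference item quickly.
create index idx_item_features_item on item_features (item_id);
""")

# Hypothetical features and weights.
conn.executemany(
    "insert into features values (?, ?, ?)",
    [(1, "brand", 100), (2, "color", 100), (3, "size", 1)],
)
# An unknown feature simply has no row, so it can never add to a score.
conn.executemany(
    "insert into item_features values (?, ?, ?)",
    [
        (1, 1, "acme"), (1, 2, "red"), (1, 3, "42"),
        (2, 1, "acme"), (2, 2, "red"), (2, 3, "40"),
        (3, 1, "other"), (3, 2, "blue"), (3, 3, "42"),
        (4, 1, "other"), (4, 2, "blue"), (4, 3, "40"),
    ],
)

query = """
select if2.item_id, sum(f.feature_weight) as score
from item_features if1
inner join item_features if2
  on if1.feature_id = if2.feature_id
 and if1.feature_value = if2.feature_value
 and if1.item_id <> if2.item_id
inner join features f
  on if2.feature_id = f.id
where if1.item_id = ?
group by if2.item_id
order by score desc
"""
similar = conn.execute(query, (1,)).fetchall()
print(similar)
```

Item 2 shares brand and color with item 1, item 3 shares only size, and item 4 shares nothing, so it does not appear in the result at all.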