I have a cities table like this:
|id| Name |
|1 | Paris |
|2 | London |
|3 | New York|
I have a tags table that looks like this:
|id| tag |
|1 | Europe |
|2 | North America |
|3 | River |
And a cities_tags table:
|id| city_id | tag_id |
|1 | 1 | 1 |
|2 | 1 | 3 |
|3 | 2 | 1 |
|4 | 2 | 3 |
|5 | 3 | 2 |
|6 | 3 | 3 |
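How can I work out which cities are most closely related? For example, if I am looking at city 1 (Paris), the result should be: London (2), New York (3).
I found the Jaccard index, but I'm not sure how best to implement it.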
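Answer 0 (score: 15)
You asked how to work out which cities are most closely related, e.g. for city 1 (Paris) the result should be London (2), New York (3). In the data set you provided, the only thing that relates cities to each other is the tags they have in common, so the cities that share tags are the closest ones. The query below uses subqueries to find the cities, other than the one supplied, that share tags with it: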
SELECT * FROM `cities` WHERE id IN (
SELECT city_id FROM `cities_tags` WHERE tag_id IN (
SELECT tag_id FROM `cities_tags` WHERE city_id=1) AND city_id !=1 )
I assume you will supply a city id or name to find its closest matches; in my case that is "Paris", which has id 1.
SELECT tag_id FROM `cities_tags` WHERE city_id=1
This finds all the tag ids that Paris has.
SELECT city_id FROM `cities_tags` WHERE tag_id IN (
SELECT tag_id FROM `cities_tags` WHERE city_id=1) AND city_id !=1
This fetches all the cities other than Paris that share at least one tag with Paris.
Here is your Fiddle.
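To try the shared-tag query without a MySQL server, here is a minimal sketch that loads the question's sample rows into an in-memory SQLite database from Python. SQLite standing in for MySQL is an assumption made for convenience; the query has the same shape as the one in this answer, with an ORDER BY added so the output order is predictable.

```python
import sqlite3

# In-memory copy of the question's sample data (SQLite stands in for MySQL).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cities (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE cities_tags (id INTEGER PRIMARY KEY, city_id INTEGER, tag_id INTEGER);
    INSERT INTO cities VALUES (1, 'Paris'), (2, 'London'), (3, 'New York');
    INSERT INTO cities_tags VALUES
        (1, 1, 1), (2, 1, 3), (3, 2, 1), (4, 2, 3), (5, 3, 2), (6, 3, 3);
""")

# Cities, other than city 1, that share at least one tag with city 1.
rows = conn.execute("""
    SELECT * FROM cities WHERE id IN (
        SELECT city_id FROM cities_tags WHERE tag_id IN (
            SELECT tag_id FROM cities_tags WHERE city_id = 1) AND city_id != 1)
    ORDER BY id
""").fetchall()

for row in rows:
    print(row)
```

Note that this only tells you which cities share a tag with Paris; it does not rank them by how closely they match.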
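While reading up on the Jaccard similarity/index I found some material that explains the actual terms. Say we have two sets, A and B:
Set A = {A,B,C,D,E}
Set B = {I,H,G,F,E,D}
The formula for the Jaccard similarity is JS = (A intersect B) / (A union B), using the sizes of the two sets
A intersect B = {D,E} = 2
A union B = {A,B,C,D,E,I,H,G,F} = 9
JS = 2/9 = 0.2222222222222222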
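The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name jaccard is my own, and returning 0.0 when both sets are empty is an assumption, since the formula is undefined in that case.

```python
def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    if not union:
        # Both sets empty: the ratio is undefined, 0.0 is an assumed convention.
        return 0.0
    return len(a & b) / len(union)


set_a = {"A", "B", "C", "D", "E"}
set_b = {"I", "H", "G", "F", "E", "D"}

js = jaccard(set_a, set_b)
print(len(set_a & set_b), len(set_a | set_b), js)
```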
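Now on to your scenario:
Paris has tag_ids 1,3, so we make a set of these and call it Set P = {Europe, River}
London has tag_ids 1,3, so we make a set of these and call it Set L = {Europe, River}
New York has tag_ids 2,3, so we make a set of these and call it Set NW = {North America, River}
Computing the JS of Paris with London: JSPL = (P intersect L) / (P union L), JSPL = 2/2 = 1
Computing the JS of Paris with New York: JSPNW = (P intersect NW) / (P union NW), JSPNW = 1/3 = 0.3333333333
Here is the query so far, which calculates the perfect Jaccard index; you can see the fiddle example below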
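The same comparison for the three cities can be sketched in Python, independently of any SQL. The dictionary tags_by_city is a hand-written copy of the sample data, not something read from the database, and the sketch assumes every city has at least one tag so the union is never empty.

```python
# Tag sets copied by hand from the question's sample tables.
tags_by_city = {
    "Paris": {"Europe", "River"},
    "London": {"Europe", "River"},
    "New York": {"North America", "River"},
}


def jaccard(a, b):
    # Assumes a and b are not both empty.
    return len(a & b) / len(a | b)


paris = tags_by_city["Paris"]

# Score every other city against Paris, most similar first.
ranking = sorted(
    ((name, jaccard(paris, tags))
     for name, tags in tags_by_city.items() if name != "Paris"),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, score in ranking:
    print(name, score)
```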
SELECT a.*,
( (CASE WHEN a.`intersect` =0 THEN a.`union` ELSE a.`intersect` END ) /a.`union`) AS jaccard_index
FROM (
SELECT q.* ,(q.sets + q.parisset) AS `union` ,
(q.sets - q.parisset) AS `intersect`
FROM (
SELECT cities.`id`, cities.`name` , GROUP_CONCAT(tag_id SEPARATOR ',') sets ,
(SELECT GROUP_CONCAT(tag_id SEPARATOR ',') FROM `cities_tags` WHERE city_id= 1)AS parisset
FROM `cities_tags`
LEFT JOIN `cities` ON (cities_tags.`city_id` = cities.`id`)
GROUP BY city_id ) q
) a ORDER BY jaccard_index DESC
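In the query above I derive the result set through two subselects so that I can use my custom calculated aliases.
You can add a filter to the query above so that the similarity of the city with itself is not calculated: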
SELECT a.*,
( (CASE WHEN a.`intersect` =0 THEN a.`union` ELSE a.`intersect` END ) /a.`union`) AS jaccard_index
FROM (
SELECT q.* ,(q.sets + q.parisset) AS `union` ,
(q.sets - q.parisset) AS `intersect`
FROM (
SELECT cities.`id`, cities.`name` , GROUP_CONCAT(tag_id SEPARATOR ',') sets ,
(SELECT GROUP_CONCAT(tag_id SEPARATOR ',') FROM `cities_tags` WHERE city_id= 1)AS parisset
FROM `cities_tags`
LEFT JOIN `cities` ON (cities_tags.`city_id` = cities.`id`) WHERE cities.`id` !=1
GROUP BY city_id ) q
) a ORDER BY jaccard_index DESC
The result shows that Paris is most closely related to London, and then to New York.
Answer 1 (score: 7)
select c.name, cnt.val/(select count(*) from cities) as jaccard_index
from cities c
inner join
(
select city_id, count(*) as val
from cities_tags
where tag_id in (select tag_id from cities_tags where city_id=1)
and not city_id in (1)
group by city_id
) as cnt
on c.id=cnt.city_id
order by jaccard_index desc
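This query refers to city_id=1 statically, so you will have to make that a variable in both the where tag_id in clause and the not city_id in clause.
If I understand the Jaccard index correctly, this also returns that value, ordered by "most closely related". The results for our example look like this: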
|name |jaccard_index |
|London |0.6667 |
|New York |0.3333 |
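One way to turn the hard-coded city_id=1 into a variable is to bind it as a query parameter. The sketch below does this with Python's sqlite3 module and a named placeholder; the function name related_cities and the use of SQLite instead of MySQL are assumptions. The "* 1.0" is there because SQLite would otherwise do integer division.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cities (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE cities_tags (id INTEGER PRIMARY KEY, city_id INTEGER, tag_id INTEGER);
    INSERT INTO cities VALUES (1, 'Paris'), (2, 'London'), (3, 'New York');
    INSERT INTO cities_tags VALUES
        (1, 1, 1), (2, 1, 3), (3, 2, 1), (4, 2, 3), (5, 3, 2), (6, 3, 3);
""")


def related_cities(conn, city_id):
    """The first query of this answer, with the city id bound as :city."""
    sql = """
        SELECT c.name,
               cnt.val * 1.0 / (SELECT COUNT(*) FROM cities) AS jaccard_index
        FROM cities c
        INNER JOIN (
            SELECT city_id, COUNT(*) AS val
            FROM cities_tags
            WHERE tag_id IN (SELECT tag_id FROM cities_tags WHERE city_id = :city)
              AND city_id != :city
            GROUP BY city_id
        ) AS cnt ON c.id = cnt.city_id
        ORDER BY jaccard_index DESC
    """
    return conn.execute(sql, {"city": city_id}).fetchall()


result = related_cities(conn, 1)
for name, index in result:
    print(f"{name}: {index:.4f}")
```

Binding the value rather than formatting it into the SQL string also avoids quoting problems if the id ever comes from user input.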
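A better understanding of how to implement the Jaccard index:
After reading more about the Jaccard index on Wikipedia, I came up with a better way to implement the query for our sample data set. Basically, we compare our chosen city with every other city in the list independently, and divide the number of tags they have in common by the total number of distinct tags across the two cities.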
select c.name,
case -- when this city's tags are a subset of the chosen city's tags
when not_in.cnt is null
then -- then the union count is the chosen city's tag count
intersection.cnt/(select count(tag_id) from cities_tags where city_id=1)
else -- otherwise the union count is the chosen city's tag count plus everything not in the chosen city's tag list
intersection.cnt/(not_in.cnt+(select count(tag_id) from cities_tags where city_id=1))
end as jaccard_index
-- Jaccard index is defined as the size of the intersection of a dataset, divided by the size of the union of a dataset
from cities c
inner join
(
-- select the count of tags for each city that match our chosen city
select city_id, count(*) as cnt
from cities_tags
where tag_id in (select tag_id from cities_tags where city_id=1)
and city_id!=1
group by city_id
) as intersection
on c.id=intersection.city_id
left join
(
-- select the count of tags for each city that are not in our chosen city's tag list
select city_id, count(tag_id) as cnt
from cities_tags
where city_id!=1
and not tag_id in (select tag_id from cities_tags where city_id=1)
group by city_id
) as not_in
on c.id=not_in.city_id
order by jaccard_index desc
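The query is a bit lengthy, and I don't know how well it scales, but it does implement a true Jaccard index, as the question asked. Here are the results of the new query: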
+----------+---------------+
| name | jaccard_index |
+----------+---------------+
| London | 1.0000 |
| New York | 0.3333 |
+----------+---------------+
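As a cross-check of these numbers, here is a sketch of a different formulation, not the query above: it uses the identity that the size of the union equals the two tag counts added together minus the size of the intersection. It runs against an in-memory SQLite copy of the sample data (an assumption, the thread uses MySQL) and assumes each (city_id, tag_id) pair appears only once. Unlike the query above, it also lists cities that have tags but share none with the chosen city, with an index of 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cities (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE cities_tags (id INTEGER PRIMARY KEY, city_id INTEGER, tag_id INTEGER);
    INSERT INTO cities VALUES (1, 'Paris'), (2, 'London'), (3, 'New York');
    INSERT INTO cities_tags VALUES
        (1, 1, 1), (2, 1, 3), (3, 2, 1), (4, 2, 3), (5, 3, 2), (6, 3, 3);
""")

# intersection = tags of the other city that the chosen city also has
# union        = other city's tag count + chosen city's tag count - intersection
sql = """
    SELECT c.name,
           COUNT(mine.tag_id) * 1.0
             / (sizes.n
                + (SELECT COUNT(*) FROM cities_tags WHERE city_id = :city)
                - COUNT(mine.tag_id)) AS jaccard_index
    FROM cities c
    JOIN (SELECT city_id, COUNT(*) AS n
          FROM cities_tags GROUP BY city_id) AS sizes ON sizes.city_id = c.id
    JOIN cities_tags theirs ON theirs.city_id = c.id
    LEFT JOIN cities_tags mine
           ON mine.tag_id = theirs.tag_id AND mine.city_id = :city
    WHERE c.id != :city
    GROUP BY c.id, c.name, sizes.n
    ORDER BY jaccard_index DESC
"""

rows = conn.execute(sql, {"city": 1}).fetchall()
for name, index in rows:
    print(f"{name}: {index:.4f}")
```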
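Edited again to add comments to the query, and to account for the case where the current city's tags are a subset of the chosen city's tags.
Answer 2 (score: 2)
This query has no fancy functions or even subqueries. It is fast. Just make sure cities.id, cities_tags.id, cities_tags.city_id and cities_tags.tag_id are all indexed.
The query returns rows containing city1, city2, and a count of how many tags city1 and city2 have in common.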
select
c1.name as city1
,c2.name as city2
,count(ct2.tag_id) as match_count
from
cities as c1
inner join cities as c2 on
c1.id != c2.id -- change != into > if you dont want duplicates
left join cities_tags as ct1 on -- use inner join to filter cities with no match
ct1.city_id = c1.id
left join cities_tags as ct2 on -- use inner join to filter cities with no match
ct2.city_id = c2.id
and ct1.tag_id = ct2.tag_id
group by
c1.id
,c2.id
order by
c1.id
,match_count desc
,c2.id
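Change != into > to avoid each pair of cities being returned twice. That means a pair will no longer appear once with a city in the first column and again with that same city in the second column.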
If you don't want to see city combinations with no tag matches, change the two left join into inner join.
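The effect of these two switches can be illustrated outside SQL with a small Python sketch. itertools.permutations plays the role of c1.id != c2.id (every ordered pair), itertools.combinations plays the role of the > variant (each pair once, although the SQL version would list the higher id first), and the final filter mimics what the inner joins do. The names here are my own and the tag sets are copied by hand from the question.

```python
from itertools import combinations, permutations

# tag_id sets per city, copied by hand from the question's cities_tags table.
tags_by_city = {
    "Paris": {1, 3},
    "London": {1, 3},
    "New York": {2, 3},
}


def match_counts(pairs):
    """(city1, city2, number of shared tags) for each pair of city names."""
    return [(a, b, len(tags_by_city[a] & tags_by_city[b])) for a, b in pairs]


# Like c1.id != c2.id: every pair shows up in both directions.
both_directions = match_counts(permutations(tags_by_city, 2))

# Like c1.id > c2.id: every pair shows up once.
each_pair_once = match_counts(combinations(tags_by_city, 2))

# Like switching to inner joins: pairs without a shared tag are dropped.
with_matches_only = [row for row in each_pair_once if row[2] > 0]

print(len(both_directions), len(each_pair_once), len(with_matches_only))
```

In the sample data every pair shares the River tag, so the last filter removes nothing here.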
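Answer 3 (score: 2)
A bit late, but I think none of the answers is completely correct. I took the best parts of each one and put them all together to make my own answer: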
The implementation of (q.sets + q.parisset) AS union and (q.sets - q.parisset) AS intersect is very wrong.
A cities table like this:
| id | Name |
| 1 | Paris |
| 2 | Florence |
| 3 | New York |
| 4 | São Paulo |
| 5 | London |
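With this sample data, Florence is a full match with Paris, New York matches one tag, São Paulo has no tag matches, and London matches two tags and has one more besides. I think the Jaccard index for this sample is:
Florence: 1.000 (2/2)
London: 0.666 (2/3)
New York: 0.333 (1/3)
São Paulo: 0.000 (0/3)
My query is like this: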
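These fractions can be reproduced with a short Python sketch. The tag sets below are hypothetical: the tag tables for this sample are not shown here, so I picked sets that simply satisfy the description (Florence identical to Paris, London with one extra tag, and so on). The point it illustrates is that London and Florence share the same number of tags with Paris but get different Jaccard indexes.

```python
# Hypothetical tag sets, chosen only to match the fractions listed above.
tags_by_city = {
    "Paris": {"Europe", "River"},
    "Florence": {"Europe", "River"},
    "New York": {"North America", "River"},
    "São Paulo": {"South America"},
    "London": {"Europe", "River", "Island"},
}


def jaccard(a, b):
    # Assumes a and b are not both empty.
    return len(a & b) / len(a | b)


paris = tags_by_city["Paris"]

# For each other city: (number of shared tags, Jaccard index).
scores = {
    name: (len(paris & tags), jaccard(paris, tags))
    for name, tags in tags_by_city.items() if name != "Paris"
}

ranking = sorted(scores, key=lambda name: scores[name][1], reverse=True)
for name in ranking:
    shared, index = scores[name]
    print(f"{name}: shared={shared} jaccard={index:.3f}")
```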
cities_tag
Answer 4 (score: 1)
Could this be a push in the right direction?
SELECT cities.name, (
    -- count this city's tags that the original city also has
    SELECT COUNT(*) FROM cities_tags
    WHERE cities_tags.city_id = cities.id
    AND cities_tags.tag_id IN (
        -- tag ids of the original city (id 1 is assumed here)
        SELECT ct.tag_id
        FROM cities_tags AS ct
        WHERE ct.city_id = 1
    )
) AS matchCount
FROM cities
WHERE cities.id != 1 -- skip the original city itself
HAVING matchCount > 0
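What I tried to do is:
// Find the city names:
Get cities.name, (SUBQUERY) as matchCount FROM cities WHERE matchCount > 0
// Subquery:
Select the count of tags the city has that are also in (SUBSUBQUERY)
// Sub-subquery:
Select the tag ids of the original city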