Question

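I have a list of news items from several newspapers (fetched from RSS feeds). Suppose each newspaper returns a list of news items with tags. For example: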

newspaper1:

title1, tag1, tag2, tag3
title2, tag1, tag7, tag5, tag8

newspaper2:

title3, tag3, tag4, tag5
title4, tag1, tag5, tag7, tag9, tag10

So I'm thinking of storing all the news in one table (newspaper_id, news_id, title), and then storing one row per tag in another table (news_id, tag_name).

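Now I need to query the tables and compare each news item of the first newspaper with the news items of the other newspapers, returning the similar ones. In my sample data, title1 shares one tag each with title3 and title4 from the other newspaper, while title2 shares three tags with title4 and only one with title3. That is what I need: for each news item of a newspaper, how many tags it shares with every other news item.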

I've been trying GROUP BY or an INNER JOIN of the table with itself, with no success. Any ideas?

Create table and insert statements:

CREATE TABLE news (
newspaper_id INT(6),
news_id INT(6) PRIMARY KEY,
title VARCHAR(250) NOT NULL 
); 

CREATE TABLE tags ( 
news_id INT(6) NOT NULL,
name VARCHAR(30) NOT NULL
); 


INSERT INTO `news` VALUES
(1, 1, 'USA elections'), (1, 2, 'Coronavirus crisis'),
(2, 3, 'Another thing about USA elections'), (2, 4, 'Who will win elections?'),
(3, 5, 'Coronavirus affetcs elections');

INSERT INTO `tags` VALUES
(1, 'elections'), (1, 'biden'), (1, 'trump'),
(2, 'coronavirus'),
(3, 'biden'), (3, 'trump'), (3, 'elections'),
(4, 'elections'),
(5, 'coronavirus'), (5, 'elections');

Expected result:

| Title                | news_id | compared_news_id | Tags in common |
| -------------------- | ------- | ---------------- | -------------- |
| 'USA elections'      | 1       | 3                | 3              |
| 'USA elections'      | 1       | 4                | 1              |
| 'USA elections'      | 1       | 5                | 1              |
| 'Coronavirus crisis' | 2       | 5                | 1              |

Answer 1

If you don't care about news items that have no match at all, you only need to look at the matching tags.

select
  n1.news_id, n1.title,
  n2.news_id as compared_news_id, n2.title as compared_news_title,
  count(*) as tags_in_common
from news n1
join news n2 on n2.news_id <> n1.news_id
join tags t1 on t1.news_id = n1.news_id
join tags t2 on t2.news_id = n2.news_id and t2.name = t1.name
where n1.newspaper_id = 1
group by n1.news_id, n2.news_id
order by n1.news_id, n2.news_id;

Demo: https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=6ff1db3be344c40b82f892654ca08e3a

If you don't want to restrict this to one newspaper, remove where n1.newspaper_id = 1. In that case, if you want to avoid getting both news1/news5 and news5/news1 in the results, change n2.news_id <> n1.news_id to n2.news_id > n1.news_id.
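
A sketch of that variant, assuming the news and tags tables from the question. The where clause is gone and the self-join uses > so that each pair shows up only once:

```sql
-- Compare all news items with each other, listing each pair only once.
select
  n1.news_id, n1.title,
  n2.news_id as compared_news_id, n2.title as compared_news_title,
  count(*) as tags_in_common
from news n1
join news n2 on n2.news_id > n1.news_id  -- > instead of <> drops the mirrored pair
join tags t1 on t1.news_id = n1.news_id
join tags t2 on t2.news_id = n2.news_id and t2.name = t1.name
group by n1.news_id, n2.news_id
order by n1.news_id, n2.news_id;
```

With the sample data this should give seven pairs: 1/3 with three tags in common, then 1/4, 1/5, 2/5, 3/4, 3/5 and 4/5 with one tag each.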

Of course, if you don't want to compare news from the same newspaper, you can also change on n2.news_id <> n1.news_id to on n2.news_id <> n1.news_id and n2.newspaper_id <> n1.newspaper_id.
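
As a sketch, again assuming the tables from the question, the query with the extra newspaper condition would look roughly like this:

```sql
-- Compare the news of newspaper 1 only with news from other newspapers.
select
  n1.news_id, n1.title,
  n2.news_id as compared_news_id, n2.title as compared_news_title,
  count(*) as tags_in_common
from news n1
join news n2 on n2.news_id <> n1.news_id
            and n2.newspaper_id <> n1.newspaper_id  -- skip news from the same newspaper
join tags t1 on t1.news_id = n1.news_id
join tags t2 on t2.news_id = n2.news_id and t2.name = t1.name
where n1.newspaper_id = 1
group by n1.news_id, n2.news_id
order by n1.news_id, n2.news_id;
```

On the sample data the result should be the same four rows as in the expected result, because the two news items of newspaper 1 have no tag in common anyway. The extra condition only makes a difference once news items of the same newspaper share tags.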

Answer 2

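Step 1: Join the news and tags tables on their key (news_id).

Step 2: Make two instances of that join.

Step 3: Join the two instances on the tag name.

Step 4: Filter the rows so that a news item is never matched with itself in the final result.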

select
  n1.title,
  n2.title as compared_news_title,
  n1.news_id, 
  n2.news_id as compared_news_id, 
  count(*) as tags_in_common
from 
  news n1,
  tags t1, 
  news n2, 
  tags t2 
where 
t1.news_id = n1.news_id
and t2.news_id = n2.news_id
and t2.name = t1.name
and n2.news_id <> n1.news_id
group by n1.news_id, n2.news_id
order by n1.news_id, n2.news_id;

Link to Fiddle

Find similar news based on similar tags in SQL

2 Answers: