As the title says, I'm having trouble implementing a related-articles algorithm. Let me start by listing the tables in the database:
[articles]
id_article
id_category
name
content
publish_date
is_deleted
[categories]
id_category
id_parent
name
[tags_to_articles]
id_tag
id_article
[tags]
id_tag
name
[articles_to_authors]
id_article
id_author
[authors]
id_author
name
is_deleted
[related_articles]
id_article_left
id_article_right
related_score
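Every table except related_articles contains data. Now I want to fill related_articles with scores between articles (very important: the table works as a directed graph, so the score from article A to article B may differ from the score from B to A; see the list). The score is calculated as follows:
I tried a query like this: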
SELECT a.id_article, b.id_article, a.id_category, a.publish_date,
b.id_category, b.publish_date,
c.id_tag,
e.id_author
FROM `articles` a, articles b,
tags_to_articles c, tags_to_articles d,
articles_to_authors e, articles_to_authors f
WHERE a.id_article <> b.id_article AND
(
(a.id_article=c.id_article and c.id_tag=d.id_tag and d.id_article=b.id_article)
OR
(a.id_article=e.id_article and e.id_author=f.id_author and f.id_article=b.id_article)
OR
(a.id_category=b.id_category)
)
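In theory, this lists every pair of articles worth scoring. In practice, it takes far too much time and too many resources.
Is there another way to do this? I'm also willing to adjust the algorithm or the tables if that leads to a workable solution. Also worth noting: the score calculation runs from cron; I certainly don't want to run it on every page request.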
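To illustrate why a query of this shape gets expensive, here is a minimal sketch using Python's sqlite3 module as a stand-in for the real database. The tiny data set is made up for illustration. Because the six tables are comma-joined and only tied together inside the OR branches, any pair that matches on category alone is repeated once for every combination of rows from the four link-table aliases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE articles (id_article INTEGER PRIMARY KEY, id_category INTEGER);
CREATE TABLE tags_to_articles (id_tag INTEGER, id_article INTEGER);
CREATE TABLE articles_to_authors (id_article INTEGER, id_author INTEGER);
-- Hypothetical sample data: two articles in the same category.
INSERT INTO articles VALUES (1, 10), (2, 10);
INSERT INTO tags_to_articles VALUES (100, 1), (101, 1), (100, 2);
INSERT INTO articles_to_authors VALUES (1, 7), (2, 8);
""")

# Same join shape as the query in the question, reduced to a row count.
or_join_rows = conn.execute("""
SELECT COUNT(*)
FROM articles a, articles b,
     tags_to_articles c, tags_to_articles d,
     articles_to_authors e, articles_to_authors f
WHERE a.id_article <> b.id_article AND (
      (a.id_article = c.id_article AND c.id_tag = d.id_tag AND d.id_article = b.id_article)
   OR (a.id_article = e.id_article AND e.id_author = f.id_author AND f.id_article = b.id_article)
   OR (a.id_category = b.id_category)
)
""").fetchone()[0]

# The pairs that actually matter here: articles sharing a category.
distinct_pairs = conn.execute("""
SELECT COUNT(*)
FROM articles a JOIN articles b
  ON a.id_article <> b.id_article AND a.id_category = b.id_category
""").fetchone()[0]

# 2 pairs, each repeated 3 * 3 * 2 * 2 times by the unconstrained aliases.
print(or_join_rows, distinct_pairs)
```

With only two articles, the OR-join already yields 72 rows for 2 distinct pairs, and the multiplier grows with the square of each link table's size.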
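Answer 0 (score: 4)
I seriously doubt you can do something like this in a single statement and get any kind of performance. Break it into pieces. Use temp tables. Use set operations.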
-- First, let's list all articles that share a category.
SELECT a1.id_article as 'left_article',
a2.id_article as 'right_article',
1 as 'score'
INTO #tempscore
FROM #articles a1
INNER JOIN #articles a2 ON
a1.id_category = a2.id_category
AND a1.id_article <> a2.id_article
-- Now, let's add up everything that shares an author
INSERT INTO #tempscore (left_article, right_article, score)
SELECT ata1.id_article,
ata2.id_article,
2
FROM #articles_to_authors ata1
INNER JOIN #articles_to_authors ata2 ON
ata1.id_author = ata2.id_author
-- Now, let's add up everything that shares a tag
INSERT INTO #tempscore (left_article, right_article, score)
SELECT ata1.id_article,
ata2.id_article,
4
FROM #tags_to_articles ata1
INNER JOIN #tags_to_articles ata2 ON
ata1.id_tag = ata2.id_tag
-- We haven't looked at dates, yet, but let's go ahead and consolidate what we know.
SELECT left_article as 'left_article',
right_article as 'right_article',
SUM (score) as 'total_score'
INTO #cscore
FROM #tempscore
GROUP BY left_article,
right_article
-- Clean up some extraneous stuff (self-pairs produced by the author and tag joins)
DELETE FROM #cscore WHERE left_article = right_article
-- Now we need to deal with dates
SELECT DateDiff (Day, art1.publish_date, art2.publish_date) as 'datescore',
art1.id_article as 'left_article',
art2.id_article as 'right_article'
INTO #datescore
FROM #cscore
INNER JOIN #articles art1 ON
#cscore.left_article = art1.id_article
INNER JOIN #articles art2 ON
#cscore.right_article = art2.id_article
WHERE art1.publish_date > art2.publish_date
-- And finally, put it all together
INSERT INTO #related_articles (id_article_left, id_article_right, related_score)
SELECT s1.left_article,
s1.right_article,
s1.total_score + IsNull (s2.datescore, 0)
FROM #cscore s1
LEFT JOIN #datescore s2 ON
s1.left_article = s2.left_article
AND s1.right_article = s2.right_article
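In my tests the scores looked correct, but I don't have any real sample data, so I can't be sure. If nothing else, this should give you a base to start from.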
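The staged approach above can be tried out on a small scale. What follows is a minimal sketch, not the answer's exact code: it uses Python's sqlite3 module instead of SQL Server, invents a few sample rows, reuses the weights 1, 2 and 4 from the statements above, and leaves out the date step because the date rule is not spelled out in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE articles (id_article INTEGER PRIMARY KEY, id_category INTEGER);
CREATE TABLE tags_to_articles (id_tag INTEGER, id_article INTEGER);
CREATE TABLE articles_to_authors (id_article INTEGER, id_author INTEGER);
CREATE TABLE related_articles (
    id_article_left INTEGER, id_article_right INTEGER, related_score INTEGER);

-- Hypothetical sample data.
INSERT INTO articles VALUES (1, 10), (2, 10), (3, 20);
INSERT INTO tags_to_articles VALUES (100, 1), (100, 2), (101, 1), (101, 3);
INSERT INTO articles_to_authors VALUES (1, 7), (3, 7);

CREATE TEMP TABLE tempscore (
    left_article INTEGER, right_article INTEGER, score INTEGER);

-- One narrow self-join per signal; self-pairs are excluded up front.
INSERT INTO tempscore
SELECT a1.id_article, a2.id_article, 1
FROM articles a1 JOIN articles a2
  ON a1.id_category = a2.id_category AND a1.id_article <> a2.id_article;

INSERT INTO tempscore
SELECT x.id_article, y.id_article, 2
FROM articles_to_authors x JOIN articles_to_authors y
  ON x.id_author = y.id_author AND x.id_article <> y.id_article;

INSERT INTO tempscore
SELECT x.id_article, y.id_article, 4
FROM tags_to_articles x JOIN tags_to_articles y
  ON x.id_tag = y.id_tag AND x.id_article <> y.id_article;

-- Consolidate: one row per directed pair.
INSERT INTO related_articles
SELECT left_article, right_article, SUM(score)
FROM tempscore
GROUP BY left_article, right_article;
""")

scores = {(left, right): score
          for left, right, score in conn.execute("SELECT * FROM related_articles")}
print(scores)
```

Articles 1 and 2 share a category and one tag (1 + 4 = 5), while articles 1 and 3 share an author and one tag (2 + 4 = 6); articles 2 and 3 share nothing and get no row.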
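Answer 1 (score: 2)
Your approach has the right idea: you need a Cartesian product of the articles table with itself. This is the best solution I could come up with, but it needs some testing (x, y and z stand for the weights you choose):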
INSERT INTO related_articles
SELECT a_left.id_article,a_right.id_article,
IF(a_left.id_category = a_right.id_category,x,0) +
IF( IFNULL(atu1.id_author,0) AND IFNULL(atu2.id_author,0),
IF(atu1.id_author = atu2.id_author,y,0), 0
) +
IF( IFNULL(tta1.id_tag,0) AND IFNULL(tta2.id_tag,0),
IF(tta1.id_tag = tta2.id_tag,z,0), 0
)
-(UNIX_TIMESTAMP() - UNIX_TIMESTAMP(a_right.publish_date)) AS score
FROM
articles a_left join articles a_right ON a_left.id_article<>a_right.id_article
AND a_left.id_article > CHECKPOINT_ID
LEFT OUTER JOIN articles_to_authors atu1 ON atu1.id_article = a_left.id_article
LEFT OUTER JOIN articles_to_authors atu2 ON atu2.id_article = a_right.id_article
LEFT OUTER JOIN tags_to_articles tta1 ON tta1.id_article = a_left.id_article
LEFT OUTER JOIN tags_to_articles tta2 ON tta2.id_article = a_right.id_article
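You may need two extra LEFT JOINs to handle deleted authors. The key here is the CHECKPOINT_ID parameter, which lets you run this process incrementally and so handle new articles as they arrive. An alternative (though I don't see a reason for it) would be to add a condition such as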
... ON a_left.id_article<>a_right.id_article AND
NOT EXISTS(SELECT id_article_left FROM
related_articles WHERE id_article_left = a_left.id_article AND
id_article_right = a_right.id_article) ...
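The checkpoint idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the query above: it uses Python's sqlite3 module, scores only on shared category with a fixed weight of 1, and the helper name score_new_articles is hypothetical. Like the condition in the answer, it only adds rows whose left article is newer than the checkpoint; rows pointing from older articles to new ones would need the mirrored condition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE articles (id_article INTEGER PRIMARY KEY, id_category INTEGER);
CREATE TABLE related_articles (
    id_article_left INTEGER, id_article_right INTEGER, related_score INTEGER);
INSERT INTO articles VALUES (1, 10), (2, 10);
""")

def score_new_articles(conn, checkpoint_id):
    """Score pairs whose left article is newer than checkpoint_id.

    Returns the highest article id seen, to be stored as the next checkpoint.
    """
    conn.execute("""
        INSERT INTO related_articles
        SELECT l.id_article, r.id_article, 1
        FROM articles l JOIN articles r
          ON l.id_article <> r.id_article
         AND l.id_category = r.id_category
         AND l.id_article > ?
    """, (checkpoint_id,))
    return conn.execute(
        "SELECT COALESCE(MAX(id_article), 0) FROM articles").fetchone()[0]

# First cron run: nothing scored yet, so start from 0.
checkpoint = score_new_articles(conn, 0)

# A new article is published between runs.
conn.execute("INSERT INTO articles VALUES (3, 10)")

# Second cron run: only article 3 appears on the left side.
checkpoint = score_new_articles(conn, checkpoint)

rows = conn.execute("""
    SELECT id_article_left, id_article_right
    FROM related_articles ORDER BY 1, 2
""").fetchall()
print(checkpoint, rows)
```

In a real cron job the checkpoint would have to be persisted between runs, for example in a small settings table.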
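Answer 2 (score: 0)
I used the following approach in SQL Server.
I assign relevant tags to each article.
Then I find related articles by matching tags: the more tags two articles share, the more related they are.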
ALTER PROCEDURE [dbo].[GetRelatedArticles]
@ArticleLang int,
@ArticleURI varchar(100),
@Count int = 10
AS
SET NOCOUNT ON
DECLARE @URI dbo.URICountType;
INSERT INTO @URI([URI], [Count])
SELECT TOP (@Count) ArticleTag.ArticleURI, COUNT(ArticleTag.ArticleURI) AS ArticleCount
FROM ArticleTag WITH (NOLOCK)
INNER JOIN ArticleTag AS ArticleTags WITH (NOLOCK)
ON ArticleTags.ArticleURI = @ArticleURI
AND ArticleTag.ArticleURI <> @ArticleURI
AND ArticleTag.ArticleTag = ArticleTags.ArticleTag
GROUP BY ArticleTag.ArticleURI
ORDER BY COUNT(ArticleTag.ArticleURI) DESC
SELECT Article.ArticleURI, Article.ArticleLang
FROM Article WITH (NOLOCK)
INNER JOIN (
SELECT MIN(ABS(ArticleLang-@ArticleLang)) AS ArticleLangDifference, ArticleURI
FROM Article WITH (NOLOCK)
WHERE ArticleURI IN (SELECT URI FROM @URI)
GROUP BY ArticleURI
) AS ArticleGroup
ON Article.ArticleURI = ArticleGroup.ArticleURI
AND ABS(Article.ArticleLang-@ArticleLang) = ArticleGroup.ArticleLangDifference
INNER JOIN @URI AS URI
ON Article.ArticleURI = URI.URI
ORDER BY URI.Count DESC, Article.ArticleLastUpdate DESC
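The core of this approach, ranking other articles by the number of tags they share with a given one, can be sketched on its own. The example below is an assumption-laden simplification: it uses Python's sqlite3 module rather than SQL Server, ignores the language matching done by the procedure, uses made-up sample rows, and the function name related_by_tags is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ArticleTag (ArticleURI TEXT, ArticleTag TEXT);
-- Hypothetical sample data.
INSERT INTO ArticleTag VALUES
  ('a', 'sql'), ('a', 'mysql'), ('a', 'cron'),
  ('b', 'sql'), ('b', 'mysql'),
  ('c', 'sql'),
  ('d', 'python');
""")

def related_by_tags(conn, article_uri, count=10):
    """Return (uri, shared_tag_count) for the articles sharing the most tags."""
    return conn.execute("""
        SELECT other.ArticleURI, COUNT(*) AS shared_tags
        FROM ArticleTag AS mine
        JOIN ArticleTag AS other
          ON other.ArticleTag = mine.ArticleTag
         AND other.ArticleURI <> mine.ArticleURI
        WHERE mine.ArticleURI = ?
        GROUP BY other.ArticleURI
        ORDER BY shared_tags DESC, other.ArticleURI
        LIMIT ?
    """, (article_uri, count)).fetchall()

related = related_by_tags(conn, 'a')
print(related)
```

Here article 'b' shares two tags with 'a' and article 'c' shares one, so 'b' ranks first; 'd' shares none and is not returned.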