I have a table of documents with the following schema:
CREATE TABLE Frequency (
docid VARCHAR(255),
term VARCHAR(255),
count int,
PRIMARY KEY(docid, term));
To compute the raw similarity score between every pair of documents, I use:
SELECT a.docid, b.docid, SUM(a.count * b.count)
FROM Frequency a, Frequency b
WHERE a.term = b.term
GROUP BY a.docid, b.docid;
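I am not sure why this works, but on the test data it does compute D * DT, where DT is the transpose of D.
I now need to compute the similarity for a query/text string, for example "congress gun laws".
I think this involves unions and grouping, but all of my query attempts fail, for example: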
SELECT *
FROM Frequency a, Frequency b, Frequency c
Where a.term = b.term
UNION
SELECT a.docid, 'congress' as term, 1 as count
UNION
SELECT b.docid , 'gun' as term, 1 as count
UNION
SELECT c.docid , 'laws' as term, 1 as count
Group by docid;
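I am new to this kind of SQL and would appreciate a narrative while I try to understand what I am doing.
Please explain why the first query works and how I should approach the second one.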
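Answer 0: (score: 2)
Simply put, what we really want to do is add new tuples to the table, and then compare this new table with the old one using the matrix-transpose operation mentioned above. What you need is to "tag" these new keywords so that you can refer to them in the query's conditions. So this: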
SELECT b.docid, b.term, SUM(a.count * b.count)
FROM (SELECT * FROM Frequency
UNION
SELECT 'q' as docid, 'congress' as term, 1 as count
UNION
SELECT 'q' as docid, 'gun' as term, 1 as count
UNION
SELECT 'q' as docid, 'laws' as term, 1 as count
) a, Frequency b
WHERE a.term = b.term
AND a.docid = 'q'
GROUP BY b.docid, b.term
ORDER BY SUM(a.count * b.count);
will give you a list of docids, each with a matching term and the corresponding similarity score.
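As for why the first query works: joining on a.term = b.term pairs up entries that sit in the same term column, and summing the products per (docid, docid) pair is the dot product of two document rows, which is one entry of D * DT. Here is a minimal sketch you can run to check this, assuming SQLite via Python's sqlite3 module and a few toy rows I made up (the helper names make_db, pairwise_scores and query_scores are mine, not from the question). The query variant groups by b.docid only, so you get one total score per document instead of one row per term, and it skips the UNION with Frequency because only the 'q' rows survive the filter anyway.

```python
import sqlite3

# Toy data, invented purely for illustration.
ROWS = [
    ("d1", "congress", 2),
    ("d1", "gun", 1),
    ("d2", "gun", 3),
    ("d2", "laws", 1),
    ("d3", "tax", 5),
]


def make_db():
    """Create an in-memory Frequency table with the schema from the question."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Frequency ("
        "docid VARCHAR(255), term VARCHAR(255), count int, "
        "PRIMARY KEY(docid, term))"
    )
    conn.executemany("INSERT INTO Frequency VALUES (?, ?, ?)", ROWS)
    return conn


def pairwise_scores(conn):
    """Entry (a, b) of D * DT: dot product of the term vectors of docs a and b."""
    sql = """
        SELECT a.docid, b.docid, SUM(a.count * b.count)
        FROM Frequency a, Frequency b
        WHERE a.term = b.term
        GROUP BY a.docid, b.docid
    """
    return {(a, b): score for a, b, score in conn.execute(sql)}


def query_scores(conn, terms):
    """Treat the query as an extra document 'q' with count 1 per term,
    then sum over matching terms to get one score per stored document."""
    union = " UNION ".join(
        "SELECT 'q' AS docid, ? AS term, 1 AS count" for _ in terms
    )
    sql = f"""
        SELECT b.docid, SUM(a.count * b.count) AS score
        FROM ({union}) a, Frequency b
        WHERE a.term = b.term
        GROUP BY b.docid
        ORDER BY score DESC
    """
    return conn.execute(sql, list(terms)).fetchall()


if __name__ == "__main__":
    conn = make_db()
    print(pairwise_scores(conn))
    print(query_scores(conn, ["congress", "gun", "laws"]))
```

Documents that share no term with the query (d3 in the toy data) simply produce no row, because the join finds nothing to pair.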
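Answer 1: (score: 0)
Your question and comments are hard to understand.
However, the following query shows the occurrence counts of the three terms for every document that contains all three of them: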
SELECT a.docid,
a.count,
b.count,
c.count
FROM Frequency AS a
JOIN Frequency AS b ON a.docid = b.docid
JOIN Frequency AS c ON b.docid = c.docid
WHERE a.term = 'congress'
AND b.term = 'gun'
AND c.term = 'laws'
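Note that this is an AND condition: a document missing any one of the three terms is dropped, so it is not a similarity score with partial matches. A quick sketch to see that behaviour, assuming SQLite through Python's sqlite3 module and made-up rows (the function name docs_with_all_terms is only for this example):

```python
import sqlite3


def docs_with_all_terms(conn):
    """Run the three-way self-join: one alias of Frequency per required term."""
    sql = """
        SELECT a.docid, a.count, b.count, c.count
        FROM Frequency AS a
        JOIN Frequency AS b ON a.docid = b.docid
        JOIN Frequency AS c ON b.docid = c.docid
        WHERE a.term = 'congress'
          AND b.term = 'gun'
          AND c.term = 'laws'
    """
    return conn.execute(sql).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Frequency ("
    "docid VARCHAR(255), term VARCHAR(255), count int, "
    "PRIMARY KEY(docid, term))"
)
# Toy rows, invented for illustration: d1 has all three terms, d2 lacks 'congress'.
conn.executemany(
    "INSERT INTO Frequency VALUES (?, ?, ?)",
    [
        ("d1", "congress", 2),
        ("d1", "gun", 1),
        ("d1", "laws", 4),
        ("d2", "gun", 3),
        ("d2", "laws", 1),
    ],
)

if __name__ == "__main__":
    print(docs_with_all_terms(conn))
```

With these rows only d1 comes back; d2 matches two of the three terms but is excluded.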