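Find all word pairs that occur together in the same document ID, and report the number of documents in which each pair occurs. Report the pairs in decreasing order of pair frequency.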
+-------+-----+-----+---------+
|vocabId|docId|count|     word|
+-------+-----+-----+---------+
|      1|    1| 1000|    plane|
|      1|    3|  100|    plane|
|      3|    1| 1200|motorbike|
|      3|    2|  702|motorbike|
|      3|    3|  600|motorbike|
|      5|    3| 2000|     boat|
|      5|    2|  200|     boat|
+-------+-----+-----+---------+
I used this query, but it gives me the wrong result:
select r1.word,r2.word, count(*)
from result_T r1
JOIN result_T r2 ON r1.docId = r2.docId
and r1.word = r2.word group by r1.word, r2.word
Expected output:
boat, motorbike, 2
motorbike, plane, 2
boat, plane, 1
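Answer 0 (score: 1)
You are on the right track with a self-join, but the join logic needs a small change. Joining on r1.word = r2.word only ever pairs a word with itself. Instead, the join condition should require that the first word be lexicographically less than the second. This ensures that each pair is counted only once and that no word is paired with itself. The document IDs must also match, which you are already checking.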
SELECT
r1.word,
r2.word,
COUNT(*) AS cnt
FROM result_T r1
INNER JOIN result_T r2
ON r1.word < r2.word AND
r1.docId = r2.docId
GROUP BY
r1.word,
r2.word
ORDER BY
COUNT(*) DESC;
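To check the query against the sample data, here is a minimal sketch that loads the rows from the question into an in-memory SQLite database and runs the same self-join. SQLite and the helper name word_pair_counts are assumptions made for illustration; the question does not say which engine is in use. The extra ORDER BY columns are only there to make the order of tied pairs deterministic.

```python
import sqlite3

# Sample rows from the question: (vocabId, docId, count, word).
ROWS = [
    (1, 1, 1000, "plane"),
    (1, 3, 100, "plane"),
    (3, 1, 1200, "motorbike"),
    (3, 2, 702, "motorbike"),
    (3, 3, 600, "motorbike"),
    (5, 3, 2000, "boat"),
    (5, 2, 200, "boat"),
]

# Same join as in the answer; the tie-breaking ORDER BY columns are an addition.
PAIR_QUERY = """
SELECT r1.word, r2.word, COUNT(*) AS cnt
FROM result_T r1
INNER JOIN result_T r2
    ON r1.word < r2.word AND r1.docId = r2.docId
GROUP BY r1.word, r2.word
ORDER BY cnt DESC, r1.word, r2.word
"""


def word_pair_counts(rows):
    """Hypothetical helper: return (word1, word2, document_count) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        'CREATE TABLE result_T '
        '(vocabId INTEGER, docId INTEGER, "count" INTEGER, word TEXT)'
    )
    conn.executemany("INSERT INTO result_T VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(PAIR_QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for pair in word_pair_counts(ROWS):
        print(pair)
    # ('boat', 'motorbike', 2)
    # ('motorbike', 'plane', 2)
    # ('boat', 'plane', 1)
```

This matches the expected output in the question. Note that COUNT(*) equals the number of documents only if each (docId, word) combination appears at most once in the table, as it does in the sample.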
Answer 1 (score: 0)
Try the following query:
declare @tbl table (docId int, word varchar(20));
insert into @tbl values
( 1,'plane'),
( 3,'plane'),
( 1,'motorbike'),
( 2,'motorbike'),
( 3,'motorbike'),
( 3,'boat'),
( 2,'boat');
select words, count(*) from (
select distinct t1.docId,
case when t1.word < t2.word then t1.word else t2.word end + ',' +
case when t1.word >= t2.word then t1.word else t2.word end words
from @tbl t1
join @tbl t2 on t1.docId = t2.docId and t1.word <> t2.word
) a group by words
order by count(*) desc
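The role of the DISTINCT in the subquery can be seen with a small experiment. The sketch below ports the query to SQLite, which is an assumption: a regular table named tbl replaces the table variable @tbl, and || replaces + for string concatenation. The helper name pair_counts_distinct is hypothetical. Because each (docId, pair) combination is kept only once, a duplicated input row does not inflate the counts.

```python
import sqlite3

# (docId, word) rows as inserted in the answer.
ROWS = [
    (1, "plane"), (3, "plane"),
    (1, "motorbike"), (2, "motorbike"), (3, "motorbike"),
    (3, "boat"), (2, "boat"),
]

# SQLite port of the query above; the final ORDER BY column is an
# addition that makes the order of tied pairs deterministic.
QUERY = """
SELECT words, COUNT(*) FROM (
    SELECT DISTINCT t1.docId,
        CASE WHEN t1.word < t2.word THEN t1.word ELSE t2.word END || ',' ||
        CASE WHEN t1.word >= t2.word THEN t1.word ELSE t2.word END AS words
    FROM tbl t1
    JOIN tbl t2 ON t1.docId = t2.docId AND t1.word <> t2.word
) a
GROUP BY words
ORDER BY COUNT(*) DESC, words
"""


def pair_counts_distinct(rows):
    """Hypothetical helper: return ('word1,word2', document_count) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tbl (docId INTEGER, word TEXT)")
    conn.executemany("INSERT INTO tbl VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(pair_counts_distinct(ROWS))
    # Adding a duplicate (3, 'boat') row leaves the result unchanged.
    print(pair_counts_distinct(ROWS + [(3, "boat")]))
```

Both calls print the same three pairs with counts 2, 2 and 1. The trade-off is that the pair is returned as a single comma-joined string instead of two columns.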