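How to find word pairs in the table below

Time: 2019-05-29 05:14:49

Tags: sql-server database

Find all pairs of frequent words that occur in the same document ID, and report the number of documents in which each pair occurs. Report the pairs in decreasing order of frequency.

  • Note that there should not be any duplicate entries, such as: (truck, boat) (truck, boat)
  • Note that the same pair should not appear twice in opposite order. Only one of the following should occur: (truck, boat) (boat, truck)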
+-------+-----+-----+---------+
|vocabId|docId|count|     word|
+-------+-----+-----+---------+
|      1|    1| 1000|    plane|
|      1|    3|  100|    plane|
|      3|    1| 1200|motorbike|
|      3|    2|  702|motorbike|
|      3|    3|  600|motorbike|
|      5|    3| 2000|     boat|
|      5|    2|  200|     boat|
+-------+-----+-----+---------+

I used this query, but it gives me the wrong result:

select r1.word,r2.word, count(*) 
from result_T r1 
JOIN result_T r2 ON r1.docId = r2.docId 
and r1.word = r2.word group by r1.word, r2.word

Expected output:

boat, motorbike, 2
motorbike, plane, 2
boat, plane, 1

2 answers:

Answer 0: (score: 1)

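You are on the right track with a self-join, but the join logic needs a slight change. The join condition should require the first word to be lexicographically less than the second word (r1.word < r2.word) rather than equal to it. This ensures that each pair appears in only one order and that a word is never paired with itself. In addition, the document IDs must match (which you are already checking).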

SELECT
    r1.word,
    r2.word,
    COUNT(*) AS cnt
FROM result_T r1
INNER JOIN result_T r2
    ON r1.word < r2.word AND
       r1.docId = r2.docId
GROUP BY
    r1.word,
    r2.word
ORDER BY
    COUNT(*) DESC;
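
To try the query above against the sample data from the question, here is a minimal self-contained sketch. The table variable @result_T is a hypothetical stand-in for result_T, and the aliases word1 and word2 are added only for readability. COUNT(DISTINCT r1.docId) is a defensive substitution for COUNT(*), not part of the original answer; it only makes a difference if the same (docId, word) combination can appear in more than one row.

```sql
-- Hypothetical stand-in for result_T, loaded with the question's sample rows.
DECLARE @result_T TABLE (vocabId int, docId int, [count] int, word varchar(20));

INSERT INTO @result_T (vocabId, docId, [count], word) VALUES
    (1, 1, 1000, 'plane'),
    (1, 3,  100, 'plane'),
    (3, 1, 1200, 'motorbike'),
    (3, 2,  702, 'motorbike'),
    (3, 3,  600, 'motorbike'),
    (5, 3, 2000, 'boat'),
    (5, 2,  200, 'boat');

SELECT
    r1.word AS word1,
    r2.word AS word2,
    -- Count documents, not joined rows, so repeated (docId, word) rows
    -- (an assumption about possible data) cannot inflate the result.
    COUNT(DISTINCT r1.docId) AS cnt
FROM @result_T r1
INNER JOIN @result_T r2
    ON r1.docId = r2.docId     -- both words occur in the same document
   AND r1.word < r2.word       -- keep each unordered pair once, never a self-pair
GROUP BY
    r1.word,
    r2.word
ORDER BY
    cnt DESC;
```

On the sample rows this should give the three pairs listed in the question's expected output: (boat, motorbike) and (motorbike, plane) with a count of 2, and (boat, plane) with a count of 1.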



Answer 1: (score: 0)

Try the following query:

declare @tbl table (docId int, word varchar(20));
insert into @tbl values 
( 1,'plane'),
( 3,'plane'),
( 1,'motorbike'),
( 2,'motorbike'),
( 3,'motorbike'),
( 3,'boat'),
( 2,'boat');

select words, count(*) from (
    select distinct t1.docId,
           case when t1.word < t2.word then t1.word else t2.word end + ',' +
           case when t1.word >= t2.word then t1.word else t2.word end words
    from @tbl t1
    join @tbl t2 on t1.docId = t2.docId and t1.word <> t2.word
) a group by words
order by count(*) desc
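
The query above joins the two words into a single comma-separated string. As a variation, the sketch below keeps the same normalize-then-DISTINCT idea but returns the pair as two separate columns, which avoids ambiguity if a word could itself contain a comma. The names @docWords, word1, word2, and pairs are hypothetical and chosen only for illustration.

```sql
-- Hypothetical table variable holding the question's (docId, word) rows.
DECLARE @docWords TABLE (docId int, word varchar(20));

INSERT INTO @docWords (docId, word) VALUES
    (1, 'plane'), (3, 'plane'),
    (1, 'motorbike'), (2, 'motorbike'), (3, 'motorbike'),
    (3, 'boat'), (2, 'boat');

SELECT word1, word2, COUNT(*) AS cnt
FROM (
    -- Normalize each unordered pair so the smaller word is always word1.
    -- DISTINCT then collapses (a, b) and (b, a) from the same document
    -- into a single row before counting.
    SELECT DISTINCT
           t1.docId,
           CASE WHEN t1.word < t2.word THEN t1.word ELSE t2.word END AS word1,
           CASE WHEN t1.word < t2.word THEN t2.word ELSE t1.word END AS word2
    FROM @docWords t1
    JOIN @docWords t2
        ON t1.docId = t2.docId
       AND t1.word <> t2.word
) AS pairs
GROUP BY word1, word2
ORDER BY cnt DESC;
```

Because the join uses <> rather than <, every pair is produced in both orders and then deduplicated, so this form does somewhat more work than the r1.word < r2.word join in the other answer.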