How to count occurrences of words from one table in comments from another table

Time: 2017-10-22 06:33:52

Tags: sql google-bigquery standard-sql

I am trying to accomplish a task in Google BigQuery that may require logic I'm not sure SQL can handle natively.

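I have two tables:

  1. The first table has a single column, where each row is one lowercase word
  2. The second table is a database of comments (containing the comment itself, a timestamp, and other data)
  3. I want to sort the comments in the second table by the number of occurrences of words from the first table.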

    Here is a basic example of what I want to do, in Python, using letters instead of words... but you get the idea:

    words = ['a','b','c','d','e']
    
    comments = ['this is the first sentence', 'this is another comment', 'look another sentence, which is also a comment', 'nope', 'no', 'run']
    
    wordcount = {}
    
    for comment in comments:
        for word in words:
            if word in comment:
                if comment in wordcount:
                    wordcount[comment] += 1
                else:
                    wordcount[comment] = 1
    
    print(sorted(wordcount.items(), key = lambda k: k[1], reverse=True))
    

    Output:

    [('look another sentence, which is also a comment', 3), ('this is another comment', 3), ('this is the first sentence', 2), ('nope', 1)]
    

    So far, the best approach I have seen for writing this as an SQL query is something like the following:

    SELECT
        COUNT(*)
    FROM
        table
    WHERE
        comment_col like '%word1%'
        OR comment_col like '%word2%'
        OR ...
    

But there are over 2,000 words... it just doesn't feel right. Any tips?

2 answers:

Answer 0 (score: 2)

Below is for BigQuery Standard SQL:

  
#standardSQL
SELECT comment, COUNT(word) AS cnt
FROM comments
JOIN words
ON STRPOS(comment, word) > 0 
GROUP BY comment
-- ORDER BY cnt DESC  

If you prefer, you can use a regular expression:

#standardSQL
SELECT comment, COUNT(word) AS cnt
FROM comments
JOIN words
ON REGEXP_CONTAINS(comment, word)
GROUP BY comment
-- ORDER BY cnt DESC  
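
One difference between the two variants is worth keeping in mind: the substring version treats each word as literal text, while the regular-expression version treats it as a pattern, so a word containing a character such as `.` can match more than intended. The sketch below illustrates this in Python rather than BigQuery. The helper names, the sample word and the sample comment are all hypothetical, and Python's `re` module is only a stand-in for BigQuery's regex engine, so treat it as an illustration of the idea, not of exact BigQuery behavior.

```python
import re

def contains_literal(comment, word):
    """Literal substring test, analogous to STRPOS(comment, word) > 0."""
    return word in comment

def contains_regex(comment, word):
    """Pattern test, analogous to REGEXP_CONTAINS(comment, word)."""
    return re.search(word, comment) is not None

def contains_escaped(comment, word):
    """Pattern test with the word escaped so it is matched literally."""
    return re.search(re.escape(word), comment) is not None

# Hypothetical data: the '.' in the word matches any character as a regex.
comment = "version 305 released"
word = "3.5"

literal_hit = contains_literal(comment, word)
regex_hit = contains_regex(comment, word)
escaped_hit = contains_escaped(comment, word)
print(literal_hit, regex_hit, escaped_hit)
```

If the word list contains only plain lowercase letters, as described in the question, the two variants should agree.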

You can test / play with the above using the dummy example from the question:
#standardSQL
WITH words AS (
  SELECT word
  FROM UNNEST(['a','b','c','d','e']) word
),
comments AS (
  SELECT comment 
  FROM UNNEST(['this is the first sentence', 'this is another comment', 'look another sentence, which is also a comment', 'nope', 'no', 'run']) comment
)
SELECT comment, COUNT(word) AS cnt
FROM comments
JOIN words
ON STRPOS(comment, word) > 0 
GROUP BY comment
ORDER BY cnt DESC 
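
For readers without BigQuery access, the same join can be tried locally. The following is a minimal sketch that assumes SQLite (through Python's standard `sqlite3` module) as a stand-in engine, with SQLite's `INSTR` playing the role of `STRPOS`. The function name `count_matches` and the extra `comment` tiebreaker in `ORDER BY` are additions for this sketch, not part of the answer's query.

```python
import sqlite3

# Sample data from the question.
words = ['a', 'b', 'c', 'd', 'e']
comments = [
    'this is the first sentence',
    'this is another comment',
    'look another sentence, which is also a comment',
    'nope',
    'no',
    'run',
]

def count_matches(words, comments):
    """Return (comment, cnt) rows, highest count first, via a substring join."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE words (word TEXT)")
    conn.execute("CREATE TABLE comments (comment TEXT)")
    conn.executemany("INSERT INTO words VALUES (?)", [(w,) for w in words])
    conn.executemany("INSERT INTO comments VALUES (?)", [(c,) for c in comments])
    # Assumption: INSTR(comment, word) > 0 stands in for STRPOS(comment, word) > 0.
    # The secondary sort on comment only makes the order of ties deterministic.
    rows = conn.execute(
        """
        SELECT comment, COUNT(word) AS cnt
        FROM comments
        JOIN words ON INSTR(comment, word) > 0
        GROUP BY comment
        ORDER BY cnt DESC, comment
        """
    ).fetchall()
    conn.close()
    return rows

result = count_matches(words, comments)
for row in result:
    print(row)
```

Because this is an inner join, comments that contain none of the words ('no' and 'run' here) do not appear in the result, just as in the Python loop from the question.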

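Update:

"Any quick suggestions for doing only full-string (whole-word) matches?"

The query below wraps each word in \b word-boundary anchors, so a word only matches when it appears as a whole word: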

#standardSQL
WITH words AS (
  SELECT word
  FROM UNNEST(['a','no','is','d','e']) word
),
comments AS (
  SELECT comment 
  FROM UNNEST(['this is the first sentence', 'this is another comment', 'look another sentence, which is also a comment', 'nope', 'no', 'run']) comment
)
SELECT comment, COUNT(word) AS cnt
FROM comments
JOIN words
ON REGEXP_CONTAINS(comment, CONCAT(r'\b', word, r'\b')) 
GROUP BY comment
ORDER BY cnt DESC
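
To check what the whole-word version should return on this sample data, the same logic can be sketched in Python. This is an approximation under stated assumptions: Python's `re` module stands in for BigQuery's regex engine, the function name `whole_word_counts` is hypothetical, and `re.escape` is an extra safeguard that the query above does not apply.

```python
import re

# Sample data from the updated query above.
words = ['a', 'no', 'is', 'd', 'e']
comments = [
    'this is the first sentence',
    'this is another comment',
    'look another sentence, which is also a comment',
    'nope',
    'no',
    'run',
]

def whole_word_counts(words, comments):
    """Count, per comment, how many words occur as whole words."""
    counts = {}
    for comment in comments:
        n = sum(
            1
            for word in words
            # \b matches at a word boundary; re.escape keeps the word literal.
            if re.search(r'\b' + re.escape(word) + r'\b', comment)
        )
        if n:  # like the inner join, skip comments with no matches
            counts[comment] = n
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

whole_word_result = whole_word_counts(words, comments)
for row in whole_word_result:
    print(row)
```

With whole-word matching, 'nope' no longer counts as containing 'no', and the single letters only match when they stand alone, such as the 'a' in 'also a comment'.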

Answer 1 (score: 1)

If I understand correctly, I think you need a query like this:

select comment, count(*) cnt
from comments
join words
  on comment like '% ' + word + ' %'   --this checks for `... word ..`; a word between spaces
  or comment like word + ' %'          --this checks for `word ..`; a word at the start of comment
  or comment like '% ' + word          --this checks for `.. word`; a word at the end of comment
  or comment = word                    --this checks for `word`; whole comment is the word
group by comment
order by count(*) desc

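SQL Server Fiddle demo as a sample. Note that this query uses SQL Server's `+` for string concatenation; in BigQuery Standard SQL you would use CONCAT or `||` instead.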
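
One consequence of delimiting words by spaces is that a word followed by punctuation is not counted. The sketch below mirrors the four LIKE conditions in Python and compares them with a word-boundary regex. The helper names are hypothetical, the comparison is assumed to be case-sensitive, and SQL wildcard characters inside a word are ignored, so this is an illustration of the idea and not a faithful emulation of SQL Server.

```python
import re

def like_match(comment, word):
    """Mirror the four LIKE conditions: word delimited by spaces or string ends."""
    return (
        f" {word} " in comment                # '% word %'
        or comment.startswith(word + " ")     # 'word %'
        or comment.endswith(" " + word)       # '% word'
        or comment == word                    # whole comment is the word
    )

def boundary_match(comment, word):
    """Whole-word test using regex word boundaries instead of spaces."""
    return re.search(r'\b' + re.escape(word) + r'\b', comment) is not None

comment = 'look another sentence, which is also a comment'

# 'sentence' is followed by a comma, so the space-delimited check misses it.
like_sentence = like_match(comment, 'sentence')
boundary_sentence = boundary_match(comment, 'sentence')

# 'comment' ends the string after a space, so both checks find it.
like_comment = like_match(comment, 'comment')
boundary_comment = boundary_match(comment, 'comment')

print(like_sentence, boundary_sentence, like_comment, boundary_comment)
```

If comments are known to contain punctuation, the word-boundary approach from the first answer may be the safer choice.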