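How do I search for rows containing specific words and then return a count for each word?

Time: 2016-10-07 09:18:54

Tags: sql google-bigquery

I have 150,000 rows of data that I am trying to query in Google BigQuery.

The Text column contains text of varying length, and I want to query it for specific keywords.

I already know that the query below returns all rows containing a specific keyword (for example, facebook):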

SELECT Text FROM Data.Set_1
WHERE Text CONTAINS 'facebook'

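Questions:

1) How can I improve the query so that it returns, in a new column, the total count of all matches of the keyword 'facebook' across 'Text'?

2) How can I extend this to multiple keywords (facebook, cnn, bbc, twitter) and return the total for each keyword present in the data (for example facebook 42, cnn 54, bbc 88, twitter 49)?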

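2 answers:

Answer 0: (score: 0)

You can use a derived table that contains all the words you are looking for, and then use aggregation to count the matches: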

SELECT w.keyword, COUNT(s.Text)
FROM (SELECT 'facebook' AS keyword UNION ALL
      SELECT 'cnn'
     ) w LEFT JOIN
     Data.Set_1 s
     ON s.Text CONTAINS w.keyword
GROUP BY w.keyword;

Note: this is not particularly efficient. Performance should be roughly linear in the number of keywords.
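
To make the logic of this answer concrete, here is a minimal Python sketch of the same idea, not BigQuery code. The sample rows and the function name `rows_per_keyword` are hypothetical and chosen only for illustration. It counts the rows that contain each keyword (not the total number of occurrences inside a row), and it keeps keywords with no match at 0, which is what the LEFT JOIN with COUNT(s.Text) is meant to do.

```python
# Hypothetical sample rows standing in for the Text column of Data.Set_1.
texts = [
    "facebook and cnn both covered it",
    "seen on facebook",
    "no keyword here",
]
keywords = ["facebook", "cnn", "bbc"]


def rows_per_keyword(texts, keywords):
    """Count, for each keyword, how many rows contain it at least once."""
    # Start every keyword at 0 so unmatched keywords still appear,
    # mirroring the LEFT JOIN from the keyword list.
    counts = {kw: 0 for kw in keywords}
    for kw in keywords:        # one pass over the data per keyword
        for text in texts:
            if kw in text:     # plain substring test, similar in spirit to CONTAINS
                counts[kw] += 1
    return counts


result = rows_per_keyword(texts, keywords)
print(result)
```

The nested loop also shows why the cost grows roughly linearly with the number of keywords: every keyword is tested against every row.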

Answer 1: (score: 0)

For BigQuery Legacy SQL:

SELECT 
  keyword, 
  COUNT(1) AS rows, 
  SUM(INTEGER((LENGTH(Text) - LENGTH(REPLACE(Text, keyword, ''))) / LENGTH(keyword))) AS occurences 
FROM YourTable 
CROSS JOIN keywords
WHERE Text CONTAINS keyword
GROUP BY keyword

Usage example:

SELECT 
  keyword, 
  COUNT(1) AS rows, 
  SUM(INTEGER((LENGTH(Text) - LENGTH(REPLACE(Text, keyword, ''))) / LENGTH(keyword))) AS occurences 
FROM (
  SELECT Text FROM
    (SELECT 'facebookfacebookcnnbbccnn' AS Text),
    (SELECT 'facebook' AS Text), 
    (SELECT 'cnn' AS Text)
) AS words 
CROSS JOIN (
  SELECT keyword FROM 
    (SELECT 'facebook' AS keyword),
    (SELECT 'cnn' AS keyword), 
    (SELECT 'bbc' AS keyword)
) AS keywords
WHERE Text CONTAINS keyword
GROUP BY keyword
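
The `occurences` column relies on a length-difference trick: remove every copy of the keyword from the text, see how many characters disappeared, and divide by the keyword length. A small Python sketch of that arithmetic, using the same sample values as the query above, may help; the helper name `count_occurrences` and the `summary` dictionary are illustrative assumptions, not part of the answer.

```python
def count_occurrences(text, keyword):
    """Occurrences of keyword in text, computed via the length difference."""
    # Characters removed when every copy of the keyword is deleted.
    removed = len(text) - len(text.replace(keyword, ""))
    # Each deleted copy accounts for len(keyword) characters.
    return removed // len(keyword)


texts = ["facebookfacebookcnnbbccnn", "facebook", "cnn"]
keywords = ["facebook", "cnn", "bbc"]

summary = {}
for kw in keywords:
    # Keep only rows that contain the keyword, like the WHERE filter.
    matching = [t for t in texts if kw in t]
    summary[kw] = {
        "rows": len(matching),
        "occurrences": sum(count_occurrences(t, kw) for t in matching),
    }

print(summary)
# {'facebook': {'rows': 2, 'occurrences': 3},
#  'cnn': {'rows': 2, 'occurrences': 3},
#  'bbc': {'rows': 1, 'occurrences': 1}}
```

Because the replacement removes matches one after another, the trick counts non-overlapping occurrences, the same quantity Python's `str.count` returns.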

For BigQuery Standard SQL (see Enabling Standard SQL):

SELECT 
  keyword, 
  COUNT(1) AS `rows`, 
  SUM((LENGTH(Text) - LENGTH(REPLACE(Text, keyword, ''))) / LENGTH(keyword)) AS occurences  
FROM YourTable 
JOIN keywords
ON STRPOS(Text, keyword) > 0
GROUP BY keyword

Usage example:

WITH keywords AS (
  SELECT 'facebook' AS keyword UNION ALL
  SELECT 'cnn' AS keyword UNION ALL
  SELECT 'bbc' AS keyword 
),
words AS (
  SELECT 'facebookfacebookcnnbbccnn' AS Text UNION ALL
  SELECT 'facebook' AS Text UNION ALL
  SELECT 'cnn' AS Text 
)
SELECT 
  keyword, 
  COUNT(1) AS `rows`, 
  SUM((LENGTH(Text) - LENGTH(REPLACE(Text, keyword, ''))) / LENGTH(keyword)) AS occurences  
FROM words 
JOIN keywords
ON STRPOS(Text, keyword) > 0
GROUP BY keyword