I have a large pile of comments from reddit. I split each string into words, strip the punctuation, and count occurrences to show the most frequently used words in a particular subreddit:
SELECT word, COUNT(*) as num_words
FROM(FLATTEN((
SELECT SPLIT(LOWER(REGEXP_REPLACE(body, r'[\.\",*:()\[\]/|\n]', ' ')), ' ') word
FROM [fh-bigquery:reddit_comments.2017_08]
WHERE subreddit="The_Donald"
), word))
GROUP EACH BY word
HAVING num_words >= 1000
ORDER BY num_words DESC
I have a list of stop words I want to remove — how can I add it to this query? Thanks :))
Answer 0 (score: 4)
The example below is for BigQuery Legacy SQL (which your question appears to use):
#legacySQL
SELECT word, COUNT(*) AS num_words
FROM(FLATTEN((
SELECT SPLIT(LOWER(REGEXP_REPLACE(body, r'[\.\",*:()\[\]/|\n]', ' ')), ' ') word
FROM [fh-bigquery:reddit_comments.2017_08]
WHERE subreddit="The_Donald"
), word))
WHERE NOT word IN (
'the','to','a','and'
)
GROUP EACH BY word
HAVING num_words >= 1000
ORDER BY num_words DESC
The BigQuery team strongly recommends using Standard SQL, so if you decide to migrate, below is the same example in Standard SQL.
It assumes you have your stop words in a your_project.your_dataset.stop_words table:
#standardSQL
SELECT word, COUNT(*) AS num_words
FROM `fh-bigquery.reddit_comments.2017_08`,
UNNEST(SPLIT(LOWER(REGEXP_REPLACE(body, r'[\.\",*:()\[\]/|\n]', ' ')), ' ')) word
WHERE subreddit="The_Donald"
AND word NOT IN (SELECT stop_word FROM `your_project.your_dataset.stop_words`)
GROUP BY word
HAVING num_words >= 1000
AND word != ''
ORDER BY num_words DESC
You can test/play with it using the dummy data below:
#standardSQL
WITH `your_project.your_dataset.stop_words` AS (
SELECT stop_word
FROM UNNEST(['the','to','a','and']) stop_word
)
SELECT word, COUNT(*) AS num_words
FROM `fh-bigquery.reddit_comments.2017_08`,
UNNEST(SPLIT(LOWER(REGEXP_REPLACE(body, r'[\.\",*:()\[\]/|\n]', ' ')), ' ')) word
WHERE subreddit="The_Donald"
AND word NOT IN (SELECT stop_word FROM `your_project.your_dataset.stop_words`)
GROUP BY word
HAVING num_words >= 1000
AND word != ''
ORDER BY num_words DESC
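If you want to sanity-check the tokenization and stop-word filtering logic locally before running it against BigQuery, here is a minimal Python sketch of the same pipeline (same punctuation regex, lowercasing, split on spaces, stop-word filter, and count threshold); the `top_words` function name and the tiny sample comments are made up for illustration:

```python
import re
from collections import Counter

# Same stop words and punctuation class as in the query above.
STOP_WORDS = {'the', 'to', 'a', 'and'}
PUNCT = re.compile(r'[\.\",*:()\[\]/|\n]')

def top_words(comments, min_count=2):
    """Replace punctuation with spaces, lowercase, split on single
    spaces, drop empty tokens and stop words, and keep words that
    appear at least min_count times, most frequent first."""
    counts = Counter()
    for body in comments:
        words = PUNCT.sub(' ', body).lower().split(' ')
        counts.update(w for w in words if w and w not in STOP_WORDS)
    return [(w, n) for w, n in counts.most_common() if n >= min_count]

comments = ["The cat, and the dog.", "A dog and a cat: friends."]
print(top_words(comments))  # [('cat', 2), ('dog', 2)]
```

Note that, like the query, splitting on a single space produces empty tokens wherever punctuation was removed, which is why both the SQL (`word != ''`) and the sketch filter them out.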