I am trying to get posts about Asperger's, along with their comments, from the AskDocs subreddit. This SQL works well for fetching the posts:
#standardSQL
SELECT
TIMESTAMP_TRUNC(TIMESTAMP_SECONDS(created_utc), MONTH, 'America/New_York') AS date_agg,
title,
selftext
FROM
`fh-bigquery.reddit_posts.*`
WHERE
(_TABLE_SUFFIX BETWEEN "2016_01" AND "2019_03" OR _TABLE_SUFFIX = 'full_corpus_201512')
AND subreddit = 'AskDocs'
AND REGEXP_CONTAINS(selftext, r'Asperger')
ORDER BY
date_agg
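However, I am not sure whether this returns every available post. I get 169 rows, and I want as much material on this topic from AskDocs as possible.
My second question is about linking each post to its comments. I found this query: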
#standardSQL
SELECT posts.title, comments.body
FROM `fh-bigquery.reddit_comments.2016_01` AS comments
JOIN `fh-bigquery.reddit_posts.2016_01` AS posts
ON posts.id = SUBSTR(comments.link_id, 4)
WHERE posts.id = '43go1r'
But when I try to combine the two queries, I end up with a mess.
Answer 0 (score: 0)
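Your first query returns 169 rows because the regular expression uses an uppercase A, and REGEXP_CONTAINS is case-sensitive. Only posts whose selftext contains Asperger, Asperger's, Aspergers and so on are matched. Posts whose selftext contains only the lowercase forms asperger, asperger's or aspergers are not listed. To match both cases, use [aA] in the regular expression, which returns 241 rows:
AND REGEXP_CONTAINS(selftext, r'[aA]sperger')
To join the tables, you can use the following query: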
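The difference between the patterns can be checked without touching the Reddit tables. The following is a minimal sketch that runs the patterns against a few inline sample strings (the strings are made up for illustration). It also shows the RE2 flag (?i), which is an alternative I am suggesting, not something the original query used; it additionally matches fully uppercase spellings that [aA] would miss.

```sql
#standardSQL
-- Self-contained check: reads only inline sample strings, not the Reddit tables.
SELECT
  s,
  REGEXP_CONTAINS(s, r'Asperger')     AS matches_original,         -- uppercase A only
  REGEXP_CONTAINS(s, r'[aA]sperger')  AS matches_char_class,       -- first letter in either case
  REGEXP_CONTAINS(s, r'(?i)asperger') AS matches_case_insensitive  -- any mix of cases
FROM
  UNNEST(["I have Asperger's", "my son has aspergers", "ASPERGER SYNDROME"]) AS s
```

If you also want posts that mention the word only in the title, one option is to add OR REGEXP_CONTAINS(title, r'(?i)asperger') inside parentheses next to the selftext condition. That would change the row count, so treat it as a variation rather than a drop-in replacement.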
WITH
comments AS (
SELECT
link_id,
body
FROM
`fh-bigquery.reddit_comments.201*`
WHERE
_TABLE_SUFFIX BETWEEN "6_01"
AND "9_03"
AND subreddit = 'AskDocs' ),
posts AS (
SELECT
TIMESTAMP_TRUNC(TIMESTAMP_SECONDS(created_utc), MONTH, 'America/New_York') AS date_agg,
id,
selftext,
title
FROM
`fh-bigquery.reddit_posts.*`
WHERE
(_TABLE_SUFFIX BETWEEN "2016_01"
AND "2019_03"
OR _TABLE_SUFFIX = 'full_corpus_201512')
AND subreddit = 'AskDocs'
AND REGEXP_CONTAINS(selftext, r'[aA]sperger') )
SELECT
posts.date_agg AS Date,
posts.title AS Post,
posts.selftext AS Text,
comments.body AS Comment
FROM
comments
JOIN
posts
ON
posts.id = SUBSTR(comments.link_id, 4)
ORDER BY
Date,
Post
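Note: I used a different wildcard for each dataset because the table names are not the same in the two datasets, and filtering on _TABLE_SUFFIX keeps the query from scanning more tables than it needs.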
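The join condition works because each comment's link_id is the post id with a type prefix in front, which is why the query from the question strips the first three characters. A minimal sketch of that step, assuming link_id values look like 't3_43go1r' (built from the post id used in the question):

```sql
#standardSQL
-- SUBSTR is 1-based, so position 4 is the first character after the 't3_' prefix.
SELECT
  link_id,
  SUBSTR(link_id, 4) AS post_id  -- expected to equal posts.id
FROM
  UNNEST(['t3_43go1r']) AS link_id
```

Note that the query above uses an inner JOIN, so posts that have no comments in the selected comment tables do not appear in the result. If you want to keep them, one option is to select FROM posts LEFT JOIN comments instead, in which case Comment is NULL for those posts.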