BigQuery: filter on selftext to find every post in a subreddit that contains a given word

Time: 2019-07-04 10:33:41

Tags: google-bigquery reddit

I am trying to get the posts about Asperger's in the AskDocs subreddit, together with their comments. This SQL works well for getting the posts:

#standardSQL
SELECT
  TIMESTAMP_TRUNC(TIMESTAMP_SECONDS(created_utc), MONTH, 'America/New_York') AS date_agg, title,selftext

FROM
  `fh-bigquery.reddit_posts.*`
WHERE
  (_TABLE_SUFFIX BETWEEN "2016_01" AND "2019_03" OR _TABLE_SUFFIX = 'full_corpus_201512')
  AND subreddit = 'AskDocs'
  AND REGEXP_CONTAINS(selftext, r'Asperger')

ORDER BY
  date_agg

However, I am not sure whether this returns all the available posts. I get 169 rows, but I am trying to get as much as possible on this topic from AskDocs.

My second problem is that I am trying to link each post with its comments, and I found this:

#standardSQL
SELECT posts.title, comments.body
FROM `fh-bigquery.reddit_comments.2016_01` AS comments
JOIN `fh-bigquery.reddit_posts.2016_01`  AS posts
ON posts.id = SUBSTR(comments.link_id, 4) 
WHERE posts.id = '43go1r'

But when I try to combine the two pieces of code, I end up with a real mess.

1 answer:

Answer 0: (score: 0)

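For the first query, you get 169 rows because the regular expression uses a capital A, so only posts whose selftext contains the capitalized forms, such as Asperger, Asperger's, Aspergers, are listed. Posts that contain only the lowercase forms asperger, asperger's, aspergers are not listed, because REGEXP_CONTAINS is case-sensitive. To match the lowercase forms as well, use [aA] in the regular expression, which returns 241 rows: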

AND REGEXP_CONTAINS(selftext, r'[aA]sperger')
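
To see the effect of the character class without scanning the Reddit tables, the patterns can be run against a few inline strings. This is only a sketch: the sample strings are made up for illustration, and the `(?i)` flag in the last column is an alternative that is not part of the query above. It is RE2 syntax for case-insensitive matching, so it would also match all-caps spellings.

```sql
#standardSQL
-- Sample strings are hypothetical, not taken from the dataset.
SELECT
  sample,
  REGEXP_CONTAINS(sample, r'Asperger')     AS capital_only,         -- matches only a capital A
  REGEXP_CONTAINS(sample, r'[aA]sperger')  AS either_first_letter,  -- matches A or a
  REGEXP_CONTAINS(sample, r'(?i)asperger') AS case_insensitive      -- ignores case entirely
FROM
  UNNEST(["I have Asperger's", 'my son has aspergers', 'ASPERGER SYNDROME']) AS sample
```

Only the first string satisfies the original pattern, the first two satisfy `[aA]sperger`, and all three satisfy the `(?i)` variant.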

To join the tables, you can use the following query:

WITH
  comments AS (
  SELECT
    link_id,
    body
  FROM
    `fh-bigquery.reddit_comments.201*`
  WHERE
    _TABLE_SUFFIX BETWEEN "6_01"
    AND "9_03"
    AND subreddit = 'AskDocs' ),
  posts AS (
  SELECT
    TIMESTAMP_TRUNC(TIMESTAMP_SECONDS(created_utc), MONTH, 'America/New_York') AS date_agg,
    id,
    selftext,
    title
  FROM
    `fh-bigquery.reddit_posts.*`
  WHERE
    (_TABLE_SUFFIX BETWEEN "2016_01"
      AND "2019_03"
      OR _TABLE_SUFFIX = 'full_corpus_201512')
    AND subreddit = 'AskDocs'
    AND REGEXP_CONTAINS(selftext, r'[aA]sperger') )
SELECT
  posts.date_agg AS Date,
  posts.title AS Post,
  posts.selftext AS Text,
  comments.body AS Comment
FROM
  comments
JOIN
  posts
ON
  posts.id = SUBSTR(comments.link_id, 4)
ORDER BY
  Date,
  Post
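
The join condition works because a comment's link_id is the post id with a `t3_` prefix, so `SUBSTR(link_id, 4)` drops the first three characters. A minimal check on a literal value, reusing the post id from the question:

```sql
#standardSQL
-- SUBSTR is 1-based: position 4 is the first character after 't3_'.
SELECT
  link_id,
  SUBSTR(link_id, 4) AS post_id   -- '43go1r'
FROM
  UNNEST(['t3_43go1r']) AS link_id
```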

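Note: I used a different wildcard for each dataset because the table names are not the same in the two datasets, and filtering on _TABLE_SUFFIX limits which tables are scanned, which optimizes the query.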
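
As a sketch of how the comments filter selects tables: _TABLE_SUFFIX holds whatever the `*` matched, so with the prefix `reddit_comments.201*` a table named 2016_01 has the suffix 6_01, and BETWEEN compares these strings lexicographically. The suffix values below are illustrative only and assume monthly tables named YYYY_MM.

```sql
#standardSQL
-- Hypothetical suffixes, as they would appear after the '201' prefix.
SELECT
  suffix,
  suffix BETWEEN '6_01' AND '9_03' AS scanned
FROM
  UNNEST(['5_12', '6_01', '8_11', '9_03', '9_04']) AS suffix
```

Here 5_12 and 9_04 fall outside the range, while 6_01, 8_11 and 9_03 are inside it, which corresponds to 2016_01 through 2019_03.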