BigQuery - from each subreddit

Time: 2017-06-18 00:21:25

Tags: sql google-bigquery reddit

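I'm mining Reddit data on Google BigQuery, and I want the top 1000 posts by score for each subreddit across the whole 201704 dataset. I've tried several techniques, but because of BigQuery's limits the result is too large to return.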

select body, score, subreddit from 
  (
    select body, score, subreddit,row_number() over 
      (
        partition by subreddit order by score desc
      ) mm 
      from [fh-bigquery:reddit_comments.2017_04]
  )
  where mm <= 1000 AND subreddit in 
  (
    select subreddit from 
    (
      select Count(subreddit) as counts, subreddit from 
      [fh-bigquery:reddit_comments.2017_04] GROUP BY subreddit ORDER BY counts DESC 
      LIMIT 10000
    )
  )
LIMIT 10000000

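Is there a way to divide and conquer this problem, given that enabling large query results means complex queries cannot be run? Does Google offer a paid option for larger query resources?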

1 answer:

Answer 0: (score: 4)

  

I want the top 1000 posts by score for each subreddit across the whole 201704 dataset

I just tested this query:

SELECT 
  subreddit,
  ARRAY_AGG(STRUCT(body, score) ORDER BY score DESC LIMIT 1000) data
FROM `fh-bigquery.reddit_comments.2017_04`
GROUP BY 1

It processed the whole dataset in 22 seconds.
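The query above returns one row per subreddit, with the top comments packed into an array. If one row per comment is easier to work with, the array can be expanded again with UNNEST. This is a minimal sketch in standard SQL that reuses the same query; the aliases `t` and `c` are illustrative names, not part of the original answer, and the flattened output can be large again.

```sql
-- Sketch: flatten the per-subreddit arrays back into one row per comment.
-- The aliases t and c are illustrative, not from the original answer.
SELECT
  t.subreddit,
  c.body,
  c.score
FROM (
  SELECT
    subreddit,
    ARRAY_AGG(STRUCT(body, score) ORDER BY score DESC LIMIT 1000) AS data
  FROM `fh-bigquery.reddit_comments.2017_04`
  GROUP BY subreddit
) AS t
CROSS JOIN UNNEST(t.data) AS c
```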

In your query, it looks like you want the posts and scores for the 10000 most popular subreddits. I tried this query:

SELECT 
  subreddit,
  ARRAY_AGG(STRUCT(body, score) ORDER BY score DESC LIMIT 1000) data
FROM `fh-bigquery.reddit_comments.2017_04`
WHERE subreddit IN (
  SELECT subreddit FROM (
    SELECT
      subreddit
    FROM `fh-bigquery.reddit_comments.2017_04`
    GROUP BY subreddit
    ORDER BY COUNT(body) DESC
    LIMIT 10000
  )
)
GROUP BY 1

and got results in 26 seconds.
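If the result is still too large to return directly, one option is to write it to a table rather than send it back to the client. The sketch below assumes you have a dataset you can write to; `mydataset.top_comments_2017_04` is a hypothetical destination name, and the filter is the same top-10000-subreddits filter as in the query above.

```sql
-- Sketch: persist the result instead of returning it to the client.
-- `mydataset.top_comments_2017_04` is a hypothetical destination table.
CREATE OR REPLACE TABLE `mydataset.top_comments_2017_04` AS
SELECT
  subreddit,
  ARRAY_AGG(STRUCT(body, score) ORDER BY score DESC LIMIT 1000) AS data
FROM `fh-bigquery.reddit_comments.2017_04`
WHERE subreddit IN (
  SELECT subreddit
  FROM `fh-bigquery.reddit_comments.2017_04`
  GROUP BY subreddit
  ORDER BY COUNT(body) DESC
  LIMIT 10000
)
GROUP BY subreddit
```

The stored table can then be queried in smaller pieces, for example one subreddit at a time.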

I hope these results are what you were looking for. Let me know if everything looks right.