BigQuery Reddit comment data analysis

Time: 2016-04-19 22:33:15

Tags: google-bigquery reddit

BigQuery - beginner

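I am trying to find pairs of users who both commented in the top 10 subreddits, together with the number of subreddits each pair has in common, using the BigQuery Reddit comments data.

I am new to BQ and a beginner in SQL, and I am finding it hard to write this query. Can someone give me some guidance?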

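1 answer:

Answer 0: (score: 2)

I have never really needed to use the reddit data, so the below is just to get you started with at least something, since nobody else seems willing to answer.

Quick logic: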

Step 1: Identify the top 10 most commented subreddits

SELECT subreddit
FROM [fh-bigquery:reddit_comments.subr_rank_201505]
ORDER BY comments DESC
LIMIT 10
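
To make the ranking step concrete, here is a minimal Python sketch of what ORDER BY comments DESC with a LIMIT does, run on a tiny in-memory table. The rows in `rank_rows` and the helper name `top_subreddits` are made up for illustration; only the `subreddit` and `comments` columns are taken from the query.

```python
# Hypothetical toy stand-in for the rank table: (subreddit, comments) rows.
rank_rows = [
    ("pics", 300),
    ("AskReddit", 400),
    ("funny", 200),
    ("news", 100),
]


def top_subreddits(rows, n):
    """Return the names of the n subreddits with the most comments."""
    # Equivalent of ORDER BY comments DESC.
    ranked = sorted(rows, key=lambda row: row[1], reverse=True)
    # Equivalent of LIMIT n, keeping only the subreddit column.
    return [subreddit for subreddit, _ in ranked[:n]]


top_two = top_subreddits(rank_rows, 2)
print(top_two)
```

In the real query the limit is 10 and the sort runs inside BigQuery. The sketch only shows the shape of the result: a single column of subreddit names.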

Step 2: For each subreddit, identify [solid] users (more than 50 comments)

SELECT author, subreddit, COUNT(1) AS comments 
FROM [fh-bigquery:reddit_comments.2016_01]  
WHERE subreddit IN (
    SELECT subreddit 
    FROM [fh-bigquery:reddit_comments.subr_rank_201505] 
    ORDER BY comments DESC 
    LIMIT 10)
AND author NOT IN ('AutoModerator', '[deleted]')
GROUP BY author, subreddit 
HAVING comments > 50 
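
The same filter-then-group logic can be sketched in plain Python on toy data, which may help if the GROUP BY / HAVING combination is unfamiliar. The comment rows, the threshold of 2 and the function name `solid_users` are assumptions for illustration; the real query uses a threshold of 50 and runs over the full table.

```python
from collections import Counter

# Hypothetical toy data: one (author, subreddit) tuple per comment.
comments = (
    [("alice", "pics")] * 3
    + [("bob", "pics")] * 2
    + [("AutoModerator", "pics")] * 5
    + [("alice", "cats")] * 4
)


def solid_users(rows, top_subreddits, min_comments,
                excluded=("AutoModerator", "[deleted]")):
    """Count comments per (author, subreddit) and keep the heavy commenters."""
    counts = Counter(
        (author, subreddit)
        for author, subreddit in rows
        # WHERE subreddit IN (...) AND author NOT IN (...)
        if subreddit in top_subreddits and author not in excluded
    )
    # HAVING comments > min_comments (strictly greater, as in the query).
    return {key: n for key, n in counts.items() if n > min_comments}


solid = solid_users(comments, top_subreddits={"pics"}, min_comments=2)
print(solid)
```

Here bob is dropped because 2 is not strictly greater than the threshold, and alice's comments in "cats" are ignored because that subreddit is not in the top list.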

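Step 3: For each subreddit, identify pairs of users who are both solid commenters there (via a self-JOIN on subreddit)

Step 4: Finally, for each pair of users, count the number of subreddits they have in common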

SELECT usera, userb, COUNT(1) AS subreddits
FROM (
  SELECT 
    a.author AS usera, 
    b.author AS userb, 
    a.subreddit AS subreddit
  FROM (
    SELECT author, subreddit, COUNT(1) AS comments FROM [fh-bigquery:reddit_comments.2016_01]
    WHERE subreddit IN (SELECT subreddit FROM [fh-bigquery:reddit_comments.subr_rank_201505] ORDER BY comments DESC LIMIT 10)
    AND author NOT IN ('AutoModerator', '[deleted]')
    GROUP BY author, subreddit HAVING comments > 50 ) AS a
  JOIN (
    SELECT author, subreddit, COUNT(1) AS comments FROM [fh-bigquery:reddit_comments.2016_01]
    WHERE subreddit IN (SELECT subreddit FROM [fh-bigquery:reddit_comments.subr_rank_201505] ORDER BY comments DESC LIMIT 10)
    AND author NOT IN ('AutoModerator', '[deleted]')
    GROUP BY author, subreddit HAVING comments > 50 ) AS b
  ON a.subreddit = b.subreddit
  WHERE a.author < b.author 
)
GROUP BY usera, userb
HAVING subreddits > 3
ORDER BY subreddits DESC, usera, userb
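
As a rough illustration of what the self-JOIN and the final GROUP BY compute, here is a small Python sketch on toy data. The input rows, the threshold of 1 and the function name `common_subreddit_pairs` are hypothetical. Sorting the authors before pairing plays the role of the a.author < b.author condition, so each pair is counted once and nobody is paired with themselves.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical toy output of step 2: (author, subreddit) for each solid user.
solid_rows = [
    ("alice", "pics"),
    ("bob", "pics"),
    ("alice", "funny"),
    ("bob", "funny"),
    ("carol", "funny"),
]


def common_subreddit_pairs(rows, min_common):
    """Return (usera, userb, subreddits) for pairs sharing > min_common subreddits."""
    authors_by_subreddit = defaultdict(set)
    for author, subreddit in rows:
        authors_by_subreddit[subreddit].add(author)

    pair_counts = Counter()
    for authors in authors_by_subreddit.values():
        # Sorted combinations yield each pair once with usera < userb,
        # like JOIN ... ON a.subreddit = b.subreddit WHERE a.author < b.author.
        for usera, userb in combinations(sorted(authors), 2):
            pair_counts[(usera, userb)] += 1

    # HAVING subreddits > min_common, then ORDER BY subreddits DESC, usera, userb.
    kept = [(a, b, n) for (a, b), n in pair_counts.items() if n > min_common]
    return sorted(kept, key=lambda row: (-row[2], row[0], row[1]))


pairs = common_subreddit_pairs(solid_rows, min_common=1)
print(pairs)
```

The query above uses HAVING subreddits > 3, so with only 10 subreddits in play a pair needs at least 4 in common to appear. Lower that threshold if the result comes back empty.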

Hope this helps.