Question

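I'm trying to collect the top-scoring comments (say, the top 20) from the top-scoring posts of a given subreddit.

Any help would be much appreciated!

I've been running the query below in BigQuery, but I can't seem to get both the post score and the comment score without running into duplicates.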

SELECT posts.title, posts.score, comments.body, posts.subreddit
FROM `fh-bigquery.reddit_comments.2018_10` AS comments
JOIN `fh-bigquery.reddit_posts.2018_10`  AS posts
ON posts.id = SUBSTR(comments.link_id, 4) 
WHERE posts.subreddit = 'Showerthoughts'

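As a simplified example, I'd like to see something like this:

Post title 1 | Post score | Comment body 1 (under post title 1) | Comment score

Post title 1 | Post score | Comment body 2 (under post title 1) | Comment score

Post title 2 | Post score | Comment body 1 (under post title 2) | Comment score

Post title 2 | Post score | Comment body 2 (under post title 2) | Comment score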

Answer 1

Here's a quick way to deal with the duplicated text blobs:

select title, score, body, subreddit from (
    SELECT 
        -- hash the title so that grouping works on a short key instead of the full text
        to_hex(md5(posts.title)) as title_hash, 
        -- array_agg(...)[offset(0)] keeps one value per group (which one is arbitrary without an ORDER BY)
        array_agg(posts.title)[offset(0)] as title, 
        array_agg(comments.body)[offset(0)] as body, 
        array_agg(posts.score)[offset(0)] as score, 
        array_agg(posts.subreddit)[offset(0)] as subreddit
    FROM `fh-bigquery.reddit_comments.2018_10` AS comments
    JOIN `fh-bigquery.reddit_posts.2018_10`  AS posts
    ON posts.id = SUBSTR(comments.link_id, 4) 
    WHERE posts.subreddit = 'Showerthoughts'
    group by 1
    order by 1
)

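The idea is to turn the expensive text blob into an MD5 hash, group on that hash, and then work with the resulting unique entries. From there you can sort those distinct rows however you need.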
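Note that grouping by the title hash leaves one comment per post title and does not return the comment score. If the goal is the layout from the question (several comments per post, each with its own score), a top-N-per-group query with ROW_NUMBER() is one way to get there. The following is only a sketch, not tested against the dataset: it assumes the comments table has a `score` column (the original query never selects it), and the CTE names `top_posts` and `ranked_comments` and the cut-off of 20 posts are made up for illustration.

```sql
-- Sketch: the 20 highest-scoring comments for each of the 20 highest-scoring posts.
-- Assumption: `comments.score` exists in the comments table.
WITH top_posts AS (
  -- pick the top posts first so the join only touches their comments
  SELECT id, title, score
  FROM `fh-bigquery.reddit_posts.2018_10`
  WHERE subreddit = 'Showerthoughts'
  ORDER BY score DESC
  LIMIT 20
),
ranked_comments AS (
  SELECT
    posts.title    AS post_title,
    posts.score    AS post_score,
    comments.body  AS comment_body,
    comments.score AS comment_score,
    -- number the comments within each post, best score first
    ROW_NUMBER() OVER (
      PARTITION BY posts.id
      ORDER BY comments.score DESC
    ) AS comment_rank
  FROM top_posts AS posts
  JOIN `fh-bigquery.reddit_comments.2018_10` AS comments
    ON posts.id = SUBSTR(comments.link_id, 4)
)
SELECT post_title, post_score, comment_body, comment_score
FROM ranked_comments
WHERE comment_rank <= 20
ORDER BY post_score DESC, post_title, comment_rank
```

Each post title and post score repeats once per comment here, which is what the requested output shows, so that repetition is expected rather than a duplication problem. Change the LIMIT and the comment_rank filter to adjust how many posts and comments come back.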
