I'm creating a table of the number of overlapping commenters between Reddit subreddits via the following self-join:
SELECT t1.subreddit, t2.subreddit, COUNT(*) as NumOverlaps
FROM [fh-bigquery:reddit_comments.2015_05] t1
JOIN [fh-bigquery:reddit_comments.2015_05] t2
ON t1.author=t2.author
WHERE t1.subreddit<t2.subreddit
GROUP BY t1.subreddit, t2.subreddit;
My typical queries against this dataset in BigQuery finish quickly (<1 minute), but this query has been running for over an hour and still hasn't completed. The data has 54,504,410 rows and 22 columns.
Is there an obvious speedup I'm missing that would make this query run fast? Thanks!
Answer (score: 3)
Try the following:
SELECT t1.subreddit, t2.subreddit, SUM(t1.cnt*t2.cnt) as NumOverlaps
FROM (SELECT subreddit, author, COUNT(1) as cnt
FROM [fh-bigquery:reddit_comments.2015_05]
GROUP BY subreddit, author HAVING cnt > 1) t1
JOIN (SELECT subreddit, author, COUNT(1) as cnt
FROM [fh-bigquery:reddit_comments.2015_05]
GROUP BY subreddit, author HAVING cnt > 1) t2
ON t1.author=t2.author
WHERE t1.subreddit<t2.subreddit
GROUP BY t1.subreddit, t2.subreddit
This does two things. First, it pre-aggregates the data to avoid a redundant join. Second, it eliminates "potential outliers": authors with only a single comment in a given subreddit. The second point depends on your use case, of course, but most likely it will be fine, and it resolves the performance problem. If it is still slower than you expect, raise the threshold to 2 or higher.
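To see why the original join is so expensive: before the WHERE filter, each author with n comments produces n×n rows in the self-join. A quick sanity-check sketch in the same legacy dialect (using the table from the question; the alias estimated_join_rows is just illustrative) adds that up:

SELECT SUM(cnt * cnt) AS estimated_join_rows  -- self-join output before the subreddit filter
FROM (SELECT author, COUNT(1) AS cnt
      FROM [fh-bigquery:reddit_comments.2015_05]
      GROUP BY author)

After pre-aggregation, an author contributes one row per subreddit instead of one per comment, so the per-author product shrinks from comments squared to subreddits squared.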
Follow-up: ... 22,545,850,104 ... that doesn't seem right ... Should it be SUM(t1.cnt + t2.cnt)?
Of course it's incorrect, but it's exactly what you would have gotten if you had been able to run the query in question! My hope was that you would catch this! So I'm glad that fixing the "performance" problem also opened eyes to a logic problem in the original query!
So, yes, obviously 22,545,850,104 is not the right number: COUNT(*) in the original self-join counts one row per pair of comments by the same author, so an author with m comments in one subreddit and n in another contributes m*n to that pair, which is exactly what SUM(t1.cnt*t2.cnt) reproduces. So instead of
SUM(t1.cnt*t2.cnt) as NumOverlaps
you should use simply
SUM(1) as NumOverlaps
This will give you a result equivalent to using
EXACT_COUNT_DISTINCT(t1.author) as NumOverlaps
in the original query.
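To make the logic problem concrete, here is a toy sketch in BigQuery Standard SQL (note the dialect switch; the inline data and the names alice, x, y are hypothetical) showing that the original pattern counts comment pairs, not shared authors:

WITH comments AS (
  -- one author: 3 comments in subreddit x, 2 comments in subreddit y
  SELECT 'alice' AS author, 'x' AS subreddit UNION ALL
  SELECT 'alice', 'x' UNION ALL
  SELECT 'alice', 'x' UNION ALL
  SELECT 'alice', 'y' UNION ALL
  SELECT 'alice', 'y'
)
SELECT t1.subreddit AS s1, t2.subreddit AS s2, COUNT(*) AS NumOverlaps
FROM comments t1
JOIN comments t2 ON t1.author = t2.author
WHERE t1.subreddit < t2.subreddit
GROUP BY s1, s2
-- Returns ('x', 'y', 6): 3 comments times 2 comments, not 1 shared author.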
So now try the following:
SELECT t1.subreddit, t2.subreddit, SUM(1) as NumOverlaps
FROM (SELECT subreddit, author, COUNT(1) as cnt
FROM [fh-bigquery:reddit_comments.2015_05]
GROUP BY subreddit, author HAVING cnt > 1) t1
JOIN (SELECT subreddit, author, COUNT(1) as cnt
FROM [fh-bigquery:reddit_comments.2015_05]
GROUP BY subreddit, author HAVING cnt > 1) t2
ON t1.author=t2.author
WHERE t1.subreddit<t2.subreddit
GROUP BY t1.subreddit, t2.subreddit
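For reference, a sketch of the same query in BigQuery Standard SQL, assuming the dataset is reachable under its Standard SQL name fh-bigquery.reddit_comments.2015_05:

#standardSQL
SELECT t1.subreddit AS subreddit1, t2.subreddit AS subreddit2,
       COUNT(*) AS NumOverlaps  -- one row per shared author, same as SUM(1) above
FROM (SELECT subreddit, author
      FROM `fh-bigquery.reddit_comments.2015_05`
      GROUP BY subreddit, author
      HAVING COUNT(*) > 1) t1
JOIN (SELECT subreddit, author
      FROM `fh-bigquery.reddit_comments.2015_05`
      GROUP BY subreddit, author
      HAVING COUNT(*) > 1) t2
ON t1.author = t2.author
WHERE t1.subreddit < t2.subreddit
GROUP BY subreddit1, subreddit2

COUNT(*) here plays the role of SUM(1) in the legacy version; the pre-aggregated subqueries are otherwise a direct translation.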