I'm creating a table of the number of overlapping commenters between Reddit subreddits via the following self-join:
SELECT t1.subreddit, t2.subreddit, COUNT(*) as NumOverlaps
FROM [fh-bigquery:reddit_comments.2015_05] t1
JOIN [fh-bigquery:reddit_comments.2015_05] t2
ON t1.author=t2.author
WHERE t1.subreddit<t2.subreddit
GROUP BY t1.subreddit, t2.subreddit;
My typical queries against this dataset in BigQuery finish quickly (<1 minute), but this query has been running for over an hour and still hasn't completed. The data has 54,504,410 rows and 22 columns.
Is there an obvious speedup I'm missing that would make this query run fast? Thanks!
Answer (score: 3)
Try the following:
SELECT t1.subreddit, t2.subreddit, SUM(t1.cnt*t2.cnt) as NumOverlaps
FROM (SELECT subreddit, author, COUNT(1) as cnt
FROM [fh-bigquery:reddit_comments.2015_05]
GROUP BY subreddit, author HAVING cnt > 1) t1
JOIN (SELECT subreddit, author, COUNT(1) as cnt
FROM [fh-bigquery:reddit_comments.2015_05]
GROUP BY subreddit, author HAVING cnt > 1) t2
ON t1.author=t2.author
WHERE t1.subreddit<t2.subreddit
GROUP BY t1.subreddit, t2.subreddit
This does two things. First, it pre-aggregates the data to avoid a redundant join. Second, it eliminates "potential outliers": authors with only a single comment in a given subreddit. The second point depends on your use case, of course, but most likely it will be fine, and it resolves the performance problem. If it is still slower than you expect, raise the threshold to 2 or higher.
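To see why the original join is so expensive: before the WHERE filter, each author with n comments produces n×n rows in the self-join. A quick sanity-check sketch in the same legacy dialect (using the table from the question; the alias estimated_join_rows is just illustrative) adds that up:

SELECT SUM(cnt * cnt) AS estimated_join_rows  -- self-join output before the subreddit filter
FROM (SELECT author, COUNT(1) AS cnt
      FROM [fh-bigquery:reddit_comments.2015_05]
      GROUP BY author)

After pre-aggregation, an author contributes one row per subreddit instead of one per comment, so the per-author product shrinks from comments squared to subreddits squared.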
Follow-up: ... 22,545,850,104 ... that doesn't seem right ... Should it be SUM(t1.cnt + t2.cnt)?
Of course it's incorrect, but it's exactly what you would have gotten if you had been able to run the query in question! My hope was that you would catch this! So I'm glad that fixing the "performance" problem also opened eyes to a logic problem in the original query!
So, yes, obviously 22,545,850,104 is not the right number: COUNT(*) in the original self-join counts one row per pair of comments by the same author, so an author with m comments in one subreddit and n in another contributes m*n to that pair, which is exactly what SUM(t1.cnt*t2.cnt) reproduces. So instead of
SUM(t1.cnt*t2.cnt) as NumOverlaps
you should use simply
SUM(1) as NumOverlaps
This will give you a result equivalent to using
EXACT_COUNT_DISTINCT(t1.author) as NumOverlaps
in the original query.
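To make the logic problem concrete, here is a toy sketch in BigQuery Standard SQL (note the dialect switch; the inline data and the names alice, x, y are hypothetical) showing that the original pattern counts comment pairs, not shared authors:

WITH comments AS (
  -- one author: 3 comments in subreddit x, 2 comments in subreddit y
  SELECT 'alice' AS author, 'x' AS subreddit UNION ALL
  SELECT 'alice', 'x' UNION ALL
  SELECT 'alice', 'x' UNION ALL
  SELECT 'alice', 'y' UNION ALL
  SELECT 'alice', 'y'
)
SELECT t1.subreddit AS s1, t2.subreddit AS s2, COUNT(*) AS NumOverlaps
FROM comments t1
JOIN comments t2 ON t1.author = t2.author
WHERE t1.subreddit < t2.subreddit
GROUP BY s1, s2
-- Returns ('x', 'y', 6): 3 comments times 2 comments, not 1 shared author.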
So now try the following:
SELECT t1.subreddit, t2.subreddit, SUM(1) as NumOverlaps
FROM (SELECT subreddit, author, COUNT(1) as cnt
FROM [fh-bigquery:reddit_comments.2015_05]
GROUP BY subreddit, author HAVING cnt > 1) t1
JOIN (SELECT subreddit, author, COUNT(1) as cnt
FROM [fh-bigquery:reddit_comments.2015_05]
GROUP BY subreddit, author HAVING cnt > 1) t2
ON t1.author=t2.author
WHERE t1.subreddit<t2.subreddit
GROUP BY t1.subreddit, t2.subreddit
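For reference, a sketch of the same query in BigQuery Standard SQL, assuming the dataset is reachable under its Standard SQL name fh-bigquery.reddit_comments.2015_05:

#standardSQL
SELECT t1.subreddit AS subreddit1, t2.subreddit AS subreddit2,
       COUNT(*) AS NumOverlaps  -- one row per shared author, same as SUM(1) above
FROM (SELECT subreddit, author
      FROM `fh-bigquery.reddit_comments.2015_05`
      GROUP BY subreddit, author
      HAVING COUNT(*) > 1) t1
JOIN (SELECT subreddit, author
      FROM `fh-bigquery.reddit_comments.2015_05`
      GROUP BY subreddit, author
      HAVING COUNT(*) > 1) t2
ON t1.author = t2.author
WHERE t1.subreddit < t2.subreddit
GROUP BY subreddit1, subreddit2

COUNT(*) here plays the role of SUM(1) in the legacy version; the pre-aggregated subqueries are otherwise a direct translation.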