Self-join in BigQuery is very slow, am I following best practices?

Asked: 2016-11-12 10:57:12

Tags: google-bigquery

I am creating a table of the number of overlapping commenters between Reddit subreddits via the following self-join:

SELECT t1.subreddit, t2.subreddit, COUNT(*) as NumOverlaps
FROM [fh-bigquery:reddit_comments.2015_05] t1
JOIN [fh-bigquery:reddit_comments.2015_05] t2
ON t1.author=t2.author
WHERE t1.subreddit<t2.subreddit
GROUP BY t1.subreddit, t2.subreddit;

Typical queries I run against this dataset in BigQuery complete quickly (<1 minute), but this query has been running for over an hour and still has not finished. The data has 54,504,410 rows and 22 columns.

Am I missing an obvious speedup I should be applying to make this query run fast? Thanks!

1 answer:

Answer 0 (score: 3)

Try the following:

SELECT t1.subreddit, t2.subreddit, SUM(t1.cnt*t2.cnt) as NumOverlaps
FROM (SELECT subreddit, author, COUNT(1) as cnt 
      FROM [fh-bigquery:reddit_comments.2015_05] 
      GROUP BY subreddit, author HAVING cnt > 1) t1
JOIN (SELECT subreddit, author, COUNT(1) as cnt 
      FROM [fh-bigquery:reddit_comments.2015_05] 
      GROUP BY subreddit, author HAVING cnt > 1) t2
ON t1.author=t2.author
WHERE t1.subreddit<t2.subreddit
GROUP BY t1.subreddit, t2.subreddit

It does two things. First, it pre-aggregates the data to avoid a redundant join. Second, it eliminates "potential outliers": authors with only a single post in a subreddit. Of course, the second item depends on your use case, but most likely it should be fine and thus resolve the performance problem. If it is still slower than you expect, raise the threshold to 2 or higher.
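If you want to see how much the pre-aggregation shrinks the input that feeds the join, a quick sanity count along these lines (a sketch reusing the same legacy-SQL table reference; pairs_kept is just an illustrative alias) can be compared against the 54,504,410 raw rows quoted in the question:

-- each surviving (subreddit, author) pair becomes one row on each side of the join
SELECT COUNT(1) as pairs_kept
FROM (SELECT subreddit, author, COUNT(1) as cnt
      FROM [fh-bigquery:reddit_comments.2015_05]
      GROUP BY subreddit, author HAVING cnt > 1)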


Follow-up: ...... 22,545,850,104 ... doesn't seem correct ... should it be SUM(t1.cnt + t2.cnt)?

Of course it is incorrect, but it is exactly what you would have gotten had you been able to run the query in question! My hope was that you would catch this! So I am glad that fixing the "performance" problem opened eyes to the logic problem in the original query!
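For intuition on why the number blows up: SUM(t1.cnt*t2.cnt) adds up comment pairs, not overlapping authors. A minimal illustration with made-up numbers (written in BigQuery standard SQL for brevity; author_x, the counts 3 and 5, and the aliases are all hypothetical):

-- one author with 3 comments in one subreddit and 5 in another:
-- pair_count = 3*5 = 15, while author_count = 1
SELECT SUM(t1.cnt*t2.cnt) AS pair_count, SUM(1) AS author_count
FROM (SELECT 'author_x' AS author, 3 AS cnt) t1
JOIN (SELECT 'author_x' AS author, 5 AS cnt) t2
ON t1.author=t2.author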

So, yes, obviously 22,545,850,104 is not the correct number. So, instead of

    SUM(t1.cnt*t2.cnt) as NumOverlaps   

you should use simply

    SUM(1) as NumOverlaps

This will give you a result equivalent to using

    EXACT_COUNT_DISTINCT(t1.author) as NumOverlaps

in the original query.
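Spelled out against the original self-join, that equivalent would be the sketch below, for comparing results only: it keeps the original query's performance problem, and it matches the SUM(1) version only up to the HAVING cnt > 1 filter.

SELECT t1.subreddit, t2.subreddit, EXACT_COUNT_DISTINCT(t1.author) as NumOverlaps
FROM [fh-bigquery:reddit_comments.2015_05] t1
JOIN [fh-bigquery:reddit_comments.2015_05] t2
ON t1.author=t2.author
WHERE t1.subreddit<t2.subreddit
GROUP BY t1.subreddit, t2.subreddit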

So, now try the below:

SELECT t1.subreddit, t2.subreddit, SUM(1) as NumOverlaps
FROM (SELECT subreddit, author, COUNT(1) as cnt 
      FROM [fh-bigquery:reddit_comments.2015_05] 
      GROUP BY subreddit, author HAVING cnt > 1) t1
JOIN (SELECT subreddit, author, COUNT(1) as cnt 
      FROM [fh-bigquery:reddit_comments.2015_05] 
      GROUP BY subreddit, author HAVING cnt > 1) t2
ON t1.author=t2.author
WHERE t1.subreddit<t2.subreddit
GROUP BY t1.subreddit, t2.subreddit
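As a side note, the same fixed query in BigQuery standard SQL would look roughly like the sketch below (assuming the backtick-quoted `fh-bigquery.reddit_comments.2015_05` path that corresponds to the legacy [project:dataset.table] reference; COUNT(*) replaces SUM(1)):

#standardSQL
SELECT t1.subreddit, t2.subreddit, COUNT(*) AS NumOverlaps
FROM (SELECT subreddit, author
      FROM `fh-bigquery.reddit_comments.2015_05`
      GROUP BY subreddit, author HAVING COUNT(1) > 1) t1
JOIN (SELECT subreddit, author
      FROM `fh-bigquery.reddit_comments.2015_05`
      GROUP BY subreddit, author HAVING COUNT(1) > 1) t2
ON t1.author = t2.author
WHERE t1.subreddit < t2.subreddit
GROUP BY t1.subreddit, t2.subreddit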