Google BigQuery GROUP BY timeout

Asked: 2016-04-05 10:06:45

Tags: sql google-bigquery

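I am trying to query collaborator logins, repository language, and repository name from the GitHub Archive using Google BigQuery. The query below works fine if I leave out the GROUP BY, but with the GROUP BY it keeps running until I get a timeout from Google BigQuery. Since Google BigQuery has no DISTINCT, I am trying to use GROUP BY as a DISTINCT so that I don't get duplicate rows. Here is the query I am using: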

SELECT
    a1.actor_attributes_login,
    a2.actor_attributes_login,
    a1.repository_language,
    a1.repository_name,
FROM
    [githubarchive:year.2014] AS a1
LEFT JOIN
    [githubarchive:year.2014] AS a2
ON
    a1.repository_name = a2.repository_name
WHERE
    a1.actor_attributes_login != a2.actor_attributes_login
    AND a1.actor_attributes_location = "California"
    AND (a1.repository_language = "Java"
      OR a1.repository_language = "Python")
GROUP BY
    a1.actor_attributes_login,
    a2.actor_attributes_login,
    a1.repository_language,
    a1.repository_name
LIMIT
    10000

1 answer:

Answer 0 (score: 1)

Hmmm. You might try removing the duplicates before doing the join:

SELECT a1.actor_attributes_login, a2.actor_attributes_login,
       a1.repository_language, a1.repository_name
FROM (SELECT a.actor_attributes_login, a.repository_language, a.repository_name
      FROM [githubarchive:year.2014] a
      WHERE a.actor_attributes_location = 'California' AND
            a.repository_language IN ('Java', 'Python')
      GROUP BY a.actor_attributes_login, a.repository_language, a.repository_name
     ) a1 LEFT JOIN
     (SELECT a1.actor_attributes_login, a1.repository_language, a1.repository_name
      FROM [githubarchive:year.2014] a1
      GROUP BY a1.actor_attributes_login, a1.repository_language, a1.repository_name
     ) a2
     ON a1.repository_name = a2.repository_name
WHERE a1.actor_attributes_login <> a2.actor_attributes_login
LIMIT 10000;

If you eliminate the duplicates in the subqueries, I don't think you need the outer GROUP BY.

Also, if you use ORDER BY, you should have a LIMIT.
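
To give a rough feel for why deduplicating before the join helps, here is a small sketch that uses Python's standard-library sqlite3 module rather than BigQuery. The `events` table, its two columns, and the sample rows are all made up for illustration; the real GitHub Archive table is far larger and has different column names. It only compares how many rows each approach has to produce before duplicates are removed.

```python
import sqlite3

# Hypothetical stand-in for the event table: one row per event.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (login TEXT, repo TEXT)")

# Three users, each with 100 events on the same repository.
rows = [(user, "demo-repo")
        for user in ("alice", "bob", "carol")
        for _ in range(100)]
conn.executemany("INSERT INTO events VALUES (?, ?)", rows)

# Join first: every event of one user pairs with every event of the others.
# A plain JOIN is used; the WHERE filter on a2 would drop the unmatched rows
# of a LEFT JOIN anyway.
JOIN_FIRST = """
    FROM events AS a1
    JOIN events AS a2 ON a1.repo = a2.repo
    WHERE a1.login <> a2.login
"""

# Deduplicate first: each side shrinks to one row per (login, repo).
DEDUP_FIRST = """
    FROM (SELECT login, repo FROM events GROUP BY login, repo) AS a1
    JOIN (SELECT login, repo FROM events GROUP BY login, repo) AS a2
      ON a1.repo = a2.repo
    WHERE a1.login <> a2.login
"""

def joined_row_count(from_clause):
    """Number of rows the join produces before any outer GROUP BY."""
    return conn.execute("SELECT COUNT(*) " + from_clause).fetchone()[0]

def distinct_pairs(from_clause):
    """Distinct (login, login, repo) combinations the join yields."""
    query = ("SELECT a1.login, a2.login, a1.repo " + from_clause +
             " GROUP BY a1.login, a2.login, a1.repo")
    return set(conn.execute(query).fetchall())

joined_first = joined_row_count(JOIN_FIRST)
dedup_first = joined_row_count(DEDUP_FIRST)
same_result = distinct_pairs(JOIN_FIRST) == distinct_pairs(DEDUP_FIRST)

print(joined_first, dedup_first, same_result)  # 60000 6 True
```

In this toy data both approaches end up with the same six collaborator pairs, but the join-first version has to build 60,000 rows and then group them, while the deduplicate-first version joins only three rows against three. On a table the size of a full year of GitHub Archive, that difference is the likely reason the original query times out.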