Google BigQuery: how to get distinct rows for a value in query results

Time: 2013-11-21 01:25:54

Tags: google-bigquery github-archive

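I'm trying to use Google BigQuery on the GitHub Archive data (http://www.githubarchive.org/) to get each repository's statistics as of its latest event, and I want this for the repositories with the most watchers. I realize that's a lot to ask, but I feel I'm really close to getting it in a single query.

This is my current query: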

SELECT repository_name, repository_owner, repository_organization, repository_size,  repository_watchers as watchers, repository_forks as forks, repository_language, MAX(PARSE_UTC_USEC(created_at)) as time
FROM [githubarchive:github.timeline]
GROUP EACH BY repository_name, repository_owner, repository_organization, repository_size, watchers, forks, repository_language
ORDER BY watchers DESC, time DESC
LIMIT 1000

The only problem is that I get every event from the most-watched repository (twitter bootstrap):

Results:

Row repository_name repository_owner    repository_organization repository_size watchers    forks   repository_language time     
1   bootstrap           twbs                    twbs                   83875      61191     21602   JavaScript          1384991582000000     
2   bootstrap           twbs                    twbs                   83875      61190     21602   JavaScript          1384991337000000     
3   bootstrap           twbs                    twbs                   83875      61190     21603   JavaScript          1384989683000000

...

How can I make it return a single result per repository_name (the latest one, i.e. MAX(time))?

I tried:

SELECT repository_name, repository_owner, repository_organization, repository_size, repository_watchers as watchers, repository_forks as forks, repository_language, MAX(PARSE_UTC_USEC(created_at)) as time
FROM [githubarchive:github.timeline]
WHERE PARSE_UTC_USEC(created_at) IN (SELECT MAX(PARSE_UTC_USEC(created_at)) FROM [githubarchive:github.timeline])
GROUP EACH BY repository_name, repository_owner, repository_organization, repository_size, watchers, forks, repository_language
ORDER BY watchers DESC, time DESC
LIMIT 1000

I'm not sure whether this would work, but it doesn't matter, because I get this error message:

Error: Join attribute is not defined: PARSE_UTC_USEC

Any help would be great, thanks.

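1 answer:

Answer 0: (score: 4)

One problem with that query is that if two actions happened at the same time, your results might get mixed up. You can get what you want by grouping only on the repository name to find the maximum created_at per repository, and then joining against that result to pick up the other fields you need. For example: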

SELECT
  a.repository_name as name,
  a.repository_owner as owner,
  a.repository_organization as organization,
  a.repository_size as size,
  a.repository_watchers AS watchers,
  a.repository_forks AS forks,
  a.repository_language as language,
  PARSE_UTC_USEC(a.created_at) AS time
FROM [githubarchive:github.timeline] a
JOIN EACH
  (
     SELECT MAX(created_at) as max_created, repository_name 
     FROM [githubarchive:github.timeline]
     GROUP EACH BY repository_name
  ) b
  ON 
  b.max_created = a.created_at and
  b.repository_name = a.repository_name
ORDER BY watchers desc
LIMIT 1000
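
To try the "group by name for the maximum, then join back" pattern without BigQuery, here is a minimal sketch using Python's built-in sqlite3 module. It is an illustration under stated assumptions, not the BigQuery query itself: the table is a toy stand-in named timeline, the column created_usec holds the already-parsed timestamp, the three bootstrap rows reuse the values from the question's result, and the "example-repo" rows are made up.

```python
import sqlite3

# Toy stand-in for the timeline table. The three bootstrap rows reuse the
# values shown in the question's result; the "example-repo" rows are made up.
EVENTS = [
    # (repository_name, watchers, forks, created_usec)
    ("bootstrap", 61191, 21602, 1384991582000000),
    ("bootstrap", 61190, 21602, 1384991337000000),
    ("bootstrap", 61190, 21603, 1384989683000000),
    ("example-repo", 120, 7, 1384990000000000),
    ("example-repo", 118, 7, 1384980000000000),
]

LATEST_PER_REPO = """
SELECT a.repository_name, a.watchers, a.forks, a.created_usec
FROM timeline AS a
JOIN (
    -- one row per repository: the time of its most recent event
    SELECT repository_name, MAX(created_usec) AS max_created
    FROM timeline
    GROUP BY repository_name
) AS b
  ON b.repository_name = a.repository_name
 AND b.max_created = a.created_usec
ORDER BY a.watchers DESC
"""


def latest_per_repo(events):
    """Return the newest event row for each repository, most watched first."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE timeline ("
            "repository_name TEXT, watchers INTEGER, "
            "forks INTEGER, created_usec INTEGER)"
        )
        conn.executemany("INSERT INTO timeline VALUES (?, ?, ?, ?)", events)
        return conn.execute(LATEST_PER_REPO).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in latest_per_repo(EVENTS):
        print(row)
```

With this toy data the function returns one row per repository. Note that if two events of the same repository share the maximum timestamp, the join returns both of them, so ties still need separate handling.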