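I'm trying to use Google BigQuery on the GitHub Archive (http://www.githubarchive.org/) data to get repository statistics as of each repository's most recent event, and I want to do this for the repositories with the most watchers. I realize that's a lot to ask, but I feel like I'm really close to getting it in a single query.
Here is the query I have now: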
SELECT repository_name, repository_owner, repository_organization, repository_size, repository_watchers as watchers, repository_forks as forks, repository_language, MAX(PARSE_UTC_USEC(created_at)) as time
FROM [githubarchive:github.timeline]
GROUP EACH BY repository_name, repository_owner, repository_organization, repository_size, watchers, forks, repository_language
ORDER BY watchers DESC, time DESC
LIMIT 1000
The only problem is that I get every event from the most-watched repository (twitter bootstrap):
Results:
Row repository_name repository_owner repository_organization repository_size watchers forks repository_language time
1 bootstrap twbs twbs 83875 61191 21602 JavaScript 1384991582000000
2 bootstrap twbs twbs 83875 61190 21602 JavaScript 1384991337000000
3 bootstrap twbs twbs 83875 61190 21603 JavaScript 1384989683000000
...
How can I get it to return a single result per repository_name (the latest one, i.e. the row with MAX(time))?
I tried:
SELECT repository_name, repository_owner, repository_organization, repository_size, repository_watchers as watchers, repository_forks as forks, repository_language, MAX(PARSE_UTC_USEC(created_at)) as time
FROM [githubarchive:github.timeline]
WHERE PARSE_UTC_USEC(created_at) IN (SELECT MAX(PARSE_UTC_USEC(created_at)) FROM [githubarchive:github.timeline])
GROUP EACH BY repository_name, repository_owner, repository_organization, repository_size, watchers, forks, repository_language
ORDER BY watchers DESC, time DESC
LIMIT 1000
I'm not sure whether this would work, but it doesn't matter, because I get this error message:
Error: Join attribute is not defined: PARSE_UTC_USEC
Any help would be great, thanks.
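Answer 0 (score: 4)
One problem with that query is that if two actions happen at the same time, your results may get mixed up. You can get what you want by grouping by repository name alone to find the latest event time (MAX(created_at)) for each repository, and then joining that back against the table to pull in the other fields you need. For example: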
SELECT
a.repository_name as name,
a.repository_owner as owner,
a.repository_organization as organization,
a.repository_size as size,
a.repository_watchers AS watchers,
a.repository_forks AS forks,
a.repository_language as language,
PARSE_UTC_USEC(a.created_at) AS time
FROM [githubarchive:github.timeline] a
JOIN EACH
(
SELECT MAX(created_at) as max_created, repository_name
FROM [githubarchive:github.timeline]
GROUP EACH BY repository_name
) b
ON
b.max_created = a.created_at and
b.repository_name = a.repository_name
ORDER BY watchers desc
LIMIT 1000
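If you want to try the "group by name, then join back" pattern without running it against the full dataset, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for BigQuery. The table `timeline`, the helper `latest_per_repo`, the repository `example-repo`, and all sample values are made up for illustration, and SQLite's dialect differs from BigQuery's (no `JOIN EACH` or `GROUP EACH BY`), so treat it as a demonstration of the idea rather than the exact query.

```python
import sqlite3

# Illustrative sample events (hypothetical values, not real GitHub Archive data).
SAMPLE_ROWS = [
    ("bootstrap", 61191, 21602, "2013-11-20 23:53:02"),
    ("bootstrap", 61190, 21602, "2013-11-20 23:48:57"),
    ("bootstrap", 61190, 21603, "2013-11-20 23:21:23"),
    ("example-repo", 300, 40, "2013-11-20 22:00:00"),
    ("example-repo", 299, 40, "2013-11-19 10:00:00"),
]


def latest_per_repo(rows):
    """Return one row per repository: the row with the latest created_at."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE timeline ("
        "repository_name TEXT, repository_watchers INTEGER, "
        "repository_forks INTEGER, created_at TEXT)"
    )
    conn.executemany("INSERT INTO timeline VALUES (?, ?, ?, ?)", rows)
    query = """
        SELECT a.repository_name, a.repository_watchers,
               a.repository_forks, a.created_at
        FROM timeline AS a
        JOIN (
            -- Inner query: the latest event time for each repository.
            SELECT repository_name, MAX(created_at) AS max_created
            FROM timeline
            GROUP BY repository_name
        ) AS b
          ON b.repository_name = a.repository_name
         AND b.max_created = a.created_at
        ORDER BY a.repository_watchers DESC
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in latest_per_repo(SAMPLE_ROWS):
        print(row)
```

With the sample rows above, each repository appears once, carrying the watcher and fork counts from its most recent event. Note that if two events for the same repository share exactly the same timestamp, the join matches both, so you can still see more than one row for that repository.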