Google BigQuery JOIN EACH error

Time: 2013-11-10 11:38:28

Tags: google-bigquery

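I am trying to run a simple query that joins two large datasets, but I keep running into errors. Reproduced here is a similar query that uses the public dataset: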

SELECT gn1.actor_attributes.blog, gn1.actor_attributes.company, gn1.actor_attributes.email, gn1.actor_attributes.gravatar_id, gn1.actor_attributes.location, gn1.actor_attributes.login, gn1.actor_attributes.name,gn2.actor_attributes.blog, gn2.actor_attributes.company, gn2.actor_attributes.email, gn2.actor_attributes.gravatar_id, gn2.actor_attributes.location, gn2.actor_attributes.login, gn2.actor_attributes.name
FROM [publicdata:samples.github_nested] as gn1 inner join (select actor_attributes.blog, actor_attributes.company,actor_attributes.email, actor_attributes.gravatar_id, actor_attributes.location, actor_attributes.login, actor_attributes.name from [publicdata:samples.github_nested] group by actor_attributes.blog, actor_attributes.company,actor_attributes.email, actor_attributes.gravatar_id, actor_attributes.location, actor_attributes.login, actor_attributes.name) as gn2 on gn1.payload.target.login=gn2.actor_attributes.login
WHERE gn1.type='FollowEvent'

Without "inner join each", it says the table is too large. When I run the query with "inner join each", BigQuery returns this error:

Cannot perform partitioned join because gn2 is not parallelizable: (SELECT [actor_attributes.blog], [actor_attributes.company], [actor_attributes.email], [actor_attributes.gravatar_id], [actor_attributes.location], [actor_attributes.login], [actor_attributes.name] FROM [publicdata:samples.github_nested] GROUP BY [actor_attributes.blog], [actor_attributes.company], [actor_attributes.email], [actor_attributes.gravatar_id], [actor_attributes.location], [actor_attributes.login], [actor_attributes.name])

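Any help is much appreciated.

Thanks

1 Answer:

Answer 0: (score: 1)

Thanks for having a [non-]working public example. It makes debugging much easier.

Reformatting the original query: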

SELECT gn1.actor_attributes.blog, gn1.actor_attributes.company, gn1.actor_attributes.email, gn1.actor_attributes.gravatar_id, gn1.actor_attributes.location, gn1.actor_attributes.login, gn1.actor_attributes.name,gn2.actor_attributes.blog, gn2.actor_attributes.company, gn2.actor_attributes.email, gn2.actor_attributes.gravatar_id, gn2.actor_attributes.location, gn2.actor_attributes.login, gn2.actor_attributes.name
FROM [publicdata:samples.github_nested] AS gn1 
INNER JOIN EACH (
  SELECT actor_attributes.blog, actor_attributes.company,actor_attributes.email, actor_attributes.gravatar_id, actor_attributes.location, actor_attributes.login, actor_attributes.name
  FROM [publicdata:samples.github_nested]
  GROUP BY actor_attributes.blog, actor_attributes.company,actor_attributes.email, actor_attributes.gravatar_id, actor_attributes.location, actor_attributes.login, actor_attributes.name) 
AS gn2 ON gn1.payload.target.login=gn2.actor_attributes.login
WHERE gn1.type='FollowEvent'

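The query does indeed fail with the stated error message. While the error message could be better, the fix is simple: just add EACH to the GROUP BY in the subquery to make it parallelizable: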

SELECT gn1.actor_attributes.blog, gn1.actor_attributes.company, gn1.actor_attributes.email, gn1.actor_attributes.gravatar_id, gn1.actor_attributes.location, gn1.actor_attributes.login, gn1.actor_attributes.name,gn2.actor_attributes.blog, gn2.actor_attributes.company, gn2.actor_attributes.email, gn2.actor_attributes.gravatar_id, gn2.actor_attributes.location, gn2.actor_attributes.login, gn2.actor_attributes.name
FROM [publicdata:samples.github_nested] AS gn1 
INNER JOIN EACH (
  SELECT actor_attributes.blog, actor_attributes.company,actor_attributes.email, actor_attributes.gravatar_id, actor_attributes.location, actor_attributes.login, actor_attributes.name
  FROM [publicdata:samples.github_nested]
  GROUP EACH BY actor_attributes.blog, actor_attributes.company,actor_attributes.email, actor_attributes.gravatar_id, actor_attributes.location, actor_attributes.login, actor_attributes.name) 
AS gn2 ON gn1.payload.target.login=gn2.actor_attributes.login
WHERE gn1.type='FollowEvent'

[Query complete (12.7s elapsed, 237 MB processed)]
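
For intuition on why both sides of a JOIN EACH have to be parallelizable, here is a rough conceptual sketch in Python of a hash-partitioned join. It is not how BigQuery is actually implemented; the function names, the sample rows, and the bucket count are all made up for illustration. The idea is that each input is split into buckets by the join key, and bucket i of one side only ever meets bucket i of the other, which presumably is only possible if the subquery itself (here, the grouped gn2) can be produced in partitioned form.

```python
from collections import defaultdict


def partition(rows, key, n):
    """Split rows into n buckets by the hash of the join key (hypothetical helper)."""
    buckets = [[] for _ in range(n)]
    for row in rows:
        buckets[hash(row[key]) % n].append(row)
    return buckets


def partitioned_join(left, right, left_key, right_key, n=4):
    """Inner join where bucket i of `left` is matched only against bucket i of `right`."""
    out = []
    left_parts = partition(left, left_key, n)
    right_parts = partition(right, right_key, n)
    for lpart, rpart in zip(left_parts, right_parts):
        # Build a small lookup table for this bucket only.
        index = defaultdict(list)
        for r in rpart:
            index[r[right_key]].append(r)
        # Equal keys hash to the same bucket, so no match is missed.
        for l in lpart:
            for r in index.get(l[left_key], []):
                out.append((l, r))
    return out


# Made-up rows loosely mirroring the query: follow events joined to actor logins.
follows = [
    {"actor": "bob", "target": "alice"},
    {"actor": "carol", "target": "alice"},
    {"actor": "alice", "target": "dave"},
]
actors = [{"login": "alice"}, {"login": "bob"}]

for follow, actor in partitioned_join(follows, actors, "target", "login"):
    print(follow["actor"], "follows", actor["login"])
```

Each bucket pair can be handled by a separate worker, which is the point of the EACH variants in this sketch's terms.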