Question

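I'm trying to run a simple query that joins large datasets, but I'm getting various errors. Reproduced here is a similar query that uses a public dataset: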

SELECT gn1.actor_attributes.blog, gn1.actor_attributes.company, gn1.actor_attributes.email, gn1.actor_attributes.gravatar_id, gn1.actor_attributes.location, gn1.actor_attributes.login, gn1.actor_attributes.name,gn2.actor_attributes.blog, gn2.actor_attributes.company, gn2.actor_attributes.email, gn2.actor_attributes.gravatar_id, gn2.actor_attributes.location, gn2.actor_attributes.login, gn2.actor_attributes.name
FROM [publicdata:samples.github_nested] as gn1 inner join (select actor_attributes.blog, actor_attributes.company,actor_attributes.email, actor_attributes.gravatar_id, actor_attributes.location, actor_attributes.login, actor_attributes.name from [publicdata:samples.github_nested] group by actor_attributes.blog, actor_attributes.company,actor_attributes.email, actor_attributes.gravatar_id, actor_attributes.location, actor_attributes.login, actor_attributes.name) as gn2 on gn1.payload.target.login=gn2.actor_attributes.login
WHERE gn1.type='FollowEvent'

Without "inner join each", it says the database size is too large. When I run the query with "inner join each", BigQuery returns an error stating:

Cannot perform partitioned join because gn2 is not parallelizable: (SELECT [actor_attributes.blog], [actor_attributes.company], [actor_attributes.email], [actor_attributes.gravatar_id], [actor_attributes.location], [actor_attributes.login], [actor_attributes.name] FROM [publicdata:samples.github_nested] GROUP BY [actor_attributes.blog], [actor_attributes.company], [actor_attributes.email], [actor_attributes.gravatar_id], [actor_attributes.location], [actor_attributes.login], [actor_attributes.name])

Any help is greatly appreciated.

Thanks

Answer 1

Thanks for providing a [non-]working public example. It makes debugging much easier.

Reformatting the original query:

SELECT gn1.actor_attributes.blog, gn1.actor_attributes.company, gn1.actor_attributes.email, gn1.actor_attributes.gravatar_id, gn1.actor_attributes.location, gn1.actor_attributes.login, gn1.actor_attributes.name,gn2.actor_attributes.blog, gn2.actor_attributes.company, gn2.actor_attributes.email, gn2.actor_attributes.gravatar_id, gn2.actor_attributes.location, gn2.actor_attributes.login, gn2.actor_attributes.name
FROM [publicdata:samples.github_nested] AS gn1 
INNER JOIN EACH (
  SELECT actor_attributes.blog, actor_attributes.company,actor_attributes.email, actor_attributes.gravatar_id, actor_attributes.location, actor_attributes.login, actor_attributes.name
  FROM [publicdata:samples.github_nested]
  GROUP BY actor_attributes.blog, actor_attributes.company,actor_attributes.email, actor_attributes.gravatar_id, actor_attributes.location, actor_attributes.login, actor_attributes.name) 
AS gn2 ON gn1.payload.target.login=gn2.actor_attributes.login
WHERE gn1.type='FollowEvent'

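The query does indeed fail with the stated error message. The error message could be better, but the fix is simple: add EACH to the GROUP BY in the subquery to make it parallelizable: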

SELECT gn1.actor_attributes.blog, gn1.actor_attributes.company, gn1.actor_attributes.email, gn1.actor_attributes.gravatar_id, gn1.actor_attributes.location, gn1.actor_attributes.login, gn1.actor_attributes.name,gn2.actor_attributes.blog, gn2.actor_attributes.company, gn2.actor_attributes.email, gn2.actor_attributes.gravatar_id, gn2.actor_attributes.location, gn2.actor_attributes.login, gn2.actor_attributes.name
FROM [publicdata:samples.github_nested] AS gn1 
INNER JOIN EACH (
  SELECT actor_attributes.blog, actor_attributes.company,actor_attributes.email, actor_attributes.gravatar_id, actor_attributes.location, actor_attributes.login, actor_attributes.name
  FROM [publicdata:samples.github_nested]
  GROUP EACH BY actor_attributes.blog, actor_attributes.company,actor_attributes.email, actor_attributes.gravatar_id, actor_attributes.location, actor_attributes.login, actor_attributes.name) 
AS gn2 ON gn1.payload.target.login=gn2.actor_attributes.login
WHERE gn1.type='FollowEvent'

[Query complete (12.7s elapsed, 237 MB processed)]
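
With the long column lists stripped away, the pattern is easier to see. The following is only a sketch: it is the same legacy BigQuery SQL query reduced to the join key (an illustrative trimming, not something I ran separately), so the timing and bytes processed reported above do not apply to it.

```sql
-- Legacy BigQuery SQL, trimmed to the join key for readability.
-- Aliases gn1/gn2 and the join condition are the ones used in the full query.
SELECT
  gn1.actor_attributes.login,
  gn2.actor_attributes.login
FROM [publicdata:samples.github_nested] AS gn1
INNER JOIN EACH (
  SELECT actor_attributes.login
  FROM [publicdata:samples.github_nested]
  -- EACH on the GROUP BY is what makes the subquery parallelizable
  GROUP EACH BY actor_attributes.login
) AS gn2
ON gn1.payload.target.login = gn2.actor_attributes.login
WHERE gn1.type = 'FollowEvent'
```

The point to take away is that the two EACH modifiers go together here: JOIN EACH needs a parallelizable right-hand side, and a subquery with a plain GROUP BY is not one.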
