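I'm trying to run a simple query that joins fairly large datasets, but I keep running into errors. Here is an equivalent query reproduced against a public dataset: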
SELECT gn1.actor_attributes.blog, gn1.actor_attributes.company, gn1.actor_attributes.email, gn1.actor_attributes.gravatar_id, gn1.actor_attributes.location, gn1.actor_attributes.login, gn1.actor_attributes.name,gn2.actor_attributes.blog, gn2.actor_attributes.company, gn2.actor_attributes.email, gn2.actor_attributes.gravatar_id, gn2.actor_attributes.location, gn2.actor_attributes.login, gn2.actor_attributes.name
FROM [publicdata:samples.github_nested] as gn1 inner join (select actor_attributes.blog, actor_attributes.company,actor_attributes.email, actor_attributes.gravatar_id, actor_attributes.location, actor_attributes.login, actor_attributes.name from [publicdata:samples.github_nested] group by actor_attributes.blog, actor_attributes.company,actor_attributes.email, actor_attributes.gravatar_id, actor_attributes.location, actor_attributes.login, actor_attributes.name) as gn2 on gn1.payload.target.login=gn2.actor_attributes.login
WHERE gn1.type='FollowEvent'
With a plain INNER JOIN (no EACH), BigQuery reports that the data is too large. When I run the query with INNER JOIN EACH instead, BigQuery fails with this error:
Cannot perform partitioned join because gn2 is not parallelizable: (SELECT [actor_attributes.blog], [actor_attributes.company], [actor_attributes.email], [actor_attributes.gravatar_id], [actor_attributes.location], [actor_attributes.login], [actor_attributes.name] FROM [publicdata:samples.github_nested] GROUP BY [actor_attributes.blog], [actor_attributes.company], [actor_attributes.email], [actor_attributes.gravatar_id], [actor_attributes.location], [actor_attributes.login], [actor_attributes.name])
Any help is greatly appreciated.
Thanks
Answer 0 (score: 1)
Thanks for providing a [non-]working public example. It makes debugging much easier.
Reformatting the original query:
SELECT gn1.actor_attributes.blog, gn1.actor_attributes.company, gn1.actor_attributes.email, gn1.actor_attributes.gravatar_id, gn1.actor_attributes.location, gn1.actor_attributes.login, gn1.actor_attributes.name,gn2.actor_attributes.blog, gn2.actor_attributes.company, gn2.actor_attributes.email, gn2.actor_attributes.gravatar_id, gn2.actor_attributes.location, gn2.actor_attributes.login, gn2.actor_attributes.name
FROM [publicdata:samples.github_nested] AS gn1
INNER JOIN EACH (
SELECT actor_attributes.blog, actor_attributes.company,actor_attributes.email, actor_attributes.gravatar_id, actor_attributes.location, actor_attributes.login, actor_attributes.name
FROM [publicdata:samples.github_nested]
GROUP BY actor_attributes.blog, actor_attributes.company,actor_attributes.email, actor_attributes.gravatar_id, actor_attributes.location, actor_attributes.login, actor_attributes.name)
AS gn2 ON gn1.payload.target.login=gn2.actor_attributes.login
WHERE gn1.type='FollowEvent'
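The query does indeed fail with the error message quoted in the question. The error message could be better, but the fix is simple: add EACH to the GROUP BY in the subquery so that the subquery becomes parallelizable: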
SELECT gn1.actor_attributes.blog, gn1.actor_attributes.company, gn1.actor_attributes.email, gn1.actor_attributes.gravatar_id, gn1.actor_attributes.location, gn1.actor_attributes.login, gn1.actor_attributes.name,gn2.actor_attributes.blog, gn2.actor_attributes.company, gn2.actor_attributes.email, gn2.actor_attributes.gravatar_id, gn2.actor_attributes.location, gn2.actor_attributes.login, gn2.actor_attributes.name
FROM [publicdata:samples.github_nested] AS gn1
INNER JOIN EACH (
SELECT actor_attributes.blog, actor_attributes.company,actor_attributes.email, actor_attributes.gravatar_id, actor_attributes.location, actor_attributes.login, actor_attributes.name
FROM [publicdata:samples.github_nested]
GROUP EACH BY actor_attributes.blog, actor_attributes.company,actor_attributes.email, actor_attributes.gravatar_id, actor_attributes.location, actor_attributes.login, actor_attributes.name)
AS gn2 ON gn1.payload.target.login=gn2.actor_attributes.login
WHERE gn1.type='FollowEvent'
[Query complete (12.7s elapsed, 237 MB processed)]
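To make the pattern easier to see, here is a trimmed-down sketch of the same fix that keeps only two columns from the subquery. It is an untested reduction of the query above, not a replacement for it, and the output aliases (follower_login, target_login, target_name) are illustrative names that do not appear in the original.

```sql
-- Sketch only: a reduced form of the query above, in legacy BigQuery SQL.
-- The aliases follower_login, target_login and target_name are hypothetical.
SELECT
  gn1.actor_attributes.login AS follower_login,
  gn2.actor_attributes.login AS target_login,
  gn2.actor_attributes.name AS target_name
FROM [publicdata:samples.github_nested] AS gn1
INNER JOIN EACH (
  -- GROUP EACH BY instead of GROUP BY: per the error message, the
  -- subquery must be parallelizable to take part in a partitioned join.
  SELECT actor_attributes.login, actor_attributes.name
  FROM [publicdata:samples.github_nested]
  GROUP EACH BY actor_attributes.login, actor_attributes.name
) AS gn2
ON gn1.payload.target.login = gn2.actor_attributes.login
WHERE gn1.type = 'FollowEvent'
```

The only change that matters is the pairing: once the join is written as JOIN EACH, an aggregating subquery on either side should use GROUP EACH BY as well.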