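I have a table of about 5M rows. Note that this is only a POC; eventually we need to be in the TB range. I am doing a self-join to find product permutations for market basket analysis.
I need to find the number of times a combination occurs in a basket, the ratio of those occurrences to the total number of baskets, and the number of times an item appears across all baskets. This is pretty standard. BigQuery does not support a select inside the predicate of another select, so I think I need to create another join. This is what I came up with: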
SELECT twoItem.upc1, twoItem.upc2, twoItem.twoItemOccurrences, totalUpc.totalUpcCount
FROM
(
  SELECT purchase1.upc AS upc1, purchase2.upc AS upc2, COUNT(upc1) AS twoItemOccurrences
  FROM conagra.purchase AS purchase1
  JOIN EACH conagra.purchase AS purchase2
  ON purchase1.upc = purchase2.upc
  GROUP BY upc1, upc2
) AS twoItem
JOIN EACH
(
  SELECT purchase3.upc AS upc3, COUNT(*) AS totalUpcCount
  FROM conagra.purchase AS purchase3
  GROUP BY upc3
) AS totalUpc
ON totalUpc.upc3 = twoItem.upc1
LIMIT 50;
I get the following error:
SHUFFLE BY may only be applied to parallelizable queries, but query is not parallelizable: (SELECT * FROM (SELECT [purchase3.upc] AS [upc3], COUNT(*) AS [totalUpcCount]...
Maybe this is an undocumented limitation?
Any help would be greatly appreciated.
Answer 0 (score: 1)
Try using GROUP EACH BY on the inner queries to get these to run. We will improve the error message returned for this type of query.
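For illustration, applying that suggestion to the query from the question might look like the sketch below. It assumes legacy BigQuery SQL and has not been run against the conagra.purchase table; the only change from the original query is that both inner GROUP BY clauses become GROUP EACH BY.

```sql
-- Sketch only: the question's query with GROUP EACH BY in both subqueries.
-- Table and column names are taken from the question as-is.
SELECT twoItem.upc1, twoItem.upc2, twoItem.twoItemOccurrences, totalUpc.totalUpcCount
FROM
(
  SELECT purchase1.upc AS upc1, purchase2.upc AS upc2, COUNT(upc1) AS twoItemOccurrences
  FROM conagra.purchase AS purchase1
  JOIN EACH conagra.purchase AS purchase2
  ON purchase1.upc = purchase2.upc
  GROUP EACH BY upc1, upc2  -- was: GROUP BY upc1, upc2
) AS twoItem
JOIN EACH
(
  SELECT purchase3.upc AS upc3, COUNT(*) AS totalUpcCount
  FROM conagra.purchase AS purchase3
  GROUP EACH BY upc3  -- was: GROUP BY upc3
) AS totalUpc
ON totalUpc.upc3 = twoItem.upc1
LIMIT 50;
```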