I have a table with about 10 million records, where each record is an ID and a probability (ranging from 0 to 1). All IDs are unique. I am trying to split this 10M dataset into 1,000 bins, meaning each bin will hold 10k records. But I want to compute the bins based on the probability, so I first sort the table in descending order of probability and then try to create the bins.
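The intended bucketing can be sketched in Python. This is a toy illustration of the same formula the query uses, `ceil(n_bins * rank / N)` over rows ranked by descending probability; `assign_bins` and the sample probabilities are made up for the example, not part of the original query:

```python
import math

def assign_bins(probabilities, n_bins):
    """Rank rows by probability (descending), then map rank r of N rows
    to bin ceil(n_bins * r / N), mirroring the SQL ceiling expression."""
    n = len(probabilities)
    # indices sorted so that the highest probability gets rank 1
    ranked = sorted(range(n), key=lambda i: probabilities[i], reverse=True)
    bins = {}
    for rank, idx in enumerate(ranked, start=1):
        bins[idx] = math.ceil(n_bins * rank / n)
    return bins

probs = [0.9, 0.1, 0.5, 0.7, 0.3, 0.8, 0.2, 0.6, 0.4, 0.05]
bins = assign_bins(probs, 5)
# 10 records into 5 bins -> 2 records per bin;
# the two highest probabilities (0.9 and 0.8) land in bin 1
```

With 10M rows and 1,000 bins the same formula yields 10k rows per bin, which is the behavior the query aims for.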
--10M dataset
with predictions as
(
select id, probability
from table
)
-- give a row_number to each record and then create 1000 groups
, bin_groups as (
select
id,
ceiling(1000.0*ROW_NUMBER() over(order by probability desc) / (select count(distinct id) from predictions)) as bins
from predictions
)
select *
from bin_groups
where bins = 1
limit 100
However, when I execute this query I get the following error:
Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 102% of limit. Top memory consumer(s): JOIN operations: 96% other/unattributed: 4%
I read here - https://cloud.google.com/bigquery/docs/best-practices-performance-output#use_a_limit_clause_with_large_sorts - that we should limit the result set for large sorts, but LIMIT doesn't seem to work either.
Answer 0 (score: 0)
The LIMIT is applied only after the two select statements above have already run, so adding it to the outer query won't help. You may have to move the LIMIT inside bin_groups, though I'm not sure that still fits your use case.
--10M dataset
with predictions as
(
select id, probability
from table
)
-- give a row_number to each record and then create 1000 groups
, bin_groups as (
select
id,
ceiling(1000.0*ROW_NUMBER() over(order by probability desc) / (select count(distinct id) from predictions)) as bins
from predictions
limit 100
)
select *
from bin_groups
where bins = 1