BigQuery: "Resources exceeded during query execution" when using ROW_NUMBER() to create bins from a 10M-row table

Time: 2019-04-24 16:08:30

Tags: google-bigquery

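I have a table with about 10 million records, where each record is an ID and a probability (between 0 and 1). All IDs are unique. I am trying to split this 10M dataset into 1,000 bins, meaning each bin will hold 10k records. I want the bins to be based on the probability, so I first order the table by probability in descending order and then try to create the bins.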

--10M dataset
with predictions as
(
select id ,probability
from table
)

-- give a row number to each record and then create 1000 groups
, bin_groups as (
select
id,
ceiling(1000.0*ROW_NUMBER() over(order by probability desc) / (select count(distinct id) from predictions)) as bins
from predictions
)

select *
from bin_groups 
where bins = 1 
limit 100

However, when I execute this query I get the following error:

Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 102% of limit. Top memory consumer(s): JOIN operations: 96% other/unattributed: 4%

I read here, https://cloud.google.com/bigquery/docs/best-practices-performance-output#use_a_limit_clause_with_large_sorts, that we need to limit the results when querying, but LIMIT does not seem to help either.

1 Answer:

Answer 0: (score: 0)

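The limit is applied only after the two select statements above have been materialized, so adding the limit to the outer query will not help. You may have to put the limit inside bin_groups, although I am not sure whether that would still fit your use case.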

--10M dataset
with predictions as
(
select id ,probability
from table
)

-- give a row number to each record and then create 1000 groups
, bin_groups as (
select
id,
ceiling(1000.0*ROW_NUMBER() over(order by probability desc) / (select count(distinct id) from predictions)) as bins
from predictions
limit 100
)

select *
from bin_groups 
where bins = 1
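
If moving the limit does not fit the use case, another option that may be worth trying is to avoid the global ROW_NUMBER() ordering altogether. A window with ORDER BY and no PARTITION BY has to sort all 10M rows in one place, which is a common cause of this error. The sketch below is not the asker's query: it computes approximate bin boundaries with APPROX_QUANTILES and assigns each row to a bin with RANGE_BUCKET. The table name `my_project.my_dataset.my_table` and the CTE names `boundaries` and `binned` are placeholders introduced here for illustration.

```sql
-- Sketch only: `my_project.my_dataset.my_table` is a placeholder table name.
WITH predictions AS (
  SELECT id, probability
  FROM `my_project.my_dataset.my_table`
),
boundaries AS (
  -- One row holding 1001 approximate cut points in ascending order:
  -- the minimum, 999 interior quantiles, and the maximum.
  SELECT APPROX_QUANTILES(probability, 1000) AS cut_points
  FROM predictions
),
binned AS (
  SELECT
    p.id,
    p.probability,
    -- RANGE_BUCKET returns the ascending bucket index of the value.
    -- Clamp it to 1..1000 so values at the extremes stay in the outer
    -- buckets, then flip it so bin 1 holds the highest probabilities.
    1001 - GREATEST(1, LEAST(RANGE_BUCKET(p.probability, b.cut_points), 1000)) AS bins
  FROM predictions AS p
  CROSS JOIN boundaries AS b  -- boundaries has exactly one row
)
SELECT *
FROM binned
WHERE bins = 1
LIMIT 100
```

Because the quantiles are approximate, each bin will hold roughly, not exactly, 10k records, and rows with identical probabilities always land in the same bin. If exact bin sizes are required, this approach is not a drop-in replacement for the ROW_NUMBER() version.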