I have a table with about 10 million records, where each record is an ID and a probability (ranging from 0 to 1). All IDs are unique. I am trying to split this 10M dataset into 1,000 bins, meaning each bin will hold 10k records. But I want to compute the bins based on the probability, so I first sort the table in descending order of probability and then try to create the bins.
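The intended bucketing can be sketched in Python. This is a toy illustration of the same formula the query uses, `ceil(n_bins * rank / N)` over rows ranked by descending probability; `assign_bins` and the sample probabilities are made up for the example, not part of the original query:

```python
import math

def assign_bins(probabilities, n_bins):
    """Rank rows by probability (descending), then map rank r of N rows
    to bin ceil(n_bins * r / N), mirroring the SQL ceiling expression."""
    n = len(probabilities)
    # indices sorted so that the highest probability gets rank 1
    ranked = sorted(range(n), key=lambda i: probabilities[i], reverse=True)
    bins = {}
    for rank, idx in enumerate(ranked, start=1):
        bins[idx] = math.ceil(n_bins * rank / n)
    return bins

probs = [0.9, 0.1, 0.5, 0.7, 0.3, 0.8, 0.2, 0.6, 0.4, 0.05]
bins = assign_bins(probs, 5)
# 10 records into 5 bins -> 2 records per bin;
# the two highest probabilities (0.9 and 0.8) land in bin 1
```

With 10M rows and 1,000 bins the same formula yields 10k rows per bin, which is the behavior the query aims for.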
--10M dataset
with predictions as
(
select id, probability
from table
)
-- give a row_number to each record and then create 1000 groups
, bin_groups as (
select
id,
ceiling(1000.0*ROW_NUMBER() over(order by probability desc) / (select count(distinct id) from predictions)) as bins
from predictions
)
select *
from bin_groups
where bins = 1
limit 100
However, when I execute this query I get the following error:
Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 102% of limit. Top memory consumer(s): JOIN operations: 96% other/unattributed: 4%
I read here - https://cloud.google.com/bigquery/docs/best-practices-performance-output#use_a_limit_clause_with_large_sorts - that we should limit the result set for large sorts, but LIMIT doesn't seem to work either.
Answer 0 (score: 0)
The LIMIT is applied only after the two select statements above have already run, so adding it to the outer query won't help. You may have to move the LIMIT inside bin_groups, though I'm not sure that still fits your use case.
--10M dataset
with predictions as
(
select id, probability
from table
)
-- give a row_number to each record and then create 1000 groups
, bin_groups as (
select
id,
ceiling(1000.0*ROW_NUMBER() over(order by probability desc) / (select count(distinct id) from predictions)) as bins
from predictions
limit 100
)
select *
from bin_groups
where bins = 1