Partition by week/year/month to get around the partition limit?

Asked: 2019-05-14 07:26:29

Tags: google-bigquery database-partitioning

I have 32 years of data that I want to put into a partitioned table. But BigQuery says I'm over the limit (4,000 partitions).

For a query like this:

CREATE TABLE `deleting.day_partition`
PARTITION BY FlightDate 
AS 
SELECT *
FROM `flights.original` 

I get an error like:

Too many partitions produced by query, allowed 2000, query produces at least 11384 partitions

How can I get around this limit?

2 Answers:

Answer 0 (score: 3)

You can partition by week/month/year instead of partitioning by day.

In my case, each year holds around 3 GB of data, so if I partition by year I get the most benefit out of clustering.

For this, I'll create a year DATE column and partition the table by it:

CREATE TABLE `fh-bigquery.flights.ontime_201903`
PARTITION BY FlightDate_year
CLUSTER BY Origin, Dest 
AS
SELECT *, DATE_TRUNC(FlightDate, YEAR) FlightDate_year
FROM `fh-bigquery.flights.raw_load_fixed`

Note that in the process I created the extra column DATE_TRUNC(FlightDate, YEAR) AS FlightDate_year.
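The same pattern works at a finer granularity if yearly partitions are too coarse for your data: only the DATE_TRUNC granularity changes. A sketch, assuming the same source table (the destination table name `ontime_by_month` is made up for illustration):

```sql
-- Hypothetical monthly variant: 32 years * 12 months ≈ 384 partitions,
-- comfortably under the 4000-partition limit.
CREATE TABLE `fh-bigquery.flights.ontime_by_month`
PARTITION BY FlightDate_month
CLUSTER BY Origin, Dest
AS
SELECT *, DATE_TRUNC(FlightDate, MONTH) AS FlightDate_month
FROM `fh-bigquery.flights.raw_load_fixed`
```

DATE_TRUNC(..., WEEK) would work the same way (32 years × 52 weeks ≈ 1,670 partitions), if you need finer pruning than monthly.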

Table stats: (screenshot not reproduced)

Since the table is clustered, I'll get the benefits of partitioning even if I don't use the partitioning column (the year) as a filter:

SELECT *
FROM `fh-bigquery.flights.ontime_201903`
WHERE FlightDate BETWEEN '2008-01-01' AND '2008-01-10'

Predicted cost: 83.4 GB
Actual cost: 3.2 GB
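If the query also filters on the partition column, BigQuery can prune whole partitions on top of the cluster pruning. A sketch of the same query with an explicit year filter (the literal is the truncated year date that DATE_TRUNC produced):

```sql
SELECT *
FROM `fh-bigquery.flights.ontime_201903`
WHERE FlightDate_year = DATE '2008-01-01'              -- partition pruning
  AND FlightDate BETWEEN '2008-01-01' AND '2008-01-10' -- cluster pruning
```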

Answer 1 (score: 0)

Another example: I created a NOAA GSOD summary table clustered by station name. Instead of partitioning it by day, I didn't partition it at all.
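Building such a clustered-but-unpartitioned table could look like the sketch below. The column list and source query are assumptions reconstructed from the comparison query further down (which reads `year`, `mo`, `da`, `temp`, `name`, and `state` from the public NOAA GSOD tables); only the CLUSTER BY column matters for the technique:

```sql
-- Hypothetical build of the summary table: clustered by station name,
-- with no PARTITION BY clause at all.
CREATE TABLE `fh-bigquery.weather_gsod.all`
CLUSTER BY name
AS
SELECT b.name, b.state, a.temp,
       PARSE_DATE('%Y%m%d', CONCAT(a.year, a.mo, a.da)) AS date
FROM `bigquery-public-data.noaa_gsod.gsod*` a
JOIN `bigquery-public-data.noaa_gsod.stations` b
ON a.wban = b.wban AND a.stn = b.usaf
```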

Say I want to find the hottest days since 1980 for all stations with a name like 'SAN FRANC%':

SELECT name, state, ARRAY_AGG(STRUCT(date,temp) ORDER BY temp DESC LIMIT 5) top_hot, MAX(date) active_until
FROM `fh-bigquery.weather_gsod.all` 
WHERE name LIKE 'SAN FRANC%'
AND date > '1980-01-01'
GROUP BY 1,2
ORDER BY active_until DESC


Note that I got the results after processing only 55.2 MB of data.

An equivalent query over the source tables, which aren't clustered, processes 4 GB instead:

# query on non-clustered tables - too much data compared to the other one
SELECT name, state, ARRAY_AGG(STRUCT(CONCAT(a.year,a.mo,a.da),temp) ORDER BY temp DESC LIMIT 5) top_hot, MAX(CONCAT(a.year,a.mo,a.da)) active_until
FROM `bigquery-public-data.noaa_gsod.gsod*` a
JOIN `bigquery-public-data.noaa_gsod.stations`  b
ON a.wban=b.wban AND a.stn=b.usaf
WHERE name LIKE 'SAN FRANC%'
AND _table_suffix >= '1980'
GROUP BY 1,2
ORDER BY active_until DESC

I also added a geo-clustered table, to search by location instead of station name. See the details here: https://stackoverflow.com/a/34804655/132438