BigQuery partitioned table

Time: 2018-03-12 18:14:56

Tags: google-bigquery

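I have a query that uses an analytic function over a day-partitioned table. I expected it to read only the data in the partitions selected by the WHERE clause, but it reads every partition in the table.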

WITH query AS (
SELECT
  * EXCEPT(rank)
FROM (
  SELECT
    *,
    RANK() OVER (PARTITION BY day order by num_mean_temp_samples) AS rank
  FROM (
    SELECT
      FORMAT_DATE("%Y%m%d", _PARTITIONDATE) AS day,
      *
    FROM
      `mydataset.gsod_partitioned` ) q_nested
  ) q
WHERE
  rank < 1000
)
SELECT
  num_mean_temp_samples ,
  count(1) as samples
FROM query
 WHERE
   day in ( '20100101', '20100103')
GROUP BY  1 ORDER BY 1

I verified that partition pruning works when the analytic function is removed:

WITH query AS (
SELECT
  FORMAT_DATE("%Y%m%d", _PARTITIONDATE) AS day,
  *
FROM
  `mydataset.gsod_partitioned`
)

Pruning also works when the nested select is split with a UNION ALL:

WITH query AS (
SELECT
  * EXCEPT(rank)
FROM (
  SELECT
    *,
    RANK() OVER (PARTITION BY day order by num_mean_temp_samples) AS rank
  FROM (
    SELECT
      FORMAT_DATE("%Y%m%d", _PARTITIONDATE) AS day,
      *
    FROM
      `mydataset.gsod_partitioned` WHERE _PARTITIONDATE < "1970-01-01" ) q_nested1
  UNION ALL SELECT
    *,
    RANK() OVER (PARTITION BY day order by num_mean_temp_samples) AS rank
  FROM (
    SELECT
      FORMAT_DATE("%Y%m%d", _PARTITIONDATE) AS day,
      *
    FROM
      `mydataset.gsod_partitioned` WHERE _PARTITIONDATE >= "1970-01-01" ) q_nested2
  ) q
WHERE
  rank < 1000
)

The table mydataset.gsod_partitioned is based on the public gsod dataset. For example, the partition for day = 20100101 was created as follows:

bq query --destination_table 'private.gsod_partitioned$20100101' --time_partitioning_type=DAY --use_legacy_sql=false \
'SELECT station_number, mean_temp, num_mean_temp_samples FROM `bigquery-public-data.samples.gsod` where year=2010 and month=01 and day=01'

Is there a way to enable partition pruning for the analytic function without adding an extra union to the query?

1 Answer:

Answer 0: (score: 1)

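Regarding _PARTITIONDATE: it was an undocumented feature at the time of writing, and using _PARTITIONTIME is recommended instead. See this related question, which was answered by a Google employee: Use of the _PARTITIONDATE vs. the _PARTITIONTIME pseudo-columns in BigQuery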
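As a small illustration of how the two pseudo-columns relate, the sketch below filters on _PARTITIONTIME and derives the calendar day from it with DATE(). It assumes the question's table `mydataset.gsod_partitioned` is an ingestion-time, day-partitioned table; the aliases `pt` and `partition_day` are made up for the example.

```sql
-- Sketch: filter on _PARTITIONTIME (a TIMESTAMP) and derive the day from it.
SELECT
  _PARTITIONTIME AS pt,                    -- the pseudo-column is not part of SELECT *, so it is aliased explicitly
  DATE(_PARTITIONTIME) AS partition_day,   -- same day that _PARTITIONDATE would report
  station_number,
  num_mean_temp_samples
FROM
  `mydataset.gsod_partitioned`
WHERE
  _PARTITIONTIME = TIMESTAMP("2010-01-01")
```

Because the predicate sits directly on the pseudo-column next to the table reference, this form should scan only the 2010-01-01 partition.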

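Regarding partition pruning with analytic functions: last year Google added support for filter pushdown, but only for _PARTITIONTIME, and it must be among the fields covered by the PARTITION BY clause.

It should look like this: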

WITH query AS (
SELECT
  * EXCEPT(rank)
FROM (
  SELECT
    *,
    RANK() OVER (PARTITION BY _pt order by num_mean_temp_samples) AS rank
  FROM (
    SELECT
      FORMAT_TIMESTAMP("%Y%m%d", _PARTITIONTIME) AS day,
      _PARTITIONTIME as _pt,
      *
    FROM
      `mydataset.gsod_partitioned` ) q_nested
  ) q
WHERE
  rank < 1000
)
SELECT
  num_mean_temp_samples ,
  count(1) as samples
FROM query
 WHERE
   day in ( '20100101', '20100103')
GROUP BY  1 ORDER BY 1
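If filtering the outer query on the formatted `day` string still scans every partition, one variation that may be worth trying is to filter on `_pt` itself. The sketch below does that. It rests on the assumption that a predicate on the raw partition timestamp is easier for the engine to push through the analytic function than one on a derived string; the answer does not confirm this, so compare the bytes processed by both versions before relying on it.

```sql
WITH query AS (
SELECT
  * EXCEPT(rank)
FROM (
  SELECT
    *,
    RANK() OVER (PARTITION BY _pt ORDER BY num_mean_temp_samples) AS rank
  FROM (
    SELECT
      _PARTITIONTIME AS _pt,
      *
    FROM
      `mydataset.gsod_partitioned` ) q_nested
  ) q
WHERE
  rank < 1000
)
SELECT
  num_mean_temp_samples,
  COUNT(1) AS samples
FROM query
WHERE
  -- Assumption: filter on the partition timestamp directly, not on a formatted string.
  _pt IN (TIMESTAMP("2010-01-01"), TIMESTAMP("2010-01-03"))
GROUP BY 1
ORDER BY 1
```

A fallback that does not depend on pushdown at all is to place the _PARTITIONTIME predicate in the innermost subquery, next to the table reference, at the cost of repeating the filter values inside the WITH clause.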