BigQuery: filter on _PARTITIONTIME does not propagate through a LEFT JOIN

Time: 2019-05-31 10:17:49

Tags: sql google-bigquery

I have 2 partitioned tables:

Table 1:


| user_id | request_id |


Table 2:


| ip | user_id | request_id |


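I want to get all IPs from partition_table2, with:
- the number of users (from partition_table1)
- the users' requests (from partition_table1)
- the users' requests (from partition_table2) for those users (from partition_table1)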

Info: The IP is tied to the request_id in Table 1, because a single user_id can have multiple IPs.

Problem: When I filter by _PARTITIONTIME in the main query, the filter is not propagated into the WITH query if I use a LEFT JOIN, but with an INNER JOIN the _PARTITIONTIME filter is applied.

Partition pruning does not seem to work for LEFT JOIN: https://cloud.google.com/bigquery/docs/querying-partitioned-tables

My query:

WITH
  users_info AS (
  SELECT
    t2.ip,
    t1.user_id,
    COUNT(DISTINCT t1.request_id) AS user_requests,
    t1._PARTITIONTIME AS date
  FROM partitioned_table1 t1
  INNER JOIN partition_table2 t2
    ON t1.request_id = t2.request_id
    AND t1._PARTITIONTIME = t2._PARTITIONTIME
  GROUP BY t2.ip, t1.user_id, t1._PARTITIONTIME
  )
SELECT
  t2.ip,
  COUNT(DISTINCT m.user_id) AS users,
  COUNT(DISTINCT t2.request_id) AS t2_users_requests,
  SUM(m.user_requests) AS t1_users_requests
FROM partition_table2 t2
LEFT JOIN users_info m  -- or INNER JOIN (the two variants compared below)
  ON t2.ip=m.ip
  AND t2.user_id=m.user_id
  AND m.date = t2._PARTITIONTIME
WHERE DATE(t2._PARTITIONTIME) = "2019-05-20" 
GROUP BY t2.ip

With an INNER JOIN this query processes ~4 GB, but with a LEFT JOIN it processes ~3 TB.

Am I doing something wrong, or is this behavior expected?


Edit

I need this query to create a VIEW. The condition from the query above (DATE(t2._PARTITIONTIME) = "2019-05-20") is the one I will use to filter the VIEW at query time.

1 Answer:

Answer 0: (score: 0)

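Columns from the right side of a LEFT OUTER JOIN can be NULL, so yes, BigQuery actually needs to execute the join to work out the result rather than prefiltering the partitions. If you don't want that, use a subquery to filter on _PARTITIONTIME before the join.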
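
As a rough sketch of that suggestion, the question's query could be rewritten so the partition filter sits inside the WITH subquery as well as in the outer query. The table names, column names and the date literal are copied from the question. The choice to filter both t1 and t2 explicitly inside the subquery is an assumption about what is needed to make the pruning explicit, and the resulting bytes processed have not been measured here.

```sql
WITH
  users_info AS (
  SELECT
    t2.ip,
    t1.user_id,
    COUNT(DISTINCT t1.request_id) AS user_requests,
    t1._PARTITIONTIME AS date
  FROM partitioned_table1 t1
  INNER JOIN partition_table2 t2
    ON t1.request_id = t2.request_id
    AND t1._PARTITIONTIME = t2._PARTITIONTIME
  -- Filter both tables here, before the outer join, so that
  -- neither scan relies on the outer WHERE clause being pushed down.
  WHERE DATE(t1._PARTITIONTIME) = "2019-05-20"
    AND DATE(t2._PARTITIONTIME) = "2019-05-20"
  GROUP BY t2.ip, t1.user_id, t1._PARTITIONTIME
  )
SELECT
  t2.ip,
  COUNT(DISTINCT m.user_id) AS users,
  COUNT(DISTINCT t2.request_id) AS t2_users_requests,
  SUM(m.user_requests) AS t1_users_requests
FROM partition_table2 t2
LEFT JOIN users_info m
  ON t2.ip = m.ip
  AND t2.user_id = m.user_id
  AND m.date = t2._PARTITIONTIME
-- The filter on the left side of the join stays in the outer query.
WHERE DATE(t2._PARTITIONTIME) = "2019-05-20"
GROUP BY t2.ip
```

Note that the date is hard-coded in two places in this sketch, so it does not by itself cover the VIEW case from the edit, where the date is only supplied when the view is queried.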