Cluster filtering on a field using a sub-select query

Asked: 2019-04-04 09:27:18

Tags: google-bigquery

Using Google BigQuery, I am querying a clustered table with a filter on the clustering field projectId, as shown below:

WITH userProjects AS (

    SELECT 
        projectsArray 
    FROM 
        projectsPerUser 
    WHERE 
        userId = "eben@somewhere.com"
)

SELECT 
    userProperty
FROM 
    `mydata.mydataset.mytable`
WHERE 
    --projectId IN UNNEST((SELECT projectsArray FROM userProjects))
    projectId IN ("mydata", "anotherproject")
    AND _PARTITIONTIME >= "2019-03-20"

In the snippet above, clustering is applied correctly. But when I use the commented-out line --projectId IN UNNEST((SELECT projectsArray FROM userProjects)) instead, clustering is not applied.

I also tried wrapping it in a UDF like this, which did not work either:

CREATE TEMP FUNCTION storedValue(item ARRAY<STRING>) AS (
  item
);

...

WHERE projectId IN UNNEST(storedValue((SELECT projectsListArray FROM projectsList)))

As I understand it, a sub-select query takes a different execution path than filtering directly on a scalar or an array literal.

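I am hoping there is a solution where I can supply the array to filter on programmatically and still get the cost benefit that clustered tables provide.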
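One shape such a programmatically supplied array could take is a script variable that is resolved before the main query runs. This is only a sketch, assuming BigQuery scripting (DECLARE) is available in the project; the variable name userProjectIds is hypothetical, and whether block pruning actually kicks in should be confirmed by comparing the bytes billed.

```sql
-- Hypothetical sketch: resolve the project list into a variable first.
-- Assumes projectsPerUser holds exactly one row per userId; the scalar
-- subquery fails if it returns more than one row.
DECLARE userProjectIds ARRAY<STRING> DEFAULT (
    SELECT projectsArray
    FROM projectsPerUser
    WHERE userId = "eben@somewhere.com"
);

-- The main query filters on the variable rather than on a subquery.
SELECT
    userProperty
FROM
    `mydata.mydataset.mytable`
WHERE
    projectId IN UNNEST(userProjectIds)
    AND _PARTITIONTIME >= "2019-03-20";
```

The idea is that the filter values are already known when the second statement is planned, which is closer to case 1 (literal values) than to cases 2 and 3.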

Summary:

  1. WHERE projectId IN ("mydata", "anotherproject") [OK]
  2. WHERE projectId IN UNNEST((SELECT projectsArray FROM userProjects)) [not OK]
  3. WHERE projectId IN UNNEST(storedValue((SELECT projectsListArray FROM projectsList))) [not OK]

Any ideas?

2 answers:

Answer 0 (score: 1)

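My suggestion is to rewrite the query so that the nested SELECT becomes a temporary table (which you have already done), and then do the filtering with an INNER JOIN rather than a set-membership test, so the query becomes something like this: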

WITH userProjects AS (

    -- Flatten the array so each row holds one project ID to join on
    SELECT 
        project 
    FROM 
        projectsPerUser,
        UNNEST(projectsArray) AS project
    WHERE 
        userId = "eben@somewhere.com"
)

SELECT 
    userProperty
FROM 
    `mydata.mydataset.mytable` AS a
    JOIN
    userProjects AS b
    ON a.projectId = b.project
WHERE 
    _PARTITIONTIME >= "2019-03-20"

I believe that, if the field is clustered, this will let the query avoid scanning the entire partition.

Answer 1 (score: 1)

FWIW, clustering works well for me with a dynamic filter. First, with literal values:

SELECT title, SUM(views) views
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE DATE(TIMESTAMP_TRUNC(datehour, DAY)) = '2019-01-01'
AND wiki='en'
AND title IN ('Dogfight_(disambiguation)','Dogfight','Dogfight_(film)')
GROUP BY 1

1.8 sec elapsed, 364 MB processed

If I replace the literal list with a subquery:

AND title IN (
  SELECT DISTINCT prev 
  FROM `fh-bigquery.wikipedia_vt.clickstream_materialized` 
  WHERE date='2019-01-01' AND prev LIKE 'Dogfight%'
  ORDER BY 1  LIMIT 3)

2.9 sec elapsed, 513.8 MB processed

If I switch to v2 (not clustered) instead of v3:

FROM `fh-bigquery.wikipedia_v2.pageviews_2019`

2.6 sec elapsed, 9.6 GB processed

I'm not sure what is happening with your table, but it might be interesting to revisit.
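To make the same kind of comparison on your own table, the bytes processed by recent runs can be read back from the jobs metadata view. A minimal sketch, assuming the jobs ran in the US multi-region (the `region-us` qualifier is an assumption, adjust it to your location) and that you have permission to list your own jobs:

```sql
-- List my most recent finished queries with the bytes they scanned,
-- so the literal, subquery, and JOIN variants can be compared side by side.
SELECT
    job_id,
    creation_time,
    total_bytes_processed,
    total_bytes_billed
FROM
    `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
WHERE
    creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
    AND job_type = "QUERY"
    AND state = "DONE"
ORDER BY
    creation_time DESC
LIMIT 10;
```

Runs served from the query cache report zero bytes, so each variant should be run with cached results disabled for the numbers to be comparable.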