Using Google BigQuery, I am querying a clustered table by applying a filter on the clustering field projectId, as shown below:
WITH userProjects AS (
SELECT
projectsArray
FROM
projectsPerUser
WHERE
userId = "eben@somewhere.com"
)
SELECT
userProperty
FROM
`mydata.mydataset.mytable`
WHERE
--projectId IN UNNEST((SELECT projectsArray FROM userProjects))
projectId IN ("mydata", "anotherproject")
AND _PARTITIONTIME >= "2019-03-20"
In the snippet above, clustering is applied correctly, but when I use the commented-out line --projectId IN UNNEST((SELECT projectsArray FROM userProjects)) instead, clustering is not applied.
I also tried wrapping it in a UDF like this, which does not work either:
CREATE TEMP FUNCTION storedValue(item ARRAY<STRING>) AS (
item
);
...
WHERE projectId IN UNNEST(storedValue((SELECT projectsListArray FROM projectsList)))
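As I understand it, a subselect query takes a different execution path than filtering directly on a scalar or an array.
I am hoping there is a solution where I can programmatically supply the array to filter on and still get the cost benefits that clustered tables provide.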
Summary:
WHERE projectId IN ("mydata", "anotherproject") [OK]
WHERE projectId IN UNNEST((SELECT projectsArray FROM userProjects)) [NOT OK]
WHERE projectId IN UNNEST(storedValue((SELECT projectsListArray FROM projectsList))) [NOT OK]
Any ideas?
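Answer 0 (score: 1)
My suggestion is to rewrite the query so that the nested SELECT is a temporary table (which you have already done), and then do the filtering with an INNER JOIN instead of a set-membership test, so the query becomes something like this: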
WITH userProjects AS (
SELECT
projectsArray
FROM
projectsPerUser
WHERE
userId = "eben@somewhere.com"
)
SELECT
userProperty
FROM
`mydata.mydataset.mytable` as a
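JOIN
-- projectsArray is an ARRAY, so flatten it to one project ID per row before the equality join
(SELECT projectId FROM userProjects, UNNEST(projectsArray) AS projectId) as b
ON a.projectId = b.projectId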
WHERE
_PARTITIONTIME >= "2019-03-20"
I believe that, if the field is clustered, this will keep the query from scanning the whole partition.
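If the join still ends up scanning the whole partition, another thing that may be worth trying is BigQuery scripting: resolve the array into a script variable first, then filter on that variable so the main query no longer contains a subquery. This is only a sketch that reuses the table and column names from the question. I have not measured it against these tables, and it assumes the lookup returns exactly one row for the user.

```sql
-- Step 1: resolve the user's projects into a script variable.
-- Assumption: projectsPerUser has exactly one row per userId;
-- the scalar subquery fails if it returns more than one row.
DECLARE projects ARRAY<STRING> DEFAULT (
  SELECT projectsArray
  FROM projectsPerUser
  WHERE userId = "eben@somewhere.com"
);

-- Step 2: filter on the variable instead of on a subquery.
SELECT
  userProperty
FROM
  `mydata.mydataset.mytable`
WHERE
  projectId IN UNNEST(projects)
  AND _PARTITIONTIME >= "2019-03-20";
```

Both statements have to be submitted together as one script. Comparing the bytes processed against the hard-coded IN list from the question should show whether clustering was applied.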
Answer 1 (score: 1)
FWIW, clustering works fine for me with a dynamic filter:
SELECT title, SUM(views) views
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE DATE(TIMESTAMP_TRUNC(datehour, DAY)) = '2019-01-01'
AND wiki='en'
AND title IN ('Dogfight_(disambiguation)','Dogfight','Dogfight_(film)')
GROUP BY 1
1.8 sec elapsed, 364 MB processed
If I do this instead:
AND title IN (
SELECT DISTINCT prev
FROM `fh-bigquery.wikipedia_vt.clickstream_materialized`
WHERE date='2019-01-01' AND prev LIKE 'Dogfight%'
ORDER BY 1 LIMIT 3)
2.9 sec elapsed, 513.8 MB processed
If I go to v2 (not clustered) instead of v3:
FROM `fh-bigquery.wikipedia_v2.pageviews_2019`
2.6 sec elapsed, 9.6 GB processed
I am not sure what is happening with your tables, but it might be interesting to revisit.
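When revisiting, it may be worth first checking that projectId really is the leading clustering column, since the order of clustering columns matters for pruning. A minimal sketch, assuming the table from the question lives in dataset mydataset of project mydata, using the INFORMATION_SCHEMA.COLUMNS view:

```sql
-- List the partitioning and clustering columns of the table.
-- clustering_ordinal_position is NULL for non-clustered columns
-- and starts at 1 for the first clustering column.
SELECT
  column_name,
  is_partitioning_column,
  clustering_ordinal_position
FROM
  mydata.mydataset.INFORMATION_SCHEMA.COLUMNS
WHERE
  table_name = "mytable"
  AND clustering_ordinal_position IS NOT NULL
ORDER BY
  clustering_ordinal_position;
```

If projectId does not show up with position 1, a filter on projectId alone may not prune much, regardless of how the filter values are supplied.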