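I was looking into joins vs. subqueries in terms of resources and performance, and the answer seems to depend on the platform. But there doesn't seem to be any discussion of this for BigQuery.
When I expanded my query's scope to include 100 GB, I got: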
Query Failed
Error: Resources exceeded during query execution.
My query is roughly:
#standardSQL
SELECT * FROM table t1 WHERE
(t1.a in (SELECT b FROM anothertable WHERE class='value')
OR t1.a in (SELECT c FROM table2) )
I'd like to know whether a JOIN would work better in BigQuery if I scale up to terabytes of data.
Answer 0 (score: 3)
Note the difference between this query and the next one:
1)
#standardSQL
SELECT COUNTIF(author IN (
SELECT author
FROM `fh-bigquery.reddit_comments.2017_01`
))
FROM `fh-bigquery.reddit_comments.2017_01`
2)
#standardSQL
SELECT COUNTIF(author IN (
SELECT DISTINCT author
FROM `fh-bigquery.reddit_comments.2017_01`
))
FROM `fh-bigquery.reddit_comments.2017_01`
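This is a silly query: both should return 157893170. Nevertheless, 1) ran for 8 minutes (so far), while 2) ran in 36 seconds.
The secret? When using IN(), make sure to remove duplicates with DISTINCT. If you don't, the underlying JOIN carries many rows that don't change the result at all.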
// TODO(gcp): This could be a BigQuery optimization.
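Applied to the query in the question, the same idea might look like the sketch below. It only adds DISTINCT inside each IN() subquery; the table and column names (`table`, anothertable, table2, a, b, c, class) are the placeholders from the question, and whether this is enough to avoid the "Resources exceeded" error on the real data is an assumption that would need to be tested.

```sql
#standardSQL
-- Sketch only: all table and column names are the question's placeholders.
SELECT t1.*
FROM `table` t1
WHERE
  -- Deduplicate each key set before it is matched against t1.a.
  t1.a IN (SELECT DISTINCT b FROM anothertable WHERE class = 'value')
  OR t1.a IN (SELECT DISTINCT c FROM table2)
```

The result is the same as without DISTINCT, since IN() only checks membership; the difference is how many rows the subquery contributes to the match.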
Answer 1 (score: 0)
I wonder, have you tried Elliott's suggestion of using EXISTS?
Something like:
WITH table1 AS(
SELECT '1' as user, 1 AS id UNION ALL
SELECT '2' AS user, 2 as id UNION ALL
SELECT '3' AS user, 3 as id
),
anothertable AS(
SELECT '1' AS user, 'value' AS class , '4' AS c UNION ALL
SELECT '2' AS user, 'value2' AS class, '2' AS c UNION ALL
SELECT '4' AS user, 'value' AS class, '3' AS c UNION ALL
SELECT '5' AS user, 'value2' AS class, '5' as c
),
table2 AS(
SELECT '4' AS c UNION ALL
SELECT '2' AS c UNION ALL
SELECT '3' AS c UNION ALL
SELECT '5' as c
)
SELECT
t1.*
FROM table1 t1
WHERE
  -- Keep the row if either correlated subquery finds a match.
  EXISTS(SELECT 1 FROM anothertable ta WHERE class = 'value' AND t1.user = ta.user)
  OR EXISTS(SELECT 1 FROM table2 t2 WHERE t1.user = t2.c)
Does that also exceed resources?
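For comparison, since the question asks about JOIN, here is a minimal sketch of a JOIN-based form over the same sample data. It first builds one deduplicated key set with UNION DISTINCT and then joins against it, so each row of table1 can match at most once. The CTE name match_keys and the column alias k are made up for illustration, and no claim is made here about how it performs on large tables.

```sql
#standardSQL
WITH table1 AS (
  SELECT '1' AS user, 1 AS id UNION ALL
  SELECT '2' AS user, 2 AS id UNION ALL
  SELECT '3' AS user, 3 AS id
),
anothertable AS (
  SELECT '1' AS user, 'value' AS class, '4' AS c UNION ALL
  SELECT '2' AS user, 'value2' AS class, '2' AS c UNION ALL
  SELECT '4' AS user, 'value' AS class, '3' AS c UNION ALL
  SELECT '5' AS user, 'value2' AS class, '5' AS c
),
table2 AS (
  SELECT '4' AS c UNION ALL
  SELECT '2' AS c UNION ALL
  SELECT '3' AS c UNION ALL
  SELECT '5' AS c
),
-- Hypothetical helper CTE: every key that satisfies either condition,
-- deduplicated so the join below cannot multiply rows of table1.
match_keys AS (
  SELECT user AS k FROM anothertable WHERE class = 'value'
  UNION DISTINCT
  SELECT c AS k FROM table2
)
SELECT t1.*
FROM table1 t1
JOIN match_keys mk
  ON t1.user = mk.k
```

On this sample data it keeps the same three rows of table1 as the EXISTS version, because users '1', '2' and '3' each appear in the combined key set.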