I'm trying to do cohort analysis on a very large table. I have a test table of roughly 30M rows (production will be more than double that). The query fails in BigQuery with "Resources exceeded...", and it runs at billing tier 18 (tier 1 is $5, so this is a $90 query!).
The query:
with cohort_active_user_count as (
select
DATE(`BQ_TABLE`.created_at, '-05:00') as created_at,
count(distinct `BQ_TABLE`.bot_user_id) as count,
`BQ_TABLE`.bot_id as bot_id
from `BQ_TABLE`
group by created_at, bot_id
)
select created_at, period as period,
active_users, retained_users, retention, bot_id
from (
select
DATE(`BQ_TABLE`.created_at, '-05:00') as created_at,
DATE_DIFF(DATE(future_message.created_at, '-05:00'), DATE(`BQ_TABLE`.created_at, '-05:00'), DAY) as period,
max(cohort_size.count) as active_users, -- all equal in group
count(distinct future_message.bot_user_id) as retained_users,
count(distinct future_message.bot_user_id) / max(cohort_size.count) as retention,
`BQ_TABLE`.bot_id as bot_id
from `BQ_TABLE`
left join `BQ_TABLE` as future_message on
`BQ_TABLE`.bot_user_id = future_message.bot_user_id
and `BQ_TABLE`.created_at < future_message.created_at
and TIMESTAMP_ADD(`BQ_TABLE`.created_at, interval 720 HOUR) >= future_message.created_at
and `BQ_TABLE`.bot_id = future_message.bot_id
left join cohort_active_user_count as cohort_size on
DATE(`BQ_TABLE`.created_at, '-05:00') = cohort_size.created_at
and `BQ_TABLE`.bot_id = cohort_size.bot_id
group by 1, 2, bot_id) t
where period is not null
and bot_id = 80
order by created_at, period, bot_id
Here is the desired output:
From my understanding of BigQuery, joins incur a significant performance penalty because every BigQuery node has to process them. The table is partitioned by day, which this query doesn't use yet, but I know it will still need further optimization.
How can I optimize this query, or eliminate the joins, so that BigQuery can parallelize the work more efficiently?
Answer 0 (score: 2)
Step 1
Try the version below. I moved the JOIN against cohort_active_user_count outside the inner SELECT, as I think that join is one of the main reasons the query is so expensive. And, as you can see, it uses JOIN instead of LEFT JOIN, since LEFT is not needed here.
Please test it and let us know the result.
WITH cohort_active_user_count AS (
SELECT
DATE(BQ_TABLE.created_at, '-05:00') AS created_at,
COUNT(DISTINCT BQ_TABLE.bot_user_id) AS COUNT,
BQ_TABLE.bot_id AS bot_id
FROM BQ_TABLE
GROUP BY created_at, bot_id
)
SELECT t.created_at, period AS period,
cohort_size.count AS active_users, retained_users,
retained_users / cohort_size.count AS retention, t.bot_id
FROM (
SELECT
DATE(BQ_TABLE.created_at, '-05:00') AS created_at,
DATE_DIFF(DATE(future_message.created_at, '-05:00'), DATE(BQ_TABLE.created_at, '-05:00'), DAY) AS period,
COUNT(DISTINCT future_message.bot_user_id) AS retained_users,
BQ_TABLE.bot_id AS bot_id
FROM BQ_TABLE
LEFT JOIN BQ_TABLE AS future_message
ON BQ_TABLE.bot_user_id = future_message.bot_user_id
AND BQ_TABLE.created_at < future_message.created_at
AND TIMESTAMP_ADD(BQ_TABLE.created_at, interval 720 HOUR) >= future_message.created_at
AND BQ_TABLE.bot_id = future_message.bot_id
GROUP BY 1, 2, bot_id
HAVING period IS NOT NULL
) t
JOIN cohort_active_user_count AS cohort_size
ON t.created_at = cohort_size.created_at
AND t.bot_id = cohort_size.bot_id
WHERE t.bot_id = 80
ORDER BY created_at, period, bot_id
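To sanity-check that this restructured query still produces the expected retention numbers, here is a minimal sketch using Python's sqlite3 on hypothetical toy data (the table name `events` and the sample rows are made up; the timezone shift and the bot_id filter's value are kept, but DATE_DIFF/TIMESTAMP_ADD are replaced with their sqlite equivalents):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (bot_id INT, bot_user_id INT, created_at TEXT);
INSERT INTO events VALUES
  (80, 1, '2017-01-01'), (80, 2, '2017-01-01'),  -- cohort of 2 users on day 0
  (80, 1, '2017-01-02'),                          -- user 1 returns after 1 day
  (80, 2, '2017-01-03');                          -- user 2 returns after 2 days
""")

rows = conn.execute("""
WITH cohort_active_user_count AS (
  SELECT created_at, COUNT(DISTINCT bot_user_id) AS count, bot_id
  FROM events GROUP BY created_at, bot_id
)
SELECT t.created_at, period, cohort_size.count AS active_users,
       retained_users,
       1.0 * retained_users / cohort_size.count AS retention, t.bot_id
FROM (
  SELECT e.created_at,
         CAST(julianday(f.created_at) - julianday(e.created_at) AS INT) AS period,
         COUNT(DISTINCT f.bot_user_id) AS retained_users,
         e.bot_id
  FROM events e
  LEFT JOIN events f
    ON e.bot_user_id = f.bot_user_id
   AND e.created_at < f.created_at
   AND date(e.created_at, '+30 days') >= f.created_at
   AND e.bot_id = f.bot_id
  GROUP BY 1, 2, e.bot_id
  HAVING period IS NOT NULL
) t
JOIN cohort_active_user_count AS cohort_size
  ON t.created_at = cohort_size.created_at AND t.bot_id = cohort_size.bot_id
WHERE t.bot_id = 80
ORDER BY t.created_at, period
""").fetchall()

for r in rows:
    print(r)
```

Both returning users came from the 2-user '2017-01-01' cohort, so each period should show retention 0.5.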
Step 2
The further optimization below is based on the assumption that your BQ_TABLE holds raw data, with multiple entries for the same user_id/bot_id on the same day — which greatly inflates the cost of the LEFT JOIN in the inner SELECT.
I suggest aggregating that first, as shown below. Besides drastically reducing the size of the JOIN, it also eliminates all of the TIMESTAMP-to-DATE conversions on every joined row.
WITH BQ_TABLE_AGG AS (
SELECT bot_id, bot_user_id, DATE(BQ_TABLE.created_at, '-05:00') AS created_at
FROM BQ_TABLE
GROUP BY 1, 2, 3
),
cohort_active_user_count AS (
SELECT
created_at,
COUNT(DISTINCT bot_user_id) AS COUNT,
bot_id AS bot_id
FROM BQ_TABLE_AGG
GROUP BY created_at, bot_id
)
SELECT t.created_at, period AS period,
cohort_size.count AS active_users, retained_users,
retained_users / cohort_size.count AS retention, t.bot_id
FROM (
SELECT
BQ_TABLE_AGG.created_at AS created_at,
DATE_DIFF(future_message.created_at, BQ_TABLE_AGG.created_at, DAY) AS period,
COUNT(DISTINCT future_message.bot_user_id) AS retained_users,
BQ_TABLE_AGG.bot_id AS bot_id
FROM BQ_TABLE_AGG
LEFT JOIN BQ_TABLE_AGG AS future_message
ON BQ_TABLE_AGG.bot_user_id = future_message.bot_user_id
AND BQ_TABLE_AGG.created_at < future_message.created_at
AND DATE_ADD(BQ_TABLE_AGG.created_at, INTERVAL 30 DAY) >= future_message.created_at
AND BQ_TABLE_AGG.bot_id = future_message.bot_id
GROUP BY 1, 2, bot_id
HAVING period IS NOT NULL
) t
JOIN cohort_active_user_count AS cohort_size
ON t.created_at = cohort_size.created_at
AND t.bot_id = cohort_size.bot_id
WHERE t.bot_id = 80
ORDER BY created_at, period, bot_id
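The payoff of this pre-aggregation can be illustrated in plain Python with made-up numbers: collapsing to one row per (bot_id, bot_user_id, day) shrinks both sides of the self-join, and the number of joined pairs grows with the product of the duplicate counts.

```python
from datetime import date, timedelta

# Hypothetical raw data: user 1 sends 5 messages on day 1 and 4 on day 2,
# user 2 sends 3 messages on day 1 (tuples of bot_id, bot_user_id, day).
raw = ([(80, 1, date(2017, 1, 1))] * 5
       + [(80, 1, date(2017, 1, 2))] * 4
       + [(80, 2, date(2017, 1, 1))] * 3)

# BQ_TABLE_AGG equivalent: GROUP BY bot_id, bot_user_id, created_at.
agg = sorted(set(raw))

def self_join_pairs(rows):
    """Count pairs the per-user self-join would produce (same bot and user,
    strictly later day, within the 30-day window)."""
    return sum(1 for a in rows for b in rows
               if a[:2] == b[:2] and a[2] < b[2] <= a[2] + timedelta(days=30))

print(len(raw), len(agg))                         # rows before/after dedup
print(self_join_pairs(raw), self_join_pairs(agg))  # join size before/after
```

Here 12 raw rows collapse to 3, and the self-join drops from 20 pairs to 1, while COUNT(DISTINCT bot_user_id) over the pairs is unchanged.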
Answer 1 (score: 0)
If you don't want to enable a higher billing tier given the costs, here are a couple of suggestions that might help to reduce the CPU requirements:
Use INNER JOINs rather than LEFT JOINs if you can. INNER JOINs should generally be less CPU-intensive, but then again you won't get unmatched rows like you would with LEFT JOINs.
Use APPROX_COUNT_DISTINCT(expr) instead of COUNT(DISTINCT expr). You won't get an exact count, but it's less CPU-intensive and may be "good enough" depending on your needs.
You could also consider manually breaking the query into stages of computation, e.g. write the WITH clause statement to a table, then use that in the subsequent query. I don't know what the specific cost tradeoffs would be, though.
Answer 2 (score: 0)
Why is this tagged MySQL?
In MySQL, I would change
max(cohort_size.count) as active_users, -- all equal in group
to
( SELECT max(count) FROM cohort_active_user_count WHERE ... ) as active_users,
and remove the JOIN against that table. Without doing so, you risk inflating the COUNT(...) values!
Also move the division that derives retention into the outer query.
Once that's done, you can also move the other JOIN into a subquery:
( SELECT count(distinct future_message.bot_user_id)
  FROM ... WHERE ... ) as retained_users,
I would add these indexes. Note that created_at must come last:
cohort_active_user_count: INDEX(bot_id, created_at)
future_message: INDEX(bot_id, bot_user_id, created_at)
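Why created_at must come last: the equality columns (bot_id, bot_user_id) lead the index so the range condition on created_at can scan one contiguous slice of it. A sketch using sqlite3 rather than MySQL (the index name `idx_msg` and the sample predicate are made up), checking the plan actually uses the index:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE future_message (bot_id INT, bot_user_id INT, created_at TEXT)"
)
# Equality columns first, the range column (created_at) last, as the answer says.
conn.execute(
    "CREATE INDEX idx_msg ON future_message (bot_id, bot_user_id, created_at)"
)

# Two equality predicates plus a range on the trailing index column:
# the planner can satisfy all three from idx_msg alone.
plan = conn.execute("""
EXPLAIN QUERY PLAN
SELECT COUNT(DISTINCT bot_user_id) FROM future_message
WHERE bot_id = 80 AND bot_user_id = 1 AND created_at > '2017-01-01'
""").fetchall()
print(plan)
```

If created_at were first instead, only the range predicate could use the index and the equality filters would be applied row by row.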