Optimizing cohort analysis on Google BigQuery

Time: 2016-10-20 15:13:20

Tags: mysql sql postgresql google-bigquery bigdata

I'm trying to run a cohort analysis on a very large table. I have a test table with about 30M rows (more than double that in production). The query fails in BigQuery claiming "resources exceeded...", and it ends up as a tier 18 query (tier 1 is $5, so this is a $90 query!)

The query:

with cohort_active_user_count as (
  select 
    DATE(`BQ_TABLE`.created_at, '-05:00') as created_at,
    count(distinct `BQ_TABLE`.bot_user_id) as count,
    `BQ_TABLE`.bot_id as bot_id
  from `BQ_TABLE`
  group by created_at, bot_id
)

select created_at, period as period,
  active_users, retained_users, retention, bot_id
from (
  select 
    DATE(`BQ_TABLE`.created_at, '-05:00') as created_at,
    DATE_DIFF(DATE(future_message.created_at, '-05:00'), DATE(`BQ_TABLE`.created_at, '-05:00'), DAY) as period,
    max(cohort_size.count) as active_users, -- all equal in group
    count(distinct future_message.bot_user_id) as retained_users,
    count(distinct future_message.bot_user_id) / max(cohort_size.count) as retention,
    `BQ_TABLE`.bot_id as bot_id
  from `BQ_TABLE`
  left join `BQ_TABLE` as future_message on
    `BQ_TABLE`.bot_user_id = future_message.bot_user_id
    and `BQ_TABLE`.created_at < future_message.created_at
    and TIMESTAMP_ADD(`BQ_TABLE`.created_at, interval 720 HOUR) >= future_message.created_at
    and `BQ_TABLE`.bot_id = future_message.bot_id 
  left join cohort_active_user_count as cohort_size on 
    DATE(`BQ_TABLE`.created_at, '-05:00') = cohort_size.created_at
    and `BQ_TABLE`.bot_id = cohort_size.bot_id 
  group by 1, 2, bot_id) t
where period is not null
and bot_id = 80
order by created_at, period, bot_id

Here is the desired output:

[image of the desired output table omitted]

From my understanding of BigQuery, joins carry a significant performance penalty, since every BigQuery node needs to process them. The table is partitioned by day, which I haven't taken advantage of in this query yet, but I know the query still needs optimizing.
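For reference, a minimal sketch of what partition pruning could look like here, assuming the table is ingestion-time day-partitioned (_PARTITIONTIME is the pseudo-column such tables expose; the date window below is a placeholder, not part of the real query):

-- Sketch only: limit the scan to a known date window so BigQuery prunes
-- partitions instead of reading the whole table. _PARTITIONTIME is the
-- pseudo-column of ingestion-time day-partitioned tables; the window below
-- is a placeholder.
select
  DATE(created_at, '-05:00') as created_at,
  count(distinct bot_user_id) as count,
  bot_id
from `BQ_TABLE`
where _PARTITIONTIME >= TIMESTAMP('2016-09-01')
  and _PARTITIONTIME < TIMESTAMP('2016-10-01')
group by created_at, bot_id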

How can this query be optimized, or the joins eliminated, so that BigQuery can parallelize the work more efficiently?

3 Answers:

Answer 0 (score: 2)


Step 1

Try the version below.
I moved the JOIN on cohort_active_user_count outside of the inner SELECT, as I think that is one of the main reasons the query is so expensive. And as you can see, it uses JOIN instead of LEFT JOIN, because LEFT is not needed here.

Please test it and let us know the result.

WITH cohort_active_user_count AS (
  SELECT 
    DATE(BQ_TABLE.created_at, '-05:00') AS created_at,
    COUNT(DISTINCT BQ_TABLE.bot_user_id) AS count,
    BQ_TABLE.bot_id AS bot_id
  FROM BQ_TABLE
  GROUP BY created_at, bot_id
)
SELECT t.created_at, period AS period,
  cohort_size.count AS active_users, retained_users, 
  retained_users / cohort_size.count AS retention, t.bot_id
FROM (
  SELECT 
    DATE(BQ_TABLE.created_at, '-05:00') AS created_at,
    DATE_DIFF(DATE(future_message.created_at, '-05:00'), DATE(BQ_TABLE.created_at, '-05:00'), DAY) AS period,
    COUNT(DISTINCT future_message.bot_user_id) AS retained_users,
    BQ_TABLE.bot_id AS bot_id
  FROM BQ_TABLE
  LEFT JOIN BQ_TABLE AS future_message 
    ON BQ_TABLE.bot_user_id = future_message.bot_user_id
    AND BQ_TABLE.created_at < future_message.created_at
    AND TIMESTAMP_ADD(BQ_TABLE.created_at, interval 720 HOUR) >= future_message.created_at
    AND BQ_TABLE.bot_id = future_message.bot_id 
  GROUP BY 1, 2, bot_id
  HAVING period IS NOT NULL
) t
JOIN cohort_active_user_count AS cohort_size 
  ON t.created_at = cohort_size.created_at
  AND t.bot_id = cohort_size.bot_id 
WHERE t.bot_id = 80
ORDER BY created_at, period, bot_id  

Step 2

The "further optimization" below is based on the assumption that your BQ_TABLE is raw data, with multiple entries for the same bot_user_id/bot_id on the same day, which greatly increases the cost of the LEFT JOIN in the inner SELECT.
So I suggest aggregating that first, as below. Besides drastically reducing the size of the JOIN, it also eliminates all the TIMESTAMP-to-DATE conversions on every joined row:

WITH BQ_TABLE_AGG AS (
  SELECT bot_id, bot_user_id, DATE(BQ_TABLE.created_at, '-05:00') AS created_at
  FROM BQ_TABLE
  GROUP BY 1, 2, 3
),
cohort_active_user_count AS (
  SELECT 
    created_at,
    COUNT(DISTINCT bot_user_id) AS count,
    bot_id AS bot_id
  FROM BQ_TABLE_AGG
  GROUP BY created_at, bot_id
)
SELECT t.created_at, period AS period,
  cohort_size.count AS active_users, retained_users, 
  retained_users / cohort_size.count AS retention, t.bot_id
FROM (
  SELECT 
    BQ_TABLE_AGG.created_at AS created_at,
    DATE_DIFF(future_message.created_at, BQ_TABLE_AGG.created_at, DAY) AS period,
    COUNT(DISTINCT future_message.bot_user_id) AS retained_users,
    BQ_TABLE_AGG.bot_id AS bot_id
  FROM BQ_TABLE_AGG
  LEFT JOIN BQ_TABLE_AGG AS future_message 
    ON BQ_TABLE_AGG.bot_user_id = future_message.bot_user_id
    AND BQ_TABLE_AGG.created_at < future_message.created_at
    AND DATE_ADD(BQ_TABLE_AGG.created_at, INTERVAL 30 DAY) >= future_message.created_at
    AND BQ_TABLE_AGG.bot_id = future_message.bot_id 
  GROUP BY 1, 2, bot_id
  HAVING period IS NOT NULL
) t
JOIN cohort_active_user_count AS cohort_size 
  ON t.created_at = cohort_size.created_at
  AND t.bot_id = cohort_size.bot_id 
WHERE t.bot_id = 80
ORDER BY created_at, period, bot_id

Answer 1 (score: 0)

If you don't want to enable a higher billing tier given the costs, here are a couple of suggestions that might help to reduce the CPU requirements:

  • Use INNER JOINs rather than LEFT JOINs if you can. INNER JOINs should generally be less CPU-intensive, but then again you won't get unmatched rows like you would with LEFT JOINs.
  • Use APPROX_COUNT_DISTINCT(expr) instead of COUNT(DISTINCT expr). You won't get an exact count, but it's less CPU-intensive and may be "good enough" depending on your needs (see the sketch after this list).
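As a sketch, here is what the second substitution could look like applied to the cohort CTE from the question (an illustration, not code from the answer):

-- Sketch: the question's cohort CTE with APPROX_COUNT_DISTINCT swapped in.
-- APPROX_COUNT_DISTINCT is based on HyperLogLog++, so the result is an
-- estimate of the distinct count rather than an exact value.
WITH cohort_active_user_count AS (
  SELECT
    DATE(created_at, '-05:00') AS created_at,
    APPROX_COUNT_DISTINCT(bot_user_id) AS count,
    bot_id
  FROM BQ_TABLE
  GROUP BY created_at, bot_id
)
SELECT * FROM cohort_active_user_count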

You could also consider manually breaking the query into stages of computation, e.g. write the WITH clause statement to a table, then use that in the subsequent query. I don't know what the specific cost tradeoffs would be, though.
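For example, materializing the CTE as its own table might look like this (a sketch; my_dataset is a placeholder, and it uses BigQuery's newer DDL support — at the time of this answer you would have written the results to a destination table via the UI or the bq CLI instead):

-- Sketch: materialize the expensive aggregation once, then reuse it.
-- my_dataset is a placeholder for your dataset name.
CREATE OR REPLACE TABLE my_dataset.cohort_active_user_count AS
SELECT
  DATE(created_at, '-05:00') AS created_at,
  COUNT(DISTINCT bot_user_id) AS count,
  bot_id
FROM BQ_TABLE
GROUP BY created_at, bot_id;

-- Subsequent queries then join against the small materialized table instead
-- of recomputing the aggregation inside a WITH clause every run.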

Answer 2 (score: 0)

Why is this tagged MySQL?

In MySQL, I would change

max(cohort_size.count) as active_users, -- all equal in group

to

( SELECT max(count) FROM cohort_active_user_count WHERE ... ) as active_users,

and remove the JOIN to that table. Failing to do that, you risk inflating the COUNT(...) values!

Also move the division that derives retention into the outer query.

Once that's done, you can also move the other JOIN into a subquery:

( SELECT count(distinct future_message.bot_user_id)
    FROM ... WHERE ... ) as retained_users,
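Putting those pieces together, the reworked query might look something like this (my sketch of the answer's suggestions, not the answerer's code; it assumes cohort_active_user_count exists as a real table, keeps the self-join that drives the grouping, and uses a 30-day window like the original):

-- Sketch: the cohort-size JOIN becomes a correlated subquery, so join
-- fan-out cannot inflate the distinct counts, and the division that derives
-- retention happens in the outer query.
SELECT
  t.created_at, t.period, t.retained_users, t.bot_id,
  ( SELECT c.count
      FROM cohort_active_user_count AS c
     WHERE c.bot_id = t.bot_id
       AND c.created_at = t.created_at ) AS active_users,
  t.retained_users / ( SELECT c.count
                         FROM cohort_active_user_count AS c
                        WHERE c.bot_id = t.bot_id
                          AND c.created_at = t.created_at ) AS retention
FROM (
  SELECT DATE(m.created_at) AS created_at,
         DATEDIFF(f.created_at, m.created_at) AS period,
         COUNT(DISTINCT f.bot_user_id) AS retained_users,
         m.bot_id
    FROM BQ_TABLE AS m
    JOIN BQ_TABLE AS f
      ON f.bot_user_id = m.bot_user_id
     AND f.bot_id = m.bot_id
     AND f.created_at > m.created_at
     AND f.created_at <= m.created_at + INTERVAL 30 DAY
   GROUP BY 1, 2, m.bot_id
) AS t
WHERE t.bot_id = 80
ORDER BY t.created_at, t.period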

I would have these indexes. Note that created_at needs to be last:

cohort_active_user_count:  INDEX(bot_id, created_at)
future_message:            INDEX(bot_id, bot_user_id, created_at)
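In DDL form those might be written as follows (a sketch; since future_message is just a self-join alias, its index belongs on the base table, and the index names are placeholders):

-- Composite indexes per the recommendation, with created_at last so the
-- range condition follows the equality columns.
ALTER TABLE cohort_active_user_count ADD INDEX idx_bot_created (bot_id, created_at);
ALTER TABLE BQ_TABLE ADD INDEX idx_bot_user_created (bot_id, bot_user_id, created_at);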