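BigQuery: avoiding multiple subqueries

Date: 2018-06-16 02:14:04

Tags: sql google-bigquery

We are developing an application that stores requests in one table and responses in another (of course). Each request can have multiple responses, and the request ID is stored in both tables.

Initially, I thought we could use a LEFT JOIN from requests to responses to count the totals matching each criterion: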

SELECT source, COUNT(*) as requests, COUNT(responses.request_id) as responses
FROM DATASET.requests
LEFT JOIN DATASET.responses ON requests.id = responses.request_id
WHERE source = "source1"
GROUP BY source

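There are 70 requests that match the WHERE criteria, and 30 responses that belong to them. The expected output is "source1, 70, 30". I have since learned more about how JOIN behaves: what I got instead was "source1, 259, 207". There are duplicate IDs on both sides.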
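
The inflated numbers are join fan-out: each duplicate request row pairs with every response that shares its ID, so COUNT(*) counts joined pairs rather than requests. A minimal sketch that reproduces the effect, assuming SQLite (through Python's sqlite3) as a stand-in for BigQuery. The table and column names follow the question; the rows are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE requests (id TEXT, source TEXT);
    CREATE TABLE responses (request_id TEXT);
    -- Hypothetical data: request 'a' is stored twice and has two responses.
    INSERT INTO requests VALUES ('a', 'source1'), ('a', 'source1'),
                                ('b', 'source1'), ('c', 'source2');
    INSERT INTO responses VALUES ('a'), ('a'), ('c');
""")

# The LEFT JOIN from the question: each 'a' request row pairs with each
# 'a' response row (2 x 2 = 4 joined rows), plus one unmatched row for 'b'.
joined = conn.execute("""
    SELECT source, COUNT(*) AS requests,
           COUNT(responses.request_id) AS responses
    FROM requests
    LEFT JOIN responses ON requests.id = responses.request_id
    WHERE source = 'source1'
    GROUP BY source
""").fetchone()

# Counting each table on its own gives the totals the question expects.
request_rows = conn.execute(
    "SELECT COUNT(*) FROM requests WHERE source = 'source1'"
).fetchone()[0]
response_rows = conn.execute("""
    SELECT COUNT(*) FROM responses
    WHERE request_id IN (SELECT id FROM requests WHERE source = 'source1')
""").fetchone()[0]

print(joined, request_rows, response_rows)
```

With these rows the join reports 5 requests and 4 responses, while the independent counts are 3 and 2.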

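The only way I have been able to get the desired result is to build one giant query out of several complete subqueries, each of which matches against the set of IDs filtered by the given criteria. The filtered ID set is then used to actually pull out our fields, statistics, and so on: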

SELECT * FROM
  (SELECT COUNT(*) as responses FROM DATASET.responses
   WHERE request_id IN (SELECT id FROM DATASET.requests WHERE source = "source1"))
  ,
  (SELECT source, COUNT(*) as requests
   FROM DATASET.requests
   WHERE id IN (SELECT id FROM DATASET.requests WHERE source = "source1")
   GROUP BY source)

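This looks horrible. I have tried using a CTE to collect the list of IDs we want and then filtering with WHERE id / request_id IN (cte.id), but that apparently is not possible unless we join against the CTE, which again produces wrong, multiplied results.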
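
A CTE cannot be referenced as IN (cte.id), but it can be read through a subquery, IN (SELECT id FROM cte), which filters without joining and therefore without multiplying rows. A minimal sketch of that form, assuming it carries over to BigQuery standard SQL once the tables are dataset-qualified. It runs on SQLite through Python's sqlite3 with made-up rows, and the CTE name matched_ids is hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE requests (id TEXT, source TEXT);
    CREATE TABLE responses (request_id TEXT, is_billed INTEGER);
    -- Hypothetical data: request 'a' is duplicated and has two responses.
    INSERT INTO requests VALUES ('a', 'source1'), ('a', 'source1'),
                                ('b', 'source1'), ('c', 'source2');
    INSERT INTO responses VALUES ('a', 1), ('a', 0), ('c', 1);
""")

# The ID pool is defined once; each statistic is a scalar subquery that
# filters its own table against the pool, so no join is involved.
row = conn.execute("""
    WITH matched_ids AS (
        SELECT DISTINCT id FROM requests WHERE source = 'source1'
    )
    SELECT
        (SELECT COUNT(*) FROM requests
          WHERE id IN (SELECT id FROM matched_ids)) AS requests,
        (SELECT COUNT(*) FROM responses
          WHERE request_id IN (SELECT id FROM matched_ids)) AS responses,
        (SELECT COUNT(*) FROM responses
          WHERE is_billed = 1
            AND request_id IN (SELECT id FROM matched_ids)) AS billed
""").fetchone()

print(row)
```

Further statistics become additional scalar subqueries over the same ID pool, so the filter criteria are written only once.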

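Since we want to add more statistics to the query, which will require more WHERE clauses, I am worried that this monster will keep growing and become hard to work with.

If there is a better way, please let me know. Thanks!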

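Edit: example schema, as requested

Requests

id (String), source (String), partner_ids (Integer array), user_agent (String), timestamp (Timestamp), ...

Responses

request_id (String, from requests.id), partner_id (Integer), is_billed (boolean), price_charged (float, null if is_billed = false), response_categories (String array, not from requests), ...

The challenge is that we mostly have to query the Requests table to get the list of ID values that match our criteria, and then query each table for the statistics used in reporting (e.g. counts, counting is_billed, and so on). We may also need to build the ID pool from criteria on each table (e.g. where requests.source = 'source1' and responses.response_categories IN 'action').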

3 answers:

Answer 0: (score: 0)

I think you can do what you want using union all and group by:

-- Written before the schema was posted; assumes responses has a source column.
select source, sum(requests) as requests, sum(responses) as responses
from ((select source, count(*) as requests, 0 as responses
       from dataset.requests
       group by source
      ) union all
      (select source, 0 as requests, count(*) as responses
       from dataset.responses
       group by source
      )
     ) rr
group by source;

This does the calculation for all sources.

Edit:

For the revised version, just use an additional join:

select source, sum(requests) as requests, sum(responses) as responses
from ((select rq.source, count(*) as requests, 0 as responses
       from dataset.requests rq
       group by rq.source
      ) union all
      (select rq.source, 0 as requests, count(*) as responses
       from dataset.responses r join
            -- one row per request id, carrying the source needed for grouping
            (select distinct rq.id, rq.source
             from dataset.requests rq
            ) rq
            on r.request_id = rq.id
       group by rq.source
      )
     ) rr
group by source;

If each request has at most one response, you can shorten this to:

select rq.source, count(*) as requests, count(r.request_id) as responses
from dataset.requests rq left join
     dataset.responses r
     on r.request_id = rq.id
group by rq.source

Answer 1: (score: 0)

Maybe I am misunderstanding something, but why not count each table separately and then join on the id?

WITH
    sources
    AS
        (  SELECT COUNT(*) AS source_cnt, id
             FROM dataset.requests
         GROUP BY id),
    responses
    AS
        (  SELECT COUNT(*) AS response_cnt, request_id AS id
             FROM dataset.responses
         GROUP BY request_id)
SELECT source_cnt, response_cnt, sources.id
  FROM sources INNER JOIN responses ON sources.id = responses.id;

If you want to keep all records, you can change it to a full outer join:

WITH
    sources
    AS
        (  SELECT COUNT(*) AS source_cnt, id
             FROM dataset.requests
         GROUP BY id),
    responses
    AS
        (  SELECT COUNT(*) AS response_cnt, request_id AS id
             FROM dataset.responses
         GROUP BY request_id)
SELECT COALESCE(sources.id, responses.id) AS id, source_cnt, response_cnt
  FROM sources FULL OUTER JOIN responses ON sources.id = responses.id
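
To get from per-id counts to the per-source totals the question asks for, the pre-aggregated rows can be summed after the join. Because each side has at most one row per ID by then, the join can no longer multiply anything. A minimal sketch, assuming every request ID belongs to exactly one source (otherwise its responses would be counted once per source). It uses SQLite through Python's sqlite3 with made-up rows, and the CTE names req and resp are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE requests (id TEXT, source TEXT);
    CREATE TABLE responses (request_id TEXT);
    -- Hypothetical data: 'a' is duplicated, 'd' has no responses at all.
    INSERT INTO requests VALUES ('a', 'source1'), ('a', 'source1'),
                                ('b', 'source1'), ('c', 'source2'),
                                ('d', 'source3');
    INSERT INTO responses VALUES ('a'), ('a'), ('c');
""")

rows = conn.execute("""
    WITH req AS (          -- one row per request id
        SELECT id, source, COUNT(*) AS request_cnt
        FROM requests
        GROUP BY id, source
    ),
    resp AS (              -- one row per request id that has responses
        SELECT request_id, COUNT(*) AS response_cnt
        FROM responses
        GROUP BY request_id
    )
    SELECT req.source,
           SUM(req.request_cnt) AS requests,
           COALESCE(SUM(resp.response_cnt), 0) AS responses
    FROM req
    LEFT JOIN resp ON resp.request_id = req.id
    GROUP BY req.source
    ORDER BY req.source
""").fetchall()

print(rows)
```

The LEFT JOIN keeps sources whose requests have no responses, and COALESCE turns their missing sum into 0.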

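Answer 2: (score: 0)

To be honest, I am a little confused about what you ultimately want to see, and I also do not fully understand how you can have 70 requests but only 30 responses if one request can have multiple responses. Are you saying that some requests have 0 responses? Or are you counting distinct responses?

If you want to count the total number of requests and the total number of responses related to those specific requests, I think this slight modification of your code should work: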

SELECT source, COUNT(DISTINCT id) as requests, COUNT(responses.request_id) as responses
FROM `dataset.requests` as requests
LEFT JOIN `dataset.responses` as responses ON requests.id = responses.request_id
WHERE source = "source1"
GROUP BY source
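
One caveat: COUNT(DISTINCT id) repairs only the request count. The response count is still taken from the joined rows, so it is correct only when request IDs are unique. A small check of both cases, assuming SQLite through Python's sqlite3 and made-up rows; the helper name counts is hypothetical:

```python
import sqlite3

QUERY = """
    SELECT source, COUNT(DISTINCT requests.id) AS requests,
           COUNT(responses.request_id) AS responses
    FROM requests
    LEFT JOIN responses ON requests.id = responses.request_id
    WHERE source = 'source1'
    GROUP BY source
"""

def counts(request_rows, response_rows):
    """Load the rows into a fresh in-memory database and run QUERY."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE requests (id TEXT, source TEXT)")
    conn.execute("CREATE TABLE responses (request_id TEXT)")
    conn.executemany("INSERT INTO requests VALUES (?, ?)", request_rows)
    conn.executemany("INSERT INTO responses VALUES (?)", response_rows)
    return conn.execute(QUERY).fetchone()

# Request 'a' has two responses; 'b' has none.
responses = [("a",), ("a",)]

# Unique request IDs: both counts come out as expected.
unique_ids = counts([("a", "source1"), ("b", "source1")], responses)

# Request 'a' stored twice: its two responses are each joined twice.
duplicate_ids = counts(
    [("a", "source1"), ("a", "source1"), ("b", "source1")], responses
)

print(unique_ids, duplicate_ids)
```

With unique IDs the query returns 2 requests and 2 responses. With the duplicated request row it still returns 2 requests but reports 4 responses.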