Redshift query returns too many rows in an aggregated join

Date: 2018-06-19 14:53:27

Tags: sql join amazon-redshift

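I'm sure I must be missing something obvious. I'm trying to line up two tables with different measurement data for analysis, and when I join the two tables my counts come out far too high.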

Here is the correct count from table1:
select line_item_id,sum(is_imp) as imps 
from table1 
where line_item_id=5993252 
group by 1;

Here is the correct count from table2:
select cs_line_item_id,sum(grossImpressions) as cs_imps
from table2 
where cs_line_item_id=5993252 
group by 1;

When I join the tables together, my counts become inaccurate:

select a.line_item_id,sum(a.is_imp) as imps,sum(c.grossImpressions) as cs_imps
from table1 a join table2 c
ON a.line_item_id=c.cs_line_item_id
where a.line_item_id=5993252
group by 1;

I'm using aggregation, grouping, and filtering, so I'm not sure where this is going wrong. The schema for these tables was attached as an image.

2 answers:

Answer 0: (score: 2)

-- Aggregate each table to one row per id first, then join the two results.
select a.*, b.cs_imps as table2_imps
from (select line_item_id, sum(is_imp) as imps
      from table1
      group by 1
     ) a
join (select cs_line_item_id, sum(grossImpressions) as cs_imps
      from table2
      group by 1
     ) b
  on a.line_item_id = b.cs_line_item_id;

Answer 1: (score: 1)

You are generating a Cartesian product for each line_item_id. There are two relatively simple ways to fix this: one uses a full join, the other uses union all.
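To see the Cartesian product directly, here is a small diagnostic sketch using the table and column names from the question. If table1 has m rows for the id and table2 has n, the join produces m × n rows, so sum(a.is_imp) comes out n times too large and sum(c.grossImpressions) m times too large.

```sql
-- Count the rows each table holds for the line item in question.
select 'table1' as source, count(*) as row_count
from table1
where line_item_id = 5993252
union all
select 'table2' as source, count(*) as row_count
from table2
where cs_line_item_id = 5993252;
```

Any fix therefore has to collapse each table to one row per id before the two are combined.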

select line_item_id, sum(imps) as imps, sum(grossImpressions) as cs_imps
from ((select a.line_item_id, sum(is_imp) as imps, 0 as grossImpressions
       from table1 a
       where a.line_item_id = 5993252
       group by a.line_item_id
      ) union all
      (select c.cs_line_item_id as line_item_id, 0 as imps, sum(grossImpressions) as grossImpressions
       from table2 c
       where c.cs_line_item_id = 5993252
       group by c.cs_line_item_id
      )
     ) ac
group by line_item_id;

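You can remove the where clauses from the subqueries to get totals for every line_item_id. Note that this works even when one table or the other has no matching rows for a given line_item_id.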

For performance, you really do want to filter before the group by.
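The full join alternative is not spelled out above. A minimal sketch, assuming the column names from the question (line_item_id and is_imp in table1, cs_line_item_id and grossImpressions in table2), might look like this. The coalesce(..., 0) defaults are an assumption; drop them if you would rather see nulls when one side has no rows.

```sql
-- Aggregate each table to one row per id, then full join the results
-- so an id present in only one table still appears in the output.
select coalesce(a.line_item_id, c.cs_line_item_id) as line_item_id,
       coalesce(a.imps, 0) as imps,
       coalesce(c.cs_imps, 0) as cs_imps
from (select line_item_id, sum(is_imp) as imps
      from table1
      where line_item_id = 5993252
      group by line_item_id
     ) a
full join
     (select cs_line_item_id, sum(grossImpressions) as cs_imps
      from table2
      where cs_line_item_id = 5993252
      group by cs_line_item_id
     ) c
  on a.line_item_id = c.cs_line_item_id;
```

Because both subqueries return at most one row per id, the join is one-to-one and neither sum gets multiplied.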