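I noticed that the query below runs slowly, and after looking at it in detail I am wondering why Redshift first scans both tables (events and contact) and only then joins them. The contact table has more than 300,000 rows. My expectation was that Redshift would first scan the large events table using the filters specified for it, and then look up the matching contacts by Contact_ID. Is my expectation wrong? Is there anything else I can do to speed up the query? I have run VACUUM and ANALYZE on all tables.
Query: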
select c.Segment
, Count (Distinct (CASE WHEN et.Event_ID = 1 THEN et.Contact_ID ELSE null END)) as L1
, Count (Distinct (CASE WHEN et.Event_ID = 2 THEN et.Contact_ID ELSE null END)) as L2
from
Events et
join contact c on c.Account_ID = et.Account_ID and c.ID = et.Contact_ID
where
et.Account_ID = 5
and et.Event_ID in (1, 2)
and et.IsGuest = 0
and et.dim_date_id >=20151125
and et.dim_date_id <=20160226
group by c.Segment
order by 1
Explain plan:
XN Merge (cost=1000000074927.82..1000000074927.83 rows=1 width=20)
  -> XN Network (cost=1000000074927.82..1000000074927.83 rows=1 width=20)
    -> XN Sort (cost=1000000074927.82..1000000074927.83 rows=1 width=20)
      -> XN HashAggregate (cost=74927.80..74927.81 rows=1 width=20)
        -> XN Merge Join DS_DIST_NONE (cost=0.00..74927.57 rows=31 width=20)
          -> XN Seq Scan on contact c (cost=0.00..497.56 rows=39805 width=16)
          -> XN Seq Scan on eventtransaction et (cost=0.00..6664.84 rows=136 width=20)
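For reference, the maintenance and plan inspection described above correspond roughly to the statements below. This is a sketch, assuming the table names used in the query (Events and contact); the plan itself refers to the events table as eventtransaction, so the real name may differ.

```sql
-- Reclaim space and re-sort rows, then refresh planner statistics.
-- VACUUM cannot run inside a transaction block.
VACUUM Events;
ANALYZE Events;
VACUUM contact;
ANALYZE contact;

-- Prefixing the query with EXPLAIN prints the plan without running the query.
EXPLAIN
select c.Segment
, Count (Distinct (CASE WHEN et.Event_ID = 1 THEN et.Contact_ID ELSE null END)) as L1
, Count (Distinct (CASE WHEN et.Event_ID = 2 THEN et.Contact_ID ELSE null END)) as L2
from Events et
join contact c on c.Account_ID = et.Account_ID and c.ID = et.Contact_ID
where et.Account_ID = 5
and et.Event_ID in (1, 2)
and et.IsGuest = 0
and et.dim_date_id >= 20151125
and et.dim_date_id <= 20160226
group by c.Segment
order by 1;
```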
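Answer 0: (score: 0)
The filters are applied only after the join has been performed. If you want the join to happen after the filters are applied, I suggest creating a temporary (derived) table from the filtered events and joining that to the contact table, as in the code below.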
select c.Segment
, Count (Distinct (CASE WHEN et.Event_ID = 1 THEN et.Contact_ID ELSE null END)) as L1
, Count (Distinct (CASE WHEN et.Event_ID = 2 THEN et.Contact_ID ELSE null END)) as L2
from
(
select Event_ID, Account_ID, Contact_ID
FROM Events
WHERE
Account_ID = 5
and Event_ID in (1, 2)
and IsGuest = 0
and dim_date_id >=20151125
and dim_date_id <=20160226
)et
join contact c on c.Account_ID = et.Account_ID and c.ID = et.Contact_ID
group by c.Segment
order by 1
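The query above uses an inline derived table. If an actual temporary table is preferred, the same idea could be sketched as follows. The name filtered_events is hypothetical, and whether this is faster than the original query depends on the data, so it is worth comparing the EXPLAIN output of both versions.

```sql
-- Step 1: materialize only the event rows that pass the filters.
-- filtered_events is a hypothetical name; the table is dropped at session end.
CREATE TEMP TABLE filtered_events AS
SELECT Event_ID, Account_ID, Contact_ID
FROM Events
WHERE Account_ID = 5
  AND Event_ID IN (1, 2)
  AND IsGuest = 0
  AND dim_date_id >= 20151125
  AND dim_date_id <= 20160226;

-- Step 2: join the small filtered set to contact and aggregate.
SELECT c.Segment
     , COUNT(DISTINCT CASE WHEN fe.Event_ID = 1 THEN fe.Contact_ID END) AS L1
     , COUNT(DISTINCT CASE WHEN fe.Event_ID = 2 THEN fe.Contact_ID END) AS L2
FROM filtered_events fe
JOIN contact c
  ON c.Account_ID = fe.Account_ID
 AND c.ID = fe.Contact_ID
GROUP BY c.Segment
ORDER BY 1;
```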
Also, if you set a sort key on dim_date_id, you should see this query speed up. For more details on this, see here
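One way to try the sort key suggestion is sketched below, assuming the events table is named Events; events_by_date is a hypothetical name for a re-sorted copy. The plan in the question shows a Merge Join, which relies on the tables' existing sort order, so changing the sort key may change the join strategy. Re-running EXPLAIN against the copy before swapping anything is advisable.

```sql
-- Check the current distribution style, leading sort key,
-- percentage of unsorted rows, and staleness of statistics.
SELECT "table", diststyle, sortkey1, unsorted, stats_off
FROM svv_table_info
WHERE "table" IN ('events', 'contact');

-- Build a copy of the events table sorted by dim_date_id
-- (events_by_date is a hypothetical name, not from the original post).
CREATE TABLE events_by_date
SORTKEY (dim_date_id)
AS
SELECT * FROM Events;

-- Refresh statistics on the new table before comparing plans.
ANALYZE events_by_date;
```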