I have a table in Redshift with several billion rows that looks like this:
CREATE TABLE channels (
    fact_key TEXT NOT NULL distkey,
    job_key BIGINT,
    channel_key TEXT NOT NULL
)
diststyle key
compound sortkey(job_key, channel_key);
When I query by job_key + channel_key, the full sort key correctly restricts my seq scan, as long as I use specific values for channel_key in the query.
EXPLAIN
SELECT * FROM channels scd
WHERE scd.job_key = 1 AND scd.channel_key IN ('1234', '1235', '1236', '1237')
XN Seq Scan on channels scd (cost=0.00..3178474.92 rows=3428929 width=77)
  Filter: ((((channel_key)::text = '1234'::text) OR ((channel_key)::text = '1235'::text) OR ((channel_key)::text = '1236'::text) OR ((channel_key)::text = '1237'::text)) AND (job_key = 1))
However, if I filter channel_key using IN with a subquery, Redshift does not use the sort key.
EXPLAIN
SELECT * FROM channels scd
WHERE scd.job_key = 1 AND scd.channel_key IN (select distinct channel_key from other_channel_list where job_key = 14 order by 1)
XN Hash IN Join DS_DIST_ALL_NONE (cost=3.75..3540640.36 rows=899781 width=77)
  Hash Cond: (("outer".channel_key)::text = ("inner".channel_key)::text)
  -> XN Seq Scan on channels scd (cost=0.00..1765819.40 rows=141265552 width=77)
       Filter: (job_key = 1)
  -> XN Hash (cost=3.75..3.75 rows=1 width=402)
       -> XN Subquery Scan "IN_subquery" (cost=0.00..3.75 rows=1 width=402)
            -> XN Unique (cost=0.00..3.74 rows=1 width=29)
                 -> XN Seq Scan on other_channel_list (cost=0.00..3.74 rows=1 width=29)
                      Filter: (job_key = 14)
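Is it possible to make this work? My end goal is to turn this into a view, so defining my list of channel_keys ahead of time won't work.
Edit to provide more context:
This is part of a larger query, and the results of this get hash joined with some other data. If I hard-code the channel_keys, the input to the hash join is about 2 million rows. If I use the IN condition with the subquery (no other changes), the input to the hash join is 400 million rows. Total query time goes from about 40 seconds to 15+ minutes.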
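One way to check whether a scan was actually range-restricted, rather than inferring it from the EXPLAIN output, is to run the query and then look at the scan steps Redshift recorded for it. A minimal sketch, assuming the SVL_QUERY_SUMMARY system view and pg_last_query_id() are accessible to your user and that this runs in the same session right after the query under test:

```sql
-- Run the query under test first, then inspect its scan steps.
-- is_rrscan = 't' means the step used a range-restricted scan.
SELECT query, seg, step, label, is_rrscan, rows
FROM svl_query_summary
WHERE query = pg_last_query_id()
  AND label LIKE 'scan%'
ORDER BY seg, step;
```

Comparing the scan step on channels between the hard-coded IN list and the subquery version should show whether only job_key, or both sort key columns, narrowed the rows read. If the query was served from the result cache, pg_last_query_id() may not point at a freshly executed query, so treat the output with that caveat.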
Answer 0 (score: 0)
Does this give you a better plan than the subquery version?
with other_channels as (
select distinct channel_key from other_channel_list where job_key = 14 order by 1
)
SELECT *
FROM channels scd
JOIN other_channels ocd on scd.channel_key = ocd.channel_key
WHERE scd.job_key = 1
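Since the stated goal is a view, the join form could be wrapped along these lines. This is only a sketch: the view name channels_for_job_1 is hypothetical, the column list assumes channels has just the three columns shown in the question, and there is no guarantee the planner will treat the join any differently from the IN subquery.

```sql
-- Hypothetical view name; explicit column list avoids the duplicate
-- channel_key column that SELECT * would produce across the join.
CREATE VIEW channels_for_job_1 AS
WITH other_channels AS (
    -- DISTINCT keeps the join from multiplying rows, matching IN semantics
    SELECT DISTINCT channel_key
    FROM other_channel_list
    WHERE job_key = 14
)
SELECT scd.fact_key, scd.job_key, scd.channel_key
FROM channels scd
JOIN other_channels ocd ON scd.channel_key = ocd.channel_key
WHERE scd.job_key = 1;
```

The ORDER BY from the original subquery is dropped here because ordering inside the CTE does not affect which rows the join returns. Running EXPLAIN on a SELECT from the view should show whether the plan differs from the Hash IN Join in the question.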