I have a fact table t_session, which simplified looks like this:
+------------+----------+-------------+
| start_hm   | end_hm   | device_id   |
|------------+----------+-------------|
| 0          | 10       | 111         |
| 2          | 10       | 112         |
| 12         | 20       | 113         |
| 60         | 90       | 111         |
| 60         | 90       | 112         |
+------------+----------+-------------+
I also have a dimension table dim_time with 1440 records, covering hours 0-23 and minutes 0-59; it therefore contains every hour-minute combination of a day. tk is a number in the range 0-1439.
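For reference, a table like dim_time can be filled with a cross join of an hour series and a minute series. This is a sketch for Postgres; on Redshift, generate_series runs only on the leader node, so the rows would typically be loaded from a file or built from a pre-existing numbers table instead:

```sql
-- Populate dim_time with all 1440 hour/minute combinations (Postgres).
-- tk = hour * 60 + minute, giving the 0-1439 key used in the question.
INSERT INTO dim_time (tk, hour, minute)
SELECT h * 60 + m, h, m
FROM generate_series(0, 23) AS h,
     generate_series(0, 59) AS m;
```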
+------+--------+----------+
| tk   | hour   | minute   |
|------+--------+----------|
| 0    | 0      | 0        |
| 1    | 0      | 1        |
| 2    | 0      | 2        |
............................
| 60   | 1      | 0        |
| 61   | 1      | 1        |
| 62   | 1      | 2        |
............................
| 120  | 2      | 0        |
| 121  | 2      | 1        |
| 122  | 2      | 2        |
............................
+------+--------+----------+
I want to count the number of active device_ids per minute. In the real application there is also another table dim_date and six more joins, but let's keep the question simple.
A device is active in every time slot between start_hm and end_hm. Both start_hm and end_hm have values between 0 and 1439.
select tk, count(distinct device_id)
from t_session
join dim_time on tk between start_hm and end_hm
group by tk
order by tk;
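To make the semantics concrete: for the five sample sessions above (and with tk included in the select list), the output works out to

```
 tk | count
----+-------
  0 |     1     (only device 111)
  1 |     1
  2 |     2     (111 and 112)
 .. |    ..
 10 |     2
 12 |     1     (113; tk = 11 produces no row, since this is an inner join)
 .. |    ..
 20 |     1
 60 |     2     (111 and 112 again)
 .. |    ..
 90 |     2
```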
This query runs hellishly slowly. When I look at the execution plan, it complains about a nested loop.
+--------------------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN |
|--------------------------------------------------------------------------------------------------------------------------------|
| XN Limit (cost=1000002000820.94..1000002000820.97 rows=10 width=8) |
| -> XN Merge (cost=1000002000820.94..1000002000821.44 rows=200 width=8) |
| Merge Key: tk |
| -> XN Network (cost=1000002000820.94..1000002000821.44 rows=200 width=8) |
| Send to leader |
| -> XN Sort (cost=1000002000820.94..1000002000821.44 rows=200 width=8) |
| Sort Key: tk |
| -> XN HashAggregate (cost=2000812.80..2000813.30 rows=200 width=8) |
| -> XN Subquery Scan volt_dt_0 (cost=2000764.80..2000796.80 rows=3200 width=8) |
| -> XN HashAggregate (cost=2000764.80..2000764.80 rows=3200 width=8) |
| -> XN Nested Loop DS_BCAST_INNER (cost=0.00..2000748.80 rows=3200 width=8) |
| Join Filter: (("outer".tk <= "inner".end_hm) AND ("outer".tk >= "inner".start_hm)) |
| -> XN Seq Scan on dim_time (cost=0.00..28.80 rows=2880 width=4) |
| -> XN Seq Scan on t_session (cost=0.00..0.10 rows=10 width=12) |
| ----- Nested Loop Join in the query plan - review the join predicates to avoid Cartesian products ----- |
+--------------------------------------------------------------------------------------------------------------------------------+
I understand where the nested loop comes from: it has to loop over t_session for every record in dim_time.

Is it possible to modify the query to avoid the nested loop and improve the performance?
UPDATE: the same query runs very fast on Postgres, and its execution plan has no Cartesian product.
+--------------------------------------------------------------------------------------------------------------+
| QUERY PLAN |
|--------------------------------------------------------------------------------------------------------------|
| Limit (cost=85822.07..85839.17 rows=10 width=12) |
| -> GroupAggregate (cost=85822.07..88284.47 rows=1440 width=12) |
| Group Key: dim_time.tk |
| -> Sort (cost=85822.07..86638.07 rows=326400 width=8) |
| Sort Key: dim_time.tk |
| -> Nested Loop (cost=0.00..51467.40 rows=326400 width=8) |
| Join Filter: ((dim_time.tk >= t_session.start_hm) AND (dim_time.tk <= t_session.end_hm)) |
| -> Seq Scan on t_session (cost=0.00..30.40 rows=2040 width=12) |
| -> Materialize (cost=0.00..32.60 rows=1440 width=4) |
| -> Seq Scan on dim_time (cost=0.00..25.40 rows=1440 width=4) |
+--------------------------------------------------------------------------------------------------------------+
UPDATE 2: The t_session table has the device_id column as DISTKEY and the start_date column (not shown in the simplified example) as SORTKEY; sessions are naturally ordered by start_date. The dim_time table has tk as SORTKEY and DISTSTYLE ALL.
On Redshift, the execution time is 5-6 minutes for about 40,000 sessions per day, versus a few seconds on Postgres.

The Redshift cluster has two dc2.large nodes.