I have a fact table t_session, which simplified looks like this:
+------------+----------+-------------+
| start_hm   | end_hm   | device_id   |
|------------+----------+-------------|
| 0          | 10       | 111         |
| 2          | 10       | 112         |
| 12         | 20       | 113         |
| 60         | 90       | 111         |
| 60         | 90       | 112         |
+------------+----------+-------------+
I also have a dimension table dim_time with 1440 records, covering hours 0-23 and minutes 0-59; it therefore contains every hour-minute combination of a day. tk is a number in the range 0-1439.
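For reference, a table like dim_time can be filled with a cross join of an hour series and a minute series. This is a sketch for Postgres; on Redshift, generate_series runs only on the leader node, so the rows would typically be loaded from a file or built from a pre-existing numbers table instead:

```sql
-- Populate dim_time with all 1440 hour/minute combinations (Postgres).
-- tk = hour * 60 + minute, giving the 0-1439 key used in the question.
INSERT INTO dim_time (tk, hour, minute)
SELECT h * 60 + m, h, m
FROM generate_series(0, 23) AS h,
     generate_series(0, 59) AS m;
```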
+------+--------+----------+
| tk   | hour   | minute   |
|------+--------+----------|
| 0    | 0      | 0        |
| 1    | 0      | 1        |
| 2    | 0      | 2        |
............................
| 60   | 1      | 0        |
| 61   | 1      | 1        |
| 62   | 1      | 2        |
............................
| 120  | 2      | 0        |
| 121  | 2      | 1        |
| 122  | 2      | 2        |
............................
+------+--------+----------+
I want to count the number of active device_ids per minute. In the real application there is also another table dim_date and six more joins, but let's keep the question simple.
A device is active in every time slot between start_hm and end_hm. Both start_hm and end_hm have values between 0 and 1439.
select tk, count(distinct device_id)
from t_session
join dim_time on tk between start_hm and end_hm
group by tk
order by tk;
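To make the semantics concrete: for the five sample sessions above (and with tk included in the select list), the output works out to

```
 tk | count
----+-------
  0 |     1     (only device 111)
  1 |     1
  2 |     2     (111 and 112)
 .. |    ..
 10 |     2
 12 |     1     (113; tk = 11 produces no row, since this is an inner join)
 .. |    ..
 20 |     1
 60 |     2     (111 and 112 again)
 .. |    ..
 90 |     2
```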
This query runs hellishly slowly. When I look at the execution plan, it complains about a nested loop.
+--------------------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN |
|--------------------------------------------------------------------------------------------------------------------------------|
| XN Limit (cost=1000002000820.94..1000002000820.97 rows=10 width=8) |
| -> XN Merge (cost=1000002000820.94..1000002000821.44 rows=200 width=8) |
| Merge Key: tk |
| -> XN Network (cost=1000002000820.94..1000002000821.44 rows=200 width=8) |
| Send to leader |
| -> XN Sort (cost=1000002000820.94..1000002000821.44 rows=200 width=8) |
| Sort Key: tk |
| -> XN HashAggregate (cost=2000812.80..2000813.30 rows=200 width=8) |
| -> XN Subquery Scan volt_dt_0 (cost=2000764.80..2000796.80 rows=3200 width=8) |
| -> XN HashAggregate (cost=2000764.80..2000764.80 rows=3200 width=8) |
| -> XN Nested Loop DS_BCAST_INNER (cost=0.00..2000748.80 rows=3200 width=8) |
| Join Filter: (("outer".tk <= "inner".end_hm) AND ("outer".tk >= "inner".start_hm)) |
| -> XN Seq Scan on dim_time (cost=0.00..28.80 rows=2880 width=4) |
| -> XN Seq Scan on t_session (cost=0.00..0.10 rows=10 width=12) |
| ----- Nested Loop Join in the query plan - review the join predicates to avoid Cartesian products ----- |
+--------------------------------------------------------------------------------------------------------------------------------+
I understand where the nested loop comes from: it has to loop over t_session for every record in dim_time.

Is it possible to modify the query to avoid the nested loop and improve the performance?
UPDATE: the same query runs very fast on Postgres, and its execution plan has no Cartesian product.
+--------------------------------------------------------------------------------------------------------------+
| QUERY PLAN |
|--------------------------------------------------------------------------------------------------------------|
| Limit (cost=85822.07..85839.17 rows=10 width=12) |
| -> GroupAggregate (cost=85822.07..88284.47 rows=1440 width=12) |
| Group Key: dim_time.tk |
| -> Sort (cost=85822.07..86638.07 rows=326400 width=8) |
| Sort Key: dim_time.tk |
| -> Nested Loop (cost=0.00..51467.40 rows=326400 width=8) |
| Join Filter: ((dim_time.tk >= t_session.start_hm) AND (dim_time.tk <= t_session.end_hm)) |
| -> Seq Scan on t_session (cost=0.00..30.40 rows=2040 width=12) |
| -> Materialize (cost=0.00..32.60 rows=1440 width=4) |
| -> Seq Scan on dim_time (cost=0.00..25.40 rows=1440 width=4) |
+--------------------------------------------------------------------------------------------------------------+
UPDATE 2: The t_session table has the device_id column as DISTKEY and the start_date column (not shown in the simplified example) as SORTKEY; sessions are naturally ordered by start_date. The dim_time table has tk as SORTKEY and DISTSTYLE ALL.
On Redshift, the execution time is 5-6 minutes for about 40,000 sessions per day, versus a few seconds on Postgres.

The Redshift cluster has two dc2.large nodes.