Question

My situation

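I have some tables in my Redshift cluster that are all broken down by order_id, shipment_id, or shipment_item_id, depending on how granular the table is. order_id has a one-to-many relationship with shipment_id, and shipment_id has a one-to-many relationship with shipment_item_id.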

My question

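I distribute on order_id, so all shipment_id and shipment_item_id records should sit on the same node across the tables, since they are grouped by order_id. My question is: when I have to join on shipment_id or shipment_item_id, will Redshift know that the records are on the same node, or will it still broadcast the tables because they are not joined on order_id?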

Example tables

unified_order                                   shipment_details
+----------+-------------+------------------+   +-------------+-----------+--------------+
| order_id | shipment_id | shipment_item_id |   | shipment_id | ship_day  | ship_details |
+----------+-------------+------------------+   +-------------+-----------+--------------+
|        1 |           1 |                1 |   |           1 | 1/1/2017  | stuff        |
|        1 |           1 |                2 |   |           2 | 5/1/2017  | other stuff  |
|        1 |           1 |                3 |   |           3 | 6/14/2017 | more stuff   |
|        1 |           2 |                4 |   |           4 | 5/13/2017 | less stuff   |
|        1 |           2 |                5 |   |           5 | 6/19/2017 | that stuff   |
|        1 |           3 |                6 |   |           6 | 7/31/2017 | what stuff   |
|        2 |           4 |                7 |   |           7 | 2/5/2017  | things       |
|        2 |           4 |                8 |   +-------------+-----------+--------------+
|        3 |           5 |                9 |   
|        3 |           5 |               10 |   
|        4 |           6 |               11 |   
|        5 |           7 |               12 |   
|        5 |           7 |               13 |   
+----------+-------------+------------------+

Distribution

distribution_by_node
+------+----------+-------------+------------------+
| node | order_id | shipment_id | shipment_item_id |
+------+----------+-------------+------------------+
|    1 |        1 |           1 |                1 |
|    1 |        1 |           1 |                2 |
|    1 |        1 |           1 |                3 |
|    1 |        1 |           2 |                4 |
|    1 |        1 |           2 |                5 |
|    1 |        1 |           3 |                6 |
|    1 |        5 |           7 |               12 |
|    1 |        5 |           7 |               13 |
|    2 |        2 |           4 |                7 |
|    2 |        2 |           4 |                8 |
|    3 |        3 |           5 |                9 |
|    3 |        3 |           5 |               10 |
|    4 |        4 |           6 |               11 |
+------+----------+-------------+------------------+
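
The co-location claim in the question can be checked on the example data with a small sketch. It assumes a hypothetical placement function, `node_for`, which is a simple modulo chosen only so that the result matches the distribution_by_node table above; it is not Redshift's actual hash function. The sketch only tests whether each shipment_id ends up on a single node when rows are placed by order_id. It says nothing about whether the query planner recognizes that property.

```python
from collections import defaultdict

# Rows of unified_order from the example: (order_id, shipment_id, shipment_item_id)
UNIFIED_ORDER = [
    (1, 1, 1), (1, 1, 2), (1, 1, 3), (1, 2, 4), (1, 2, 5), (1, 3, 6),
    (2, 4, 7), (2, 4, 8), (3, 5, 9), (3, 5, 10), (4, 6, 11),
    (5, 7, 12), (5, 7, 13),
]

NUM_NODES = 4  # matches the four nodes in the distribution_by_node table


def node_for(order_id, num_nodes=NUM_NODES):
    """Hypothetical stand-in for the distribution hash (NOT Redshift's real one)."""
    return (order_id - 1) % num_nodes + 1


def nodes_per_shipment(rows):
    """Map each shipment_id to the set of nodes that hold its rows."""
    placement = defaultdict(set)
    for order_id, shipment_id, _item_id in rows:
        placement[shipment_id].add(node_for(order_id))
    return dict(placement)


placement = nodes_per_shipment(UNIFIED_ORDER)
# True when no shipment_id is split across nodes
colocated = all(len(nodes) == 1 for nodes in placement.values())
print(placement)
print("every shipment_id on one node:", colocated)
```

Because each shipment_id belongs to exactly one order_id, any placement that depends only on order_id keeps a shipment's rows together, whatever the hash function is.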

Answer 1

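The Amazon Redshift documentation does not go into detail about how information is shared between nodes, but it is doubtful that it "broadcasts the tables".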

Rather, information is probably sent between nodes as needed: only the relevant columns would be shared, and possibly only a sub-range of the data.

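Rather than worrying too much about the internal implementation, you should test various DISTKEY and SORTKEY strategies against your actual queries to determine performance.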

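Follow the recommendations in Choose the Best Distribution Style to minimize the amount of data that needs to be sent between nodes, and consult Amazon Redshift Best Practices for Designing Queries to improve your queries.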

Answer 2

You can EXPLAIN your query to see how data will be distributed (or not) during execution. This documentation shows how to read the query plan: Evaluating the Query Plan
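
As a rough sketch of how the EXPLAIN output might be inspected, the helper below pulls the DS_* redistribution labels out of plan text and attaches a short description to each. The label names are the ones listed in the Redshift query plan documentation. The function names, the query string, and `SAMPLE_PLAN` are assumptions made for illustration. In particular, `SAMPLE_PLAN` is hand-written to exercise the parser and is not captured Redshift output, so it does not predict which label a join on shipment_id would actually produce.

```python
import re

# Redistribution labels that can appear on join steps in a Redshift plan
DS_LABELS = {
    "DS_DIST_NONE": "no redistribution; matching rows are already collocated",
    "DS_DIST_ALL_NONE": "no redistribution; inner table uses DISTSTYLE ALL",
    "DS_DIST_INNER": "inner table is redistributed",
    "DS_DIST_OUTER": "outer table is redistributed",
    "DS_BCAST_INNER": "inner table is broadcast to all compute nodes",
    "DS_DIST_ALL_INNER": "inner table is redistributed to a single slice",
    "DS_DIST_BOTH": "both tables are redistributed",
}

# Statement one might run against the example tables (illustrative only)
EXPLAIN_QUERY = """
EXPLAIN
SELECT u.order_id, s.ship_day
FROM unified_order u
JOIN shipment_details s ON u.shipment_id = s.shipment_id;
"""

# Hand-written placeholder, NOT real Redshift output
SAMPLE_PLAN = "XN Hash Join DS_BCAST_INNER\n  Hash Cond: (outer.shipment_id = inner.shipment_id)"


def redistribution_labels(plan_text):
    """Return the DS_* labels found in plan text, in order of appearance."""
    return re.findall(r"\bDS_[A-Z_]+\b", plan_text)


def describe_plan(plan_text):
    """Pair each DS_* label with a short description."""
    return [
        (label, DS_LABELS.get(label, "unrecognized label"))
        for label in redistribution_labels(plan_text)
    ]


for label, meaning in describe_plan(SAMPLE_PLAN):
    print(f"{label}: {meaning}")
```

Running the real EXPLAIN statement on the cluster and passing its text to `describe_plan` would show whether the join on shipment_id avoids data movement (DS_DIST_NONE) or triggers a redistribution or broadcast.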

Redshift distribution on child columns
