Question

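I create many small staging tables in Redshift as part of an ETL process. Each table has 50-100 rows (on average) and about 100 columns. When I query how much disk space each staging table needs, every column takes up the same amount of space, and far more than it should need. For example, 6 MB for 59 BOOLEAN values. I have tried multiple permutations of: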

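Column data types (varchar, timestamp, etc.)
Column encodings (lzo, bytedict, etc.)
Load styles (individual inserts, deep copy, etc.)
Repeated VACUUM between all of the above steps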

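Nothing seems to change the amount of space these staging tables need. Why doesn't Redshift compress these tables more aggressively? Can I configure this in Redshift? Or should I just force everything into one large staging table?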

I use this query to determine disk space:

select name
    , col
    , sum(num_values) as num_values
    , count(blocknum) as size_in_mb
from svv_diskusage
group by name
    , col

Answer 1

Since the block size in Redshift is 1 MB, every column takes up at least 1 MB. On top of this, if the DISTSTYLE is EVEN, usage will be closer to one block per column for each slice in the database. Since there is no way to tweak the block size in Redshift, there is no way to reduce the size of an empty table below (number of columns) * (slices containing data for each column) * 1 MB.

Answer 2

Basically,

For tables created with the KEY or EVEN distribution style:

Minimum table size = block_size (1 MB) * (number_of_user_columns + 3 system columns) * number_of_populated_slices * number_of_table_segments

For tables created with the ALL distribution style:

Minimum table size = block_size (1 MB) * (number_of_user_columns + 3 system columns) * number_of_cluster_nodes * number_of_table_segments

number_of_table_segments is 1 for an unsorted table and 2 for a table defined with a sort key.
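
To make the arithmetic concrete, here is a minimal Python sketch of the two formulas above. The function name `min_table_size_mb`, its parameters, and the example cluster layouts (6 populated slices, 2 nodes) are assumptions made for illustration; they are not taken from the question or from any Redshift API.

```python
BLOCK_SIZE_MB = 1   # Redshift block size, as stated in the answers
SYSTEM_COLUMNS = 3  # hidden system columns, as used in the formulas above


def min_table_size_mb(user_columns, dist_style, populated_slices=0,
                      cluster_nodes=0, has_sort_key=False):
    """Return the minimum table size in MB according to the formulas above.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    # A table with a sort key has 2 segments, an unsorted table has 1.
    segments = 2 if has_sort_key else 1
    style = dist_style.upper()
    if style in ("KEY", "EVEN"):
        spread = populated_slices  # one block per column per populated slice
    elif style == "ALL":
        spread = cluster_nodes     # one copy of the table per node
    else:
        raise ValueError(f"unsupported distribution style: {dist_style}")
    return BLOCK_SIZE_MB * (user_columns + SYSTEM_COLUMNS) * spread * segments


# Assumed layout: 100 user columns, EVEN distribution, rows on 6 slices, no sort key.
print(min_table_size_mb(100, "EVEN", populated_slices=6))                 # 618
# Assumed layout: 100 user columns, ALL distribution, 2 nodes, with a sort key.
print(min_table_size_mb(100, "ALL", cluster_nodes=2, has_sort_key=True))  # 412
```

In the first assumed layout each column occupies 6 blocks of 1 MB, which would be consistent with the 6 MB per column reported in the question, although the question does not state the actual number of slices or whether a sort key was defined.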

Redshift tables: all columns take up the same amount of disk space
