Question

Suppose I have a table (with automatic reclustering disabled) that is not particularly well clustered:

create or replace table recluster_test3  
(
    id NUMBER
    ,value NUMBER
    ,value_str VARCHAR
)
cluster by (value)
;
alter table recluster_test3 suspend recluster; -- no automatic reclustering
describe table recluster_test3;

insert into recluster_test3  (
 select seq4() as id
         ,uniform(1,20, random()) as value
         ,randstr(10000,random()) as value_str
 FROM TABLE(GENERATOR(ROWCOUNT => 500000)) 
);
show tables like 'recluster_test%';

The clustering information for this newly created table:

select system$clustering_information('recluster_test3');
{
  "average_depth": 367,
  "average_overlaps": 366,
  "cluster_by_keys": "LINEAR(VALUE)",
  "partition_depth_histogram": {
    "00000": 0,
...
    "00512": 367
  },
  "total_constant_partition_count": 0,
  "total_partition_count": 367
}

I can manually recluster the table:

create or replace table recluster_test4 clone recluster_test3;
alter table recluster_test4 suspend recluster; -- no automatic reclustering
alter table recluster_test4 recluster; -- recluster manually
select system$clustering_information('recluster_test4');
{
  "cluster_by_keys" : "LINEAR(VALUE)",
  "total_partition_count" : 394,
  "total_constant_partition_count" : 376,
  "average_overlaps" : 1.7778,
  "average_depth" : 2.0,
  "partition_depth_histogram" : {
    "00000" : 0,
    "00001" : 376,
    "00002" : 18,
    "00003" : 0,
...
  }
}

This reclustering is not complete (it could reach "00001": 367). Is there any way to force a more complete reclustering?
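To compare the state before and after each manual run, it may help to pull only the depth figures out of the clustering output. This is a minimal sketch, assuming the recluster_test4 table from above; the column aliases are illustrative only.

```sql
-- Average clustering depth for the table's defined clustering key.
select system$clustering_depth('recluster_test4') as clustering_depth;

-- system$clustering_information returns a JSON string, so parse it
-- and extract individual fields for easier comparison between runs.
select
    info:average_depth::float                     as average_depth
    ,info:average_overlaps::float                 as average_overlaps
    ,info:total_partition_count::number           as total_partitions
    ,info:total_constant_partition_count::number  as constant_partitions
from (
    select parse_json(system$clustering_information('recluster_test4')) as info
);
```

Running this after every recluster gives a simple way to see whether an additional run still changes the depth.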

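Although reclustering works very well in this case, on a real dataset with 190 TB of data and 4,000,000M rows, each recluster run does not fundamentally improve the clustering depth.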

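So the real question is:

What are the limits of alter table xxx recluster? I believe there are hard limits on how much data can be reclustered in one run and on how much time is spent on each recluster.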

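Note: automatic reclustering is disabled because of cost. Data is continually added to the table, and automatic reclustering was consuming a large number of Snowflake credits.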
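For reference, the credits that automatic reclustering consumed while it was enabled can be inspected with the AUTOMATIC_CLUSTERING_HISTORY table function. This is a sketch under assumptions: 'XXX' is a placeholder for the real table name, and the 7-day window is arbitrary (the function only covers recent history).

```sql
-- Credits and data volume consumed by Automatic Clustering in the last 7 days.
-- 'XXX' is a placeholder; substitute the actual (optionally qualified) table name.
select start_time
      ,end_time
      ,credits_used
      ,num_bytes_reclustered
      ,num_rows_reclustered
from table(information_schema.automatic_clustering_history(
        date_range_start => dateadd('day', -7, current_timestamp())
        ,table_name      => 'XXX'))
order by start_time;
```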

Answer 1

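There are limits on how much is reclustered per run, but Snowflake does not share the exact numbers. My guess is that they are based on:

The size of the virtual warehouse
The size of the table
The clustering statistics on the table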

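Manual reclustering is deprecated, so it is probably a good idea to check with Snowflake Support first; otherwise you may find yourself going down a route that will no longer be supported in the future.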

Also, when you run the alter table <table> recluster command, I believe you can supply a predicate that limits the amount of data reclustered (for example, on a date?), or MAX_SIZE = <budget_in_bytes>.
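A sketch of what that could look like on the test table from the question. The byte budget and the value range are illustrative assumptions, not recommended settings, and because manual reclustering is deprecated the statement may not be accepted on every account.

```sql
-- Recluster only the rows matching the predicate, with a budget of
-- roughly 1 GB of data to recluster in this run.
-- Both the budget and the value range are made-up example values.
alter table recluster_test4 recluster
    max_size = 1000000000
    where value between 1 and 5;

-- Inspect the result afterwards.
select system$clustering_information('recluster_test4');
```

On a large table, the predicate could target a narrow slice (such as recently loaded data) so that each run works on a bounded amount of data.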

What are the limitations of manually reclustering a table in Snowflake?
