Clustering: does sorting data before ingestion improve performance on a truncated table?

Time: 2019-12-12 01:33:02

Tags: snowflake-data-warehouse

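My goal is to ingest data sorted on a specific column so that the partitions are laid out in that order too, which should make pruning on that column more efficient.

I want to minimize the cost of sorting, and I would like some guidance on how often I should recluster.

For example: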

CREATE TABLE test_order(n NUMBER, s STRING);
INSERT INTO test_order 
VALUES 
   (12, 'a'), 
   (11, 'b'), 
   (10, 'c'), 
   (9, 'd'), 
   (8, 'e'), 
   (7, 'f'), 
   (6, 'g'), 
   (5, 'h'), 
   (6, 'i'), 
   (5, 'j'), 
   (4, 'k'), 
   (3, 'l'), 
   (2, 'm'), 
   (1, 'n');

SELECT * FROM test_order 
ORDER BY n ASC;

ALTER TABLE test_order CLUSTER BY (n, s);
ALTER TABLE test_order RECLUSTER;

SELECT n, s FROM test_order;
SELECT SYSTEM$CLUSTERING_INFORMATION('test_order', '(n,s)');
  

Here is the clustering information after the first insert:

{
  "cluster_by_keys" : "LINEAR(N, S)",
  "total_partition_count" : 1,
  "total_constant_partition_count" : 0,
  "average_overlaps" : 0.0,
  "average_depth" : 1.0,
  "partition_depth_histogram" : {
    "00000" : 0,
    "00001" : 1,
    "00002" : 0,
    "00003" : 0,
    "00004" : 0,
    "00005" : 0,
    "00006" : 0,
    "00007" : 0,
    "00008" : 0,
    "00009" : 0,
    "00010" : 0,
    "00011" : 0,
    "00012" : 0,
    "00013" : 0,
    "00014" : 0,
    "00015" : 0,
    "00016" : 0
  }
}

  

Here is the second insert and the clustering information after it:

INSERT INTO test_order 
VALUES 
   (12, 'p'), 
   (11, 'f'), 
   (10, 'z'), 
   (9, 'y'), 
   (8, 'x'), 
   (7, 'w'), 
   (6, 'v'), 
   (5, 'u'), 
   (6, 't'), 
   (5, 's'), 
   (4, 'r'), 
   (3, 'q'), 
   (2, 'p'), 
   (1, 'o');

{
  "cluster_by_keys" : "LINEAR(N, S)",
  "total_partition_count" : 2,
  "total_constant_partition_count" : 0,
  "average_overlaps" : 1.0,
  "average_depth" : 2.0,
  "partition_depth_histogram" : {
    "00000" : 0,
    "00001" : 0,
    "00002" : 2,
    "00003" : 0,
    "00004" : 0,
    "00005" : 0,
    "00006" : 0,
    "00007" : 0,
    "00008" : 0,
    "00009" : 0,
    "00010" : 0,
    "00011" : 0,
    "00012" : 0,
    "00013" : 0,
    "00014" : 0,
    "00015" : 0,
    "00016" : 0
  }
}

And after the second recluster:

{
  "cluster_by_keys" : "LINEAR(N, S)",
  "total_partition_count" : 2,
  "total_constant_partition_count" : 0,
  "average_overlaps" : 1.0,
  "average_depth" : 2.0,
  "partition_depth_histogram" : {
    "00000" : 0,
    "00001" : 0,
    "00002" : 2,
    "00003" : 0,
    "00004" : 0,
    "00005" : 0,
    "00006" : 0,
    "00007" : 0,
    "00008" : 0,
    "00009" : 0,
    "00010" : 0,
    "00011" : 0,
    "00012" : 0,
    "00013" : 0,
    "00014" : 0,
    "00015" : 0,
    "00016" : 0
  }
}

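Sorry, I'm new to formatting. After inserting in a specific order, the clustering ratio did not change much. Is that because my sample dataset is too small, or does insert order not matter for clustering performance?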

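1 answer:

Answer 0: (score: 1)

If you are ingesting sorted data, I don't think you need to cluster the table. Your data will be naturally clustered, and you will get the pruning you want.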
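One caveat that may explain the numbers in the question: sorting inside a batch probably does not help when every batch spans the same range of n. Both inserts cover 1 through 12, so the two partitions overlap completely. The Python sketch below is a simplified model, not Snowflake's actual computation. It assumes each insert lands in a single partition, looks only at column n rather than the full (n, s) key, and the helper names (to_partition, covering, overlap_stats) are made up for illustration.

```python
from typing import List, Sequence, Tuple

Partition = Tuple[int, int]  # (min, max) of column n within one partition


def to_partition(n_values: Sequence[int]) -> Partition:
    """Summarize one ingested batch by its min/max (assumed: one batch, one partition)."""
    return (min(n_values), max(n_values))


def covering(partitions: List[Partition], point: int) -> int:
    """Count partitions whose [min, max] range contains the point."""
    return sum(1 for lo, hi in partitions if lo <= point <= hi)


def overlap_stats(partitions: List[Partition]) -> Tuple[float, float]:
    """Return (average_overlaps, average_depth) under this simplified model."""
    endpoints = [p for part in partitions for p in part]
    overlaps, depths = [], []
    for i, (lo, hi) in enumerate(partitions):
        # Other partitions whose range intersects this one.
        overlaps.append(sum(
            1 for j, (olo, ohi) in enumerate(partitions)
            if j != i and olo <= hi and lo <= ohi
        ))
        # Deepest stack of partitions at any point inside this range.
        depths.append(max(
            covering(partitions, p) for p in endpoints if lo <= p <= hi
        ))
    n = len(partitions)
    return (sum(overlaps) / n, sum(depths) / n)


# The n values from each INSERT in the question: both batches span 1..12.
batch = [12, 11, 10, 9, 8, 7, 6, 5, 6, 5, 4, 3, 2, 1]
full_range = [to_partition(batch), to_partition(batch)]

# Hypothetical alternative: each batch covers its own slice of n.
disjoint = [to_partition(range(1, 7)), to_partition(range(7, 13))]

print(overlap_stats(full_range))  # (1.0, 2.0)
print(overlap_stats(disjoint))    # (0.0, 1.0)
print(covering(full_range, 5), covering(disjoint, 5))  # 2 1
```

Under these assumptions the full-range case reproduces the average_overlaps of 1.0 and average_depth of 2.0 reported after the second insert, and a filter such as n = 5 has to look at both partitions. When each batch covers its own slice of n, the depth stays at 1 and the same filter touches only one partition. So what seems to matter for pruning is that successive loads arrive in key order relative to each other, for example loads ordered by a date column, and not only that rows are sorted within a single load.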