Question

In ClickHouse, I created a table with a nested structure:

CREATE TABLE IF NOT EXISTS table_name (
    timestamp Date,
    str_1 String,
    Nested_structure Nested (
        index_array UInt32,
        metric_2 UInt64,
        metric_3 UInt8
    ),
    sign Int8 DEFAULT 1
) ENGINE = CollapsingMergeTree(sign) PARTITION BY (toYYYYMM(timestamp)) ORDER BY (timestamp, str_1)

The query I am going to run looks like this:

 SELECT count(*) AS count FROM table_name
 WHERE (timestamp = '2017-09-01')
 AND
 arrayFirst((i, x) -> x = 7151, Nested_structure.metric_2, Nested_structure.index_array) > 50000

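I want to count the str_1 rows where the value of the (array) column metric_2, at the position where index_array equals 7151, is greater than a given threshold (50000).

I wonder if it's possible to set a primary key on the column index_array, in order to make the query faster.

If I add the column Nested_structure.index_array to the ORDER BY clause, it is treated as an array column of the whole table, not as the individual values of index_array inside Nested_structure.

For example: ORDER BY (timestamp, str_1, Nested_structure.index_array)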

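The algorithm is:

1. Find the index of the given value in index_array.
2. With the index from step (1), retrieve the values from the other arrays.

If index_array is sorted and the table knows about it, step (1) could be faster (for example, by using a binary search algorithm).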
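The two steps above can be written out directly in the query. This is only a sketch of the same lookup, assuming the table from the question: indexOf performs step (1) and the array subscript performs step (2). As far as I know, indexOf returns 0 when the value is absent, and a non-constant out-of-range subscript yields the type's default value (0 here), so such rows fail the threshold. It is still a linear scan of each array, not a binary search.

```sql
-- Step (1): indexOf finds the 1-based position of 7151 in index_array.
-- Step (2): that position is used to read the matching metric_2 value.
SELECT count(*) AS count
FROM table_name
WHERE timestamp = '2017-09-01'
  AND Nested_structure.metric_2[indexOf(Nested_structure.index_array, 7151)] > 50000;
```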

Does anyone have an idea?

=============

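Edit

Cardinality of the columns: str_1 has 15,000,000 distinct values; index_array has 15,000 to 20,000 distinct values.

Assuming the distinct values of index_array are column_1, ..., column_15000, the denormalized table would have the following structure: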

timestamp,
str_1,
column_1a, <--  store values for metric_2
...
column_15000a, <--  store values for metric_2
column_1b, <--  store values for metric_3
...
column_15000b, <--  store values for metric_3

@Amos, could you give me the structure of the table if I use columns of type LowCardinality?

Answer 1

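"I wonder if it's possible to set a primary key on the column index_array, in order to make the query faster."

No, ClickHouse doesn't have array indices. If you provide Nested_structure.index_array as the third argument in the ORDER BY clause, whole rows are ordered with the array column taken into account as a single value. Note that [1,2] < [1,2,3].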
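To see what ordering by an array column means, the comparison from the note can be tried on literals. This is a small sketch that assumes a ClickHouse version in which arrays can be compared with the ordinary comparison operators (lexicographically); the column aliases are made up for illustration.

```sql
-- Arrays are compared element by element; a shorter array that is a
-- prefix of a longer one sorts first.
SELECT
    [1, 2] < [1, 2, 3] AS prefix_sorts_first,
    [1, 3] < [1, 2, 3] AS decided_by_second_element;
```

The sort key therefore depends on the array as a whole, so it does not help to locate one particular value inside the array.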

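You can simply normalize the table so that it has no nested columns, and make the first two columns LowCardinality.

Update

It seems you won't benefit much from the LowCardinality type. What I meant was something like this: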

CREATE TABLE IF NOT EXISTS table_name (
    timestamp Date,
    str_1 String,
    index_array UInt32,
    metric_2 UInt64,
    metric_3 UInt8,
    sign Int8 DEFAULT 1
) ENGINE = CollapsingMergeTree(sign) PARTITION BY (toYYYYMM(timestamp)) ORDER BY (timestamp, str_1, index_array)
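With one row per (str_1, index_array) pair, the array lookup from the question becomes a pair of plain column filters. The query below is a sketch against the normalized table above, reusing the constants 7151 and 50000 from the question.

```sql
-- Each row now holds a single index_array value and its metrics,
-- so no array functions are needed.
SELECT count(*) AS count
FROM table_name
WHERE timestamp = '2017-09-01'
  AND index_array = 7151
  AND metric_2 > 50000;
```

How much the sort key helps here depends on the key order: index_array is the last of the three ORDER BY columns, behind the high-cardinality str_1.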

You can still keep the old insert logic by doing this:

-- Target table: normalized, one row per nested element.
CREATE TABLE IF NOT EXISTS table_name (
    timestamp Date, str_1 String,
    index_array UInt32, metric_2 UInt64, metric_3 UInt8,
    sign Int8 DEFAULT 1
) ENGINE = CollapsingMergeTree(sign) PARTITION BY (toYYYYMM(timestamp)) ORDER BY (timestamp, str_1, index_array);

-- Source table: keeps the old nested layout; the Null engine stores nothing.
CREATE TABLE IF NOT EXISTS source_table (
    timestamp Date, str_1 String,
    Nested_structure Nested (index_array UInt32, metric_2 UInt64, metric_3 UInt8),
    sign Int8 DEFAULT 1
) ENGINE = Null;

-- Materialized view: unfolds each inserted nested row into table_name.
CREATE MATERIALIZED VIEW data_pipe TO table_name AS
SELECT timestamp, str_1,
       Nested_structure.index_array AS index_array,
       Nested_structure.metric_2 AS metric_2,
       Nested_structure.metric_3 AS metric_3,
       sign
FROM source_table ARRAY JOIN Nested_structure;

-- Inserts keep the old nested format.
INSERT INTO source_table VALUES (today(), 'fff', [1,2,3], [2,3,4], [3,4,5], 1);
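To check that the pipeline works, the target table can be read back after the insert. This is a sketch that assumes the three statements and the insert above were run on an otherwise empty table_name.

```sql
-- The single nested row inserted into source_table should appear here
-- as three rows, one per array element.
SELECT timestamp, str_1, index_array, metric_2, metric_3, sign
FROM table_name
ORDER BY index_array;
```

The rows should carry the (index_array, metric_2, metric_3) triples (1, 2, 3), (2, 3, 4) and (3, 4, 5), each with str_1 = 'fff' and sign = 1.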

Primary key in a nested structure on a ClickHouse database
