Question

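We have been using Cassandra for a while now, and we are trying to get a well-optimized table that can be queried and filtered quickly across roughly 100k rows.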

Our model looks like this:

class FailedCDR(Model):  
    uuid = columns.UUID(partition_key=True, primary_key=True)
    num_attempts = columns.Integer(index=True)
    datetime = columns.Integer()

If I describe the table, it clearly shows that num_attempts is indexed:

CREATE TABLE cdrs.failed_cdrs (
    uuid uuid PRIMARY KEY,
    datetime int,
    num_attempts int
) WITH bloom_filter_fp_chance = 0.01
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';
CREATE INDEX index_failed_cdrs_num_attempts ON cdrs.failed_cdrs (num_attempts);

We would like to be able to run a filter like this:

failed = FailedCDR.filter(num_attempts__lte=9)

But this is what happens:

QueryException: Where clauses require either a "=" or "IN" comparison with either a primary key or indexed field

How can we accomplish something like this?

Answer 1

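If you want to do a range query in CQL, the field has to be a clustering column. As the error message says, a secondary index only supports "=" and "IN" comparisons here, which is why the "lte" filter is rejected.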

So you want the num_attempts field to be a clustering column.

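Also, if you want to do this in a single query, all the rows you query need to be in the same partition (or in a small number of partitions that you can reach with an IN clause). Since you only have 100K rows, that is small enough to fit in one partition.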

So you could define your table like this:

CREATE TABLE test.failed_cdrs (
    partition int,
    num_attempts int,
    uuid uuid,
    datetime int,
    PRIMARY KEY (partition, num_attempts, uuid));

You would insert data using a constant for the partition key, such as 1:

INSERT INTO failed_cdrs (uuid, datetime, num_attempts, partition)
    VALUES ( now(), 123, 5, 1);

Then you can do range queries like this:

SELECT * from failed_cdrs where partition=1 and num_attempts >=8;
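
To tie this back to the filter in the question (num_attempts <= 9), the same kind of range query can be issued from Python. This is only a minimal sketch, assuming the DataStax Python driver, where Session.execute accepts a CQL string with %s placeholders plus a tuple of parameters. The function name fetch_cdrs_up_to is made up for illustration, and session is assumed to be connected already.

```python
# CQL range query against the single-partition table defined above.
# %s is the placeholder style the DataStax Python driver uses for
# non-prepared statements.
RANGE_QUERY = (
    "SELECT uuid, num_attempts, datetime FROM test.failed_cdrs "
    "WHERE partition = %s AND num_attempts <= %s"
)


def fetch_cdrs_up_to(session, max_attempts, partition=1):
    """Return the rows in `partition` whose num_attempts <= max_attempts.

    `session` is assumed to be an already connected driver session;
    anything with an execute(query, params) method will work.
    """
    return list(session.execute(RANGE_QUERY, (partition, max_attempts)))
```

With a connected session, fetch_cdrs_up_to(session, 9) would play the role of FailedCDR.filter(num_attempts__lte=9) from the question.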

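The downside of this approach is that to change the value of num_attempts, you need to delete the old row and insert a new one, since primary key columns cannot be updated. You can do the delete and the insert in a batch statement.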
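
The delete-and-insert step can be sketched as a single CQL batch sent from Python. This assumes the DataStax Python driver (a session whose execute method takes a CQL string with %s placeholders and a parameter tuple) and the test.failed_cdrs table with PRIMARY KEY (partition, num_attempts, uuid). The helper name change_num_attempts is hypothetical.

```python
# Both statements touch the same partition, so Cassandra applies the
# batch to that partition as one atomic mutation.
MOVE_ROW_BATCH = """
BEGIN BATCH
  DELETE FROM test.failed_cdrs
    WHERE partition = %s AND num_attempts = %s AND uuid = %s;
  INSERT INTO test.failed_cdrs (partition, num_attempts, uuid, datetime)
    VALUES (%s, %s, %s, %s);
APPLY BATCH
"""


def change_num_attempts(session, cdr_uuid, old_attempts, new_attempts,
                        datetime_value, partition=1):
    """Move a row from its old num_attempts value to a new one.

    Returns True if the batch was sent, False if there was nothing to move.
    """
    if old_attempts == new_attempts:
        # The delete and the insert would target the same row; skip.
        return False
    params = (
        partition, old_attempts, cdr_uuid,                  # DELETE
        partition, new_attempts, cdr_uuid, datetime_value,  # INSERT
    )
    session.execute(MOVE_ROW_BATCH, params)
    return True
```

Note that the caller has to know the old num_attempts value, because it is part of the primary key of the row being deleted.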

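A better option, available in Cassandra 3.0, is to create a materialized view that has num_attempts as a clustering column. In that case Cassandra takes care of the delete and insert for you whenever you update num_attempts in the base table. The 3.0 release is currently in beta testing.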
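
A rough sketch of the materialized view approach follows, again driven from Python through the DataStax driver's session.execute. Everything named here is an assumption for illustration: the base table test.failed_cdrs_base, the view test.failed_cdrs_by_attempts, and the helper functions. The base table is keyed on (partition, uuid) because a view's primary key must contain every primary key column of the base table and may add at most one other column, which here is num_attempts.

```python
# Hypothetical base table: num_attempts is a regular column, so it can be
# changed with a plain UPDATE.
CREATE_BASE_TABLE = """
CREATE TABLE IF NOT EXISTS test.failed_cdrs_base (
    partition int,
    uuid uuid,
    num_attempts int,
    datetime int,
    PRIMARY KEY (partition, uuid)
)
"""

# Hypothetical view: same rows, re-keyed so num_attempts is a clustering column.
CREATE_VIEW = """
CREATE MATERIALIZED VIEW IF NOT EXISTS test.failed_cdrs_by_attempts AS
    SELECT partition, num_attempts, uuid, datetime
    FROM test.failed_cdrs_base
    WHERE partition IS NOT NULL
      AND num_attempts IS NOT NULL
      AND uuid IS NOT NULL
    PRIMARY KEY (partition, num_attempts, uuid)
"""

UPDATE_ATTEMPTS = (
    "UPDATE test.failed_cdrs_base SET num_attempts = %s "
    "WHERE partition = %s AND uuid = %s"
)

VIEW_RANGE_QUERY = (
    "SELECT uuid, num_attempts, datetime FROM test.failed_cdrs_by_attempts "
    "WHERE partition = %s AND num_attempts <= %s"
)


def create_schema(session):
    """Create the base table and the view (the keyspace `test` must exist)."""
    session.execute(CREATE_BASE_TABLE)
    session.execute(CREATE_VIEW)


def set_num_attempts(session, cdr_uuid, new_attempts, partition=1):
    """Update the base table; Cassandra maintains the view row itself."""
    session.execute(UPDATE_ATTEMPTS, (new_attempts, partition, cdr_uuid))


def fetch_from_view(session, max_attempts, partition=1):
    """Range query on the view instead of the base table."""
    return list(session.execute(VIEW_RANGE_QUERY, (partition, max_attempts)))
```

Writes go to the base table and range reads go to the view, so the application no longer needs to know the old num_attempts value or issue its own delete.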

Cassandra filter based on secondary index
