Cassandra table design for a query with ORDER, LIMIT and an IN predicate

Time: 2018-11-09 10:10:53

Tags: sql database-design cassandra cql

I have data that looks like this:

select * from test;

 department | employee | batch_number | hash
------------+----------+--------------+-------
 dep1       | Bart     |            1 | hash1
 dep1       | Bart     |            1 | hash2
 dep1       | Lisa     |            3 | hash3
 dep1       | Lisa     |            4 | hash4
 dep1       | John     |            5 | hash5
 dep1       | Lucy     |            6 | hash6
 dep1       | Bart     |            7 | hash7
 dep1       | Bart     |            7 | hash8

I want to query the data with a where clause on batch_number, ordering on batch_number, and an in predicate on employee.

In a relational database, this would look like:

select * from test where department='dep1' and employee in ('Bart','Lucy','John') and batch_number >= 2 order by batch_number desc limit 3;

 department | employee | batch_number | hash
------------+----------+--------------+-------
 dep1       | Bart     |            7 | hash7
 dep1       | Bart     |            7 | hash8
 dep1       | Lucy     |            6 | hash6

I'm having some trouble modelling a table for this query in Cassandra. department will be my partition key, and hash has to be part of the primary key. But I'm struggling with the clustering keys and/or an (SSTable-attached) secondary index.

Since I want to order on batch_number, I tried making it a clustering key:

CREATE TABLE keyspace.test(
    department      TEXT,
    batch_number    INT,
    hash            TEXT,
    employee        TEXT,
    PRIMARY KEY ((department), batch_number, hash)
) WITH CLUSTERING ORDER BY (batch_number DESC);

CREATE INDEX tst_emp ON keyspace.test (employee);

But this doesn't allow queries with an in predicate on my index:

select * from keyspace.test where department='dep1' and employee in ('Bart','Lucy','John');
InvalidRequest: Error from server: code=2200 [Invalid query] message="IN predicates on non-primary-key columns (employee) is not yet supported"

So I also tried adding the employee column as a clustering key:

CREATE TABLE keyspace.test(
    department      TEXT,
    batch_number    INT,
    hash            TEXT,
    employee        TEXT,
    PRIMARY KEY ((department), batch_number, hash, employee)
) WITH CLUSTERING ORDER BY (batch_number DESC);

But this fails because I can't put a non-EQ relation on batch_number:

select * from keyspace.test where department='dep1' and batch_number > 1 and employee in ('Bart','Lucy','John');
InvalidRequest: Error from server: code=2200 [Invalid query] message="Clustering column "employee" cannot be restricted (preceding column "batch_number" is restricted by a non-EQ relation)"

But whenever I put employee before batch_number, I lose the ability to order on batch_number:

CREATE TABLE keyspace.test(
    department      TEXT,
    employee        TEXT,
    batch_number    INT,
    hash            TEXT,
    PRIMARY KEY ((department), employee, batch_number, hash)
);

select * from keyspace.test where department='dep1' and employee in ('Bart','Lucy','John') ORDER BY batch_number DESC;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Order by currently only support the ordering of columns following their declared order in the PRIMARY KEY"

So what table design would allow such a query? Can this be done in Cassandra?

Edit:

Other queries I'd like to be able to run on this table are:

select * from keyspace.test where department='X' and batch_number=Y

2 Answers:

Answer 0: (score: 2)

Using a materialized view, you can rearrange the data:

CREATE MATERIALIZED VIEW mv_test AS 
SELECT
   department,
   batch_number,
   employee,
   hash 
FROM
   test 
WHERE
   department IS NOT NULL 
   AND batch_number IS NOT NULL 
   AND employee IS NOT NULL 
   AND hash IS NOT NULL 
PRIMARY KEY (department, employee, batch_number, hash) 
WITH CLUSTERING ORDER BY (batch_number DESC);

I can then run the following query:

SELECT * FROM mv_test 
WHERE
   department = 'dep1' 
   AND employee IN 
   (
      'Bart',
      'Lisa'
   )
   AND batch_number > 3;

The results are sorted according to the clustering order:

 department | employee | batch_number | hash
------------+----------+--------------+-------
       dep1 |     Bart |            7 | hash7
       dep1 |     Bart |            7 | hash8
       dep1 |     Lisa |            4 | hash4

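Although the > clause is a non-equality clause, IN is still deterministic even though it takes multiple values, which is why I believe you can filter on the key without a problem. Since batch_number is the last thing you filter on, any kind of where clause is allowed on it. I'm assuming you always have the department.

Note that materialized views impact performance, more specifically write performance. However, read performance compares favorably with using ALLOW FILTERING.

Update:

The order specified at the end of the materialized view is on batch_number. However, rows are sorted first on department, then on employee, and only then on batch_number, so a global ordering on batch_number is not guaranteed. As far as I know, there is no way around this. Another database solution may be preferable.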
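
To make the ordering caveat concrete, here is a small sketch that mimics the view's layout in plain Python instead of talking to Cassandra. It assumes that rows inside a partition are kept sorted by employee ascending, then batch_number descending, then hash ascending, and that an IN restriction returns the matching rows in that stored order. The helper `read_view` and the in-memory `ROWS` list are hypothetical and only mirror the sample data from the question.

```python
# Sample rows from the question: (department, employee, batch_number, hash).
ROWS = [
    ("dep1", "Bart", 1, "hash1"),
    ("dep1", "Bart", 1, "hash2"),
    ("dep1", "Lisa", 3, "hash3"),
    ("dep1", "Lisa", 4, "hash4"),
    ("dep1", "John", 5, "hash5"),
    ("dep1", "Lucy", 6, "hash6"),
    ("dep1", "Bart", 7, "hash7"),
    ("dep1", "Bart", 7, "hash8"),
]


def read_view(rows, department, employees, min_batch_exclusive):
    """Hypothetical stand-in for reading mv_test; not a driver call.

    Assumes the view keeps each partition sorted by employee ASC,
    batch_number DESC, hash ASC, and returns rows in that order.
    """
    matching = [
        r for r in rows
        if r[0] == department
        and r[1] in employees
        and r[2] > min_batch_exclusive
    ]
    return sorted(matching, key=lambda r: (r[1], -r[2], r[3]))


# With batch_number > 0 the result is grouped per employee, so
# batch_number is descending only within each employee's slice.
result = read_view(ROWS, "dep1", {"Bart", "Lisa"}, 0)
batch_numbers = [r[2] for r in result]
print(batch_numbers)  # [7, 7, 1, 1, 4, 3]
```

With the stricter filter batch_number > 3 the same sketch returns Bart 7, Bart 7, Lisa 4, which only looks globally sorted because of the particular sample data.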

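Update 2:

As mentioned in the Apache mailing list thread (see the comments below), materialized views are not considered production-ready. Datastax, however, does consider them usable, provided you follow the best practices mentioned above. Personally, I haven't had any trouble with materialized views. Granted, that was on a simple single-datacenter cluster, and since the best practices mention more complex setups, they may well break down in those cases.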

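Answer 1: (score: 1)

If you want, you can use an index on employee and even remove it from the primary key. You will probably have to give up on IN, but you can split the query and join the results on the client side.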

CREATE TABLE tk.test_good(
    department      TEXT,
    batch_number    INT,
    employee        TEXT,
    hash            TEXT,
    PRIMARY KEY ((department), batch_number, hash)
)WITH CLUSTERING ORDER BY (batch_number DESC);

CREATE INDEX IF NOT EXISTS employee_idx ON tk.test_good ( employee );

select * from tk.test_good where department='dep1' and employee='Bart' and batch_number >= 2 limit 3;
select * from tk.test_good where department='dep1' and employee='Lucy' and batch_number >= 2 limit 3;
select * from tk.test_good where department='dep1' and employee='John' and batch_number >= 2 limit 3;

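The downside of this approach is that the index may become too big. But I don't know the size of your data, so that is for you to judge.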
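
The client-side join could be sketched as follows, in plain Python without a driver. It assumes each per-employee query has already been executed and that its rows arrive sorted by batch_number descending, as the table's clustering order suggests. The function name `merge_top` and the hard-coded result lists are hypothetical.

```python
import heapq
from itertools import islice


def merge_top(per_employee_results, limit):
    """Merge result lists that are each sorted by batch_number DESC
    and keep the first `limit` rows of the combined ordering."""
    merged = heapq.merge(
        *per_employee_results,
        key=lambda row: row["batch_number"],
        reverse=True,  # inputs are sorted from largest to smallest
    )
    return list(islice(merged, limit))


# Hypothetical rows, as the three queries above would return them
# for the sample data in the question.
bart = [
    {"employee": "Bart", "batch_number": 7, "hash": "hash7"},
    {"employee": "Bart", "batch_number": 7, "hash": "hash8"},
]
lucy = [{"employee": "Lucy", "batch_number": 6, "hash": "hash6"}]
john = [{"employee": "John", "batch_number": 5, "hash": "hash5"}]

top3 = merge_top([bart, lucy, john], limit=3)
for row in top3:
    print(row["employee"], row["batch_number"], row["hash"])
```

Because every sub-query is already capped at 3 rows, the top 3 of the merged lists contains the same rows a single ordered query over all three employees would return.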