My data looks like this:
select * from test;
 department | employee | batch_number | hash
------------+----------+--------------+-------
 dep1       | Bart     |            1 | hash1
 dep1       | Bart     |            1 | hash2
 dep1       | Lisa     |            3 | hash3
 dep1       | Lisa     |            4 | hash4
 dep1       | John     |            5 | hash5
 dep1       | Lucy     |            6 | hash6
 dep1       | Bart     |            7 | hash7
 dep1       | Bart     |            7 | hash8
I would like to query the data with a where clause on batch_number, ordering on batch_number, and an in predicate on employee.
In a relational database this would look like:
select * from test
where department='dep1'
and employee in ('Bart','Lucy','John')
and batch_number >= 2
order by batch_number desc
limit 3;
 department | employee | batch_number | hash
------------+----------+--------------+-------
 dep1       | Bart     |            7 | hash7
 dep1       | Bart     |            7 | hash8
 dep1       | Lucy     |            6 | hash6
I am having trouble modeling a table for this query in Cassandra. The department column will be my partition key and hash has to be part of the primary key, but I am struggling with the clustering keys and/or the (SSTable-attached) secondary index.
Because I want to order on batch_number, I tried making it a clustering key:
CREATE TABLE keyspace.test(
department TEXT,
batch_number INT,
hash TEXT,
employee TEXT,
PRIMARY KEY ((department), batch_number, hash)
) WITH CLUSTERING ORDER BY (batch_number DESC);
CREATE INDEX tst_emp ON keyspace.test (employee);
However, this does not allow queries with an in predicate on my index:
select * from keyspace.test where department='dep1' and employee in ('Bart','Lucy','John');
InvalidRequest: Error from server: code=2200 [Invalid query] message="IN predicates on non-primary-key columns (employee) is not yet supported"
So I also tried adding the employee column as a clustering key:
CREATE TABLE keyspace.test(
department TEXT,
batch_number INT,
hash TEXT,
employee TEXT,
PRIMARY KEY ((department), batch_number, hash, employee)
) WITH CLUSTERING ORDER BY (batch_number DESC);
But this fails because I cannot put a non-EQ relation on batch_number:
select * from keyspace.test where department='dep1' and batch_number > 1 and employee in ('Bart','Lucy','John');
InvalidRequest: Error from server: code=2200 [Invalid query] message="Clustering column "employee" cannot be restricted (preceding column "batch_number" is restricted by a non-EQ relation)"
However, whenever I put employee before batch_number, I lose the ability to order on batch_number:
CREATE TABLE keyspace.test(
department TEXT,
employee TEXT,
batch_number INT,
hash TEXT,
PRIMARY KEY ((department), employee, batch_number, hash)
);
select * from keyspace.test where department='dep1' and employee in ('Bart','Lucy','John') ORDER BY batch_number DESC;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Order by currently only support the ordering of columns following their declared order in the PRIMARY KEY"
So which table design would allow such a query? Can this be done in Cassandra?
Edit:
Other queries I would like to be able to run on this table include:
select * from keyspace.test where department='X' and batch_number=Y
Answer 0 (score: 2)
Using a materialized view, you can rearrange the data:
CREATE MATERIALIZED VIEW mv_test AS
SELECT department, batch_number, employee, hash
FROM test
WHERE department IS NOT NULL
AND batch_number IS NOT NULL
AND employee IS NOT NULL
AND hash IS NOT NULL
PRIMARY KEY (department, employee, batch_number, hash)
WITH CLUSTERING ORDER BY (batch_number DESC);
I can then run the following query:
SELECT * FROM mv_test
WHERE
department = 'dep1'
AND employee IN
(
'Bart',
'Lisa'
)
AND batch_number > 3;
The results are sorted according to the clustering order:
 department | employee | batch_number | hash
------------+----------+--------------+-------
 dep1       | Bart     |            7 | hash7
 dep1       | Bart     |            7 | hash8
 dep1       | Lisa     |            4 | hash4
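Although the > clause is a non-equality clause, IN is still deterministic even with multiple values, which is why I believe you can filter on the key without a problem.
Since batch_number is the last thing you filter on, any kind of where clause is allowed on it. I am assuming you always have department.
Note that materialized views impact performance, more specifically write performance. Read performance, however, benefits compared to ALLOW FILTERING.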
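Update:
The order specified at the end of the materialized view is batch_number. However, the view sorts first on department, then on employee, and only then on batch_number, so ordering by batch_number across employees is not guaranteed. As far as I know there is no way around this. A different database solution may be preferable.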
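The ordering problem described in the update can be reproduced without a cluster. The following is only an in-memory sketch in Python, not a Cassandra call: it assumes the view returns rows in employee ascending, batch_number descending, hash ascending order (consistent with the sample output above), and it uses the rows and the filter from the question. The names ROWS, WANTED and view_order are made up for the illustration.

```python
# Rows from the question: (department, employee, batch_number, hash).
ROWS = [
    ("dep1", "Bart", 1, "hash1"),
    ("dep1", "Bart", 1, "hash2"),
    ("dep1", "Lisa", 3, "hash3"),
    ("dep1", "Lisa", 4, "hash4"),
    ("dep1", "John", 5, "hash5"),
    ("dep1", "Lucy", 6, "hash6"),
    ("dep1", "Bart", 7, "hash7"),
    ("dep1", "Bart", 7, "hash8"),
]

WANTED = {"Bart", "Lucy", "John"}


def view_order(rows):
    """Mimic the assumed clustering order of the view:
    employee ASC, batch_number DESC, hash ASC."""
    return sorted(rows, key=lambda r: (r[1], -r[2], r[3]))


# Same filter as the question's query: department, IN on employee, batch_number >= 2.
matching = [r for r in ROWS if r[0] == "dep1" and r[1] in WANTED and r[2] >= 2]
as_returned = view_order(matching)

# Taking the first 3 rows in view order keeps John (5) and drops Lucy (6).
limited_in_view_order = as_returned[:3]

# Re-sorting the whole result on the client restores batch_number DESC.
# sorted() is stable, so rows with equal batch_number keep their hash order.
resorted = sorted(as_returned, key=lambda r: r[2], reverse=True)[:3]

for row in resorted:
    print(row)
```

The catch is that the client has to read every matching row before it can re-sort, so a LIMIT on the server side cannot be used to cut the result down.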
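Update 2:
As noted in the Apache mailing thread (see the comments below), materialized views are not considered production-ready. Datastax, however, considers them usable, with the caveat that the best practices mentioned above are followed. Personally, I have had no trouble with materialized views. Admittedly, that was on a simple single-datacenter cluster, and given that the best practices mention more complex setups, they may well break down in those cases.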
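Answer 1 (score: 1)
If needed, you can use an index on employee and even remove it from the primary key. You may have to give up IN, but you can split the query and join the results on the client side.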
CREATE TABLE tk.test_good(
department TEXT,
batch_number INT,
employee TEXT,
hash TEXT,
PRIMARY KEY ((department), batch_number, hash)
)WITH CLUSTERING ORDER BY (batch_number DESC);
CREATE INDEX IF NOT EXISTS employee_idx ON tk.test_good ( employee );
select * from tk.test_good where department='dep1' and employee='Bart' and batch_number >= 2 limit 3;
select * from tk.test_good where department='dep1' and employee='Lucy' and batch_number >= 2 limit 3;
select * from tk.test_good where department='dep1' and employee='John' and batch_number >= 2 limit 3;
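The downside of this approach is that the index may grow too large. I do not know the size of your data set, though, so you will have to judge that yourself.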
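Joining the results on the client side could look roughly like the Python sketch below. It is an illustration under assumptions rather than driver code: fetch_for_employee is a hypothetical stand-in for one of the per-employee queries above and reads from an in-memory copy of the question's rows, and it is assumed that each query returns its rows in batch_number descending order, as the table's clustering order suggests.

```python
import heapq
from itertools import islice

# Rows from the question: (department, employee, batch_number, hash).
ROWS = [
    ("dep1", "Bart", 1, "hash1"),
    ("dep1", "Bart", 1, "hash2"),
    ("dep1", "Lisa", 3, "hash3"),
    ("dep1", "Lisa", 4, "hash4"),
    ("dep1", "John", 5, "hash5"),
    ("dep1", "Lucy", 6, "hash6"),
    ("dep1", "Bart", 7, "hash7"),
    ("dep1", "Bart", 7, "hash8"),
]


def fetch_for_employee(department, employee, min_batch, limit):
    """Hypothetical stand-in for one per-employee query.

    Returns at most `limit` rows in batch_number DESC, hash ASC order.
    """
    matches = [
        r for r in ROWS
        if r[0] == department and r[1] == employee and r[2] >= min_batch
    ]
    matches.sort(key=lambda r: r[3])                # hash ASC
    matches.sort(key=lambda r: r[2], reverse=True)  # stable: batch_number DESC
    return matches[:limit]


def top_rows(department, employees, min_batch, limit):
    """Run one query per employee and merge the already sorted results."""
    per_employee = [
        fetch_for_employee(department, e, min_batch, limit) for e in employees
    ]
    # heapq.merge with reverse=True expects each input sorted largest first.
    merged = heapq.merge(*per_employee, key=lambda r: r[2], reverse=True)
    return list(islice(merged, limit))


result = top_rows("dep1", ["Bart", "Lucy", "John"], min_batch=2, limit=3)
for row in result:
    print(row)
```

Because each per-employee result is already limited to 3 rows, the merge never handles more than 3 rows per employee, and the first 3 merged rows are the overall top 3 by batch_number.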