Question

My data model:

tid                                  | codes        | raw          | type
-------------------------------------+--------------+--------------+------
a64fdd60-1bc4-11e5-9b30-3dca08b6a366 | {12, 34, 53} | {sdafb=safd} |  cmd

CREATE TABLE MyTable (
tid       TIMEUUID,
type      TEXT,
codes     SET<INT>,
raw       TEXT,
PRIMARY KEY (tid)
);
CREATE INDEX ON myTable (codes);

How can I query the table to return rows based on multiple set values?

This works:

select * from logData where codes contains 34;

But I want to fetch rows based on multiple set values, and none of these work:

select * from logData where codes contains 34, 12; or 
select * from logData where codes contains 34 and 12; or
select * from logData where codes contains {34, 12};

Please help.

Answer 1

If I create your table structure and insert a row similar to yours above, I can check for multiple values in the codes collection like this:

aploetz@cqlsh:stackoverflow2> SELECT * FROM mytable 
    WHERE codes CONTAINS 34 
      AND codes CONTAINS 12
      ALLOW FILTERING;

 tid                                  | codes        | raw          | type
--------------------------------------+--------------+--------------+------
 2569f270-1c06-11e5-92f0-21b264d4c94d | {12, 34, 53} | {sdafb=safd} |  cmd

(1 rows)

Now, as others have mentioned, let me also tell you why this is a terrible idea...

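With a secondary index on a collection (and with cardinality that appears to be fairly high), every node has to be checked for each query. The idea with Cassandra is to query by partition key as often as possible, so that each query only has to hit one node. Richard Low of Apple wrote a great article called "The sweet spot for Cassandra secondary indexes". It should make you rethink the way you use secondary indexes.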

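Secondly, the only way I could get Cassandra to accept this query was to use ALLOW FILTERING. This means that the only way Cassandra can apply all of your filtering criteria (the WHERE clause) is to pull back every row and individually discard the rows that do not match. That is horribly inefficient. To be clear, the ALLOW FILTERING directive is something you should never use.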

In any case, if codes is something you need to query by, then you should design an additional query table with codes as part of the PRIMARY KEY.
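One way to read the "additional query table" advice is a lookup table with one row per code, so that each code is its own partition. The sketch below is only an illustration under that assumption: the table name `mytable_by_code`, its columns, and both helper functions are hypothetical and do not come from the answer. It builds the fan-out INSERT statements as plain strings and shows how the results of one single-partition query per code could be intersected on the client. No Cassandra driver is used here.

```python
from typing import Dict, List, Set

# Hypothetical lookup table (an assumption, not part of the original answer):
#   CREATE TABLE mytable_by_code (
#       code int, tid timeuuid, raw text, type text,
#       PRIMARY KEY (code, tid));


def fan_out_inserts(tid: str, codes: Set[int], raw: str, type_: str) -> List[str]:
    """Build one INSERT per code so that every code gets its own partition.

    Strings are concatenated for illustration only; real code should use
    prepared statements instead of formatting values into CQL text.
    """
    return [
        f"INSERT INTO mytable_by_code (code, tid, raw, type) "
        f"VALUES ({code}, {tid}, '{raw}', '{type_}');"
        for code in sorted(codes)
    ]


def tids_matching_all(rows_by_code: Dict[int, Set[str]], wanted: Set[int]) -> Set[str]:
    """Intersect the tids returned by one single-partition query per wanted code."""
    results = [rows_by_code.get(code, set()) for code in wanted]
    return set.intersection(*results) if results else set()


statements = fan_out_inserts(
    "a64fdd60-1bc4-11e5-9b30-3dca08b6a366", {12, 34, 53}, "{sdafb=safd}", "cmd"
)
# Pretend these are the tids returned by "WHERE code = ?" for each code.
fetched = {12: {"t1", "t2"}, 34: {"t1"}, 53: {"t2"}}
both = tids_matching_all(fetched, {12, 34})
```

The trade-off in this sketch is extra writes (one row per code) in exchange for reads that each touch a single partition.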

Answer 2

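The data model you are using is extremely inefficient. Sets are meant for retrieving a group of values for a given primary key, not the other way around. If the reverse lookup is what you need, you will have to rethink the model itself.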

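I would suggest creating a separate column for each value you currently store in the set, and then using those columns as a composite primary key.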

Answer 3

Are you really looking to get ALL log entries based on just codes? That could be quite a large dataset. Realistically, wouldn't you be looking at specific dates / date ranges? I'd key on that, and then use codes for filtering, or even filter on codes entirely on the client side.
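As a rough illustration of the client-side option, the sketch below assumes the rows have already been fetched by some narrower query (for example a date range) and are available as plain dictionaries with a `codes` entry. The row shape and the `filter_by_codes` helper are assumptions made for this example, not part of any driver API.

```python
from typing import Dict, Iterable, List, Set


def filter_by_codes(rows: Iterable[Dict], required: Set[int]) -> List[Dict]:
    """Keep only rows whose 'codes' set contains every required code."""
    # A missing or null codes column is treated as an empty set.
    return [row for row in rows if required <= set(row.get("codes") or ())]


# Rows as they might look after being fetched for one date range.
rows = [
    {"tid": "t1", "codes": {12, 34, 53}},
    {"tid": "t2", "codes": {34}},
    {"tid": "t3", "codes": None},
]
matches = filter_by_codes(rows, {34, 12})
```

This only stays cheap if the first query already narrows the data down to a manageable number of rows.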

If you have many codes, and you index on the sets, it might result in very high cardinality of the index, which would cause you issues. Whether you have your own lookup table, or use an index, remember that you essentially have a "table" where the pk is the value, and there are rows for that value for every "row" that matches the value. If that looks unacceptably large, then that's exactly what it is.

I'd recommend revisiting the requirement - again...do you really need all log entries EVER that match a certain code combination?

If you really do need to analyse the whole lot, then I'd recommend using Spark to run the job. You could then run a Spark job, and each node would deal with data on the same node; this will significantly reduce the impact compared to doing full table processing entirely in the application.

Answer 4

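I know this is very late. IMO the model needs only minor changes to achieve the desired result. What you can do is store as many rows as there are members of the power set of the set you want to query.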

CREATE TABLE data_points_ks.mytable (
    codes frozen<set<int>>,
    tid timeuuid,
    raw text,
    type text,
    PRIMARY KEY (codes, tid)
) WITH CLUSTERING ORDER BY (tid ASC);

INSERT INTO mytable (tid, codes, raw, type) VALUES (now(), {}, '{sdafb=safd}', 'cmd');
INSERT INTO mytable (tid, codes, raw, type) VALUES (now(), {12}, '{sdafb=safd}', 'cmd');
INSERT INTO mytable (tid, codes, raw, type) VALUES (now(), {34}, '{sdafb=safd}', 'cmd');
INSERT INTO mytable (tid, codes, raw, type) VALUES (now(), {12, 34}, '{sdafb=safd}', 'cmd');
INSERT INTO mytable (tid, codes, raw, type) VALUES (now(), {53}, '{sdafb=safd}', 'cmd');
INSERT INTO mytable (tid, codes, raw, type) VALUES (now(), {12, 53}, '{sdafb=safd}', 'cmd');
INSERT INTO mytable (tid, codes, raw, type) VALUES (now(), {34, 53}, '{sdafb=safd}', 'cmd');
INSERT INTO mytable (tid, codes, raw, type) VALUES (now(), {12, 34, 53}, '{sdafb=safd}', 'cmd');

 tid                                  | codes        | raw          | type
--------------------------------------+--------------+--------------+------
 8ae81763-1142-11e8-846c-cd9226c29754 |     {34, 53} | {sdafb=safd} |  cmd
 8746adb3-1142-11e8-846c-cd9226c29754 |     {12, 53} | {sdafb=safd} |  cmd
 fea77062-1142-11e8-846c-cd9226c29754 |         {34} | {sdafb=safd} |  cmd
 70ebb790-1142-11e8-846c-cd9226c29754 |     {12, 34} | {sdafb=safd} |  cmd
 6c39c843-1142-11e8-846c-cd9226c29754 |         {12} | {sdafb=safd} |  cmd
 65a954f3-1142-11e8-846c-cd9226c29754 |         null | {sdafb=safd} |  cmd
 03c60433-1143-11e8-846c-cd9226c29754 |         {53} | {sdafb=safd} |  cmd
 82f68d70-1142-11e8-846c-cd9226c29754 | {12, 34, 53} | {sdafb=safd} |  cmd

Then the following queries are sufficient, and no filtering is needed.

SELECT * FROM mytable 
WHERE codes = {12, 34};

OR

SELECT * FROM mytable 
WHERE codes = {34};
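The subsets inserted above can be generated rather than typed by hand. The following is a minimal sketch using only the Python standard library; the helper names `power_set` and `cql_set_literal` are made up for this example, and the output is just the text of the set literals, not a call into any Cassandra driver.

```python
from itertools import chain, combinations
from typing import Iterable, List, Tuple


def power_set(codes: Iterable[int]) -> List[Tuple[int, ...]]:
    """Return every subset of codes, smallest subsets first."""
    items = sorted(set(codes))
    return list(
        chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))
    )


def cql_set_literal(subset: Iterable[int]) -> str:
    """Format a subset as a CQL set literal such as {12, 34}."""
    return "{" + ", ".join(str(code) for code in subset) + "}"


subsets = power_set({12, 34, 53})
literals = [cql_set_literal(s) for s in subsets]
```

A set of n codes has 2^n subsets, so three codes produce the eight rows shown above, while ten codes would already need 1024 rows per original entry. That growth is worth checking before adopting this layout.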

Cassandra CQL where clause with multiple collection values?
