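At Gunwant's request, I'd like to provide more information about my problem.
I have a database with > 10^7 rows. Each row is a product with many different attributes (columns), e.g. title, description, price, weight, color, volume, warehouse location and so on. All of these attributes can change over time: the price may go up or down, the description may change, the product may be moved to another warehouse location, etc. All data is stored historically, for example: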
description | date | price | warehouse_location | color
Cucumber | 2017-01-14 | 50 | 23A | green
Cucumber | 2017-01-16 | 55 | 23A | green
Cucumber | 2017-01-19 | 52 | 14B | green
Pineapple | 2017-01-12 | 80 | 23A | yellow
Pineapple | 2017-01-17 | 75 | 23A | yellow
Pineapple | 2017-01-22 | 80 | 23A | yellow
Lemon | 2017-01-18 | 60 | 9C | yellow
Lemon | 2017-01-19 | 70 | 33E | yellow
Lemon | 2017-01-20 | 80 | 9A | yellow
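I now want to create arbitrary reports, and I need to be able to filter on every column.
For example: the price of all objects whose warehouse location was 23A between 2017-01-12 and 2017-01-18. If the same object has several matches for a given query, only the latest entry within that time span should be returned. In this case "Cucumber" should return "55" and "Pineapple" should return "75".
I need to be able to filter on multiple columns at once. Another example would be "color of all items with price > 60, price < 90, date > 2017-01-11 and date < 2017-01-22", which should return {yellow; yellow} for the data set above.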
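To make the intended semantics concrete, here is a minimal Python sketch that runs both example reports in memory against the sample rows above. It only illustrates the expected results, not how Cassandra would execute such a query; the helper name `latest_per_object` and the tuple layout are made up for this illustration.

```python
from datetime import date

# Sample rows from the table above:
# (description, date, price, warehouse_location, color)
ROWS = [
    ("Cucumber", date(2017, 1, 14), 50, "23A", "green"),
    ("Cucumber", date(2017, 1, 16), 55, "23A", "green"),
    ("Cucumber", date(2017, 1, 19), 52, "14B", "green"),
    ("Pineapple", date(2017, 1, 12), 80, "23A", "yellow"),
    ("Pineapple", date(2017, 1, 17), 75, "23A", "yellow"),
    ("Pineapple", date(2017, 1, 22), 80, "23A", "yellow"),
    ("Lemon", date(2017, 1, 18), 60, "9C", "yellow"),
    ("Lemon", date(2017, 1, 19), 70, "33E", "yellow"),
    ("Lemon", date(2017, 1, 20), 80, "9A", "yellow"),
]


def latest_per_object(rows, predicate):
    """Return, per description, the most recent row that satisfies predicate."""
    latest = {}
    for row in rows:
        if not predicate(row):
            continue
        name, day = row[0], row[1]
        if name not in latest or day > latest[name][1]:
            latest[name] = row
    return latest


# Example 1: price of everything in location 23A from 2017-01-12 to 2017-01-18.
by_location = latest_per_object(
    ROWS,
    lambda r: r[3] == "23A" and date(2017, 1, 12) <= r[1] <= date(2017, 1, 18),
)
prices = {name: row[2] for name, row in by_location.items()}

# Example 2: color of all items with 60 < price < 90
# and 2017-01-11 < date < 2017-01-22.
by_price = latest_per_object(
    ROWS,
    lambda r: 60 < r[2] < 90 and date(2017, 1, 11) < r[1] < date(2017, 1, 22),
)
colors = {name: row[4] for name, row in by_price.items()}

print(prices)  # {'Cucumber': 55, 'Pineapple': 75}
print(colors)  # {'Pineapple': 'yellow', 'Lemon': 'yellow'}
```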
Original question:
I want to store historical data in a Cassandra database:
objectid | date | price | foo
1 | 2017-01-18 | 200 | A
1 | 2017-01-19 | 300 | A
1 | 2017-01-20 | 400 | B
2 | 2017-01-18 | 100 | C
2 | 2017-01-19 | 150 | C
2 | 2017-01-20 | 200 | D
3 | 2017-01-18 | 400 | E
3 | 2017-01-19 | 350 | E
3 | 2017-01-20 | 300 | F
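I now want to select the latest entry for the "foo" column of each object that satisfies a condition. For example, for a query on price between 300 and 500 I would like to get the following: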
objectid | date | price | foo
1 | 2017-01-20 | 400 | B
3 | 2017-01-18 | 400 | E
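Are such queries possible in Cassandra?
Edit: Thanks everyone for your efforts. Marko Švaljek's answer seems to work if you only want the distinct values of foo. In my use case I have dozens of different "foo columns" and > 10^7 rows. I would apparently have to create hundreds of different "report" tables to allow arbitrary filtering - I'm not sure Cassandra is the right solution for this use case.
Answer 0 (score: 3)
As usual with Cassandra, you need to denormalize for this. I will assume your base table looks like this: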
create table base (
objectid int,
date timestamp,
price int,
foo text,
primary key (objectid, date)
);
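Be careful with this create statement, because historical data will usually go over 100k (with this key, all rows for one objectid end up in the same partition).
I then created the following insert statements: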
insert into base (objectid, date, price, foo) values (1, '2017-01-18', 200, 'A');
insert into base (objectid, date, price, foo) values (1, '2017-01-19', 300, 'A');
insert into base (objectid, date, price, foo) values (1, '2017-01-20', 400, 'B');
insert into base (objectid, date, price, foo) values (2, '2017-01-18', 100, 'C');
insert into base (objectid, date, price, foo) values (2, '2017-01-19', 150, 'C');
insert into base (objectid, date, price, foo) values (2, '2017-01-20', 200, 'D');
insert into base (objectid, date, price, foo) values (3, '2017-01-18', 400, 'E');
insert into base (objectid, date, price, foo) values (3, '2017-01-19', 350, 'E');
insert into base (objectid, date, price, foo) values (3, '2017-01-20', 300, 'F');
You can't get the query you want out of the box, but you can work around it.
You need to create another table:
create table report (
report text,
price int,
objectid int,
date timestamp,
foo text,
primary key (report, price, foo)
);
-- in cassandra if you want to search for something it has to go into clustering columns
-- and price is your first goal ... foo is there just for uniqueness
-- now you do inserts with data that you have above
-- perfectly o.k. to create multiple inserts in cassandra
insert into report (report, objectid, date, price, foo) values ('latest', 1, '2017-01-18', 200, 'A');
insert into report (report, objectid, date, price, foo) values ('latest', 1, '2017-01-19', 300, 'A');
insert into report (report, objectid, date, price, foo) values ('latest', 1, '2017-01-20', 400, 'B');
insert into report (report, objectid, date, price, foo) values ('latest', 2, '2017-01-18', 100, 'C');
insert into report (report, objectid, date, price, foo) values ('latest', 2, '2017-01-19', 150, 'C');
insert into report (report, objectid, date, price, foo) values ('latest', 2, '2017-01-20', 200, 'D');
insert into report (report, objectid, date, price, foo) values ('latest', 3, '2017-01-18', 400, 'E');
insert into report (report, objectid, date, price, foo) values ('latest', 3, '2017-01-19', 350, 'E');
insert into report (report, objectid, date, price, foo) values ('latest', 3, '2017-01-20', 300, 'F');
This gives you back:
select objectid, date, price, foo from report where report='latest' and price > 300 and price < 500;
objectid | date | price | foo
----------+---------------------------------+-------+-----
3 | 2017-01-18 23:00:00.000000+0000 | 350 | E
1 | 2017-01-19 23:00:00.000000+0000 | 400 | B
3 | 2017-01-17 23:00:00.000000+0000 | 400 | E
That is not what you want. You now have a couple of options.
Basically, if you leave price out of the primary key, you get:
create table report2 (
report text,
price int,
objectid int,
date timestamp,
foo text,
primary key (report, foo)
);
insert into report2 (report, objectid, date, price, foo) values ('latest', 1, '2017-01-18', 200, 'A');
insert into report2 (report, objectid, date, price, foo) values ('latest', 1, '2017-01-19', 300, 'A');
insert into report2 (report, objectid, date, price, foo) values ('latest', 1, '2017-01-20', 400, 'B');
insert into report2 (report, objectid, date, price, foo) values ('latest', 2, '2017-01-18', 100, 'C');
insert into report2 (report, objectid, date, price, foo) values ('latest', 2, '2017-01-19', 150, 'C');
insert into report2 (report, objectid, date, price, foo) values ('latest', 2, '2017-01-20', 200, 'D');
insert into report2 (report, objectid, date, price, foo) values ('latest', 3, '2017-01-18', 400, 'E');
insert into report2 (report, objectid, date, price, foo) values ('latest', 3, '2017-01-19', 350, 'E');
insert into report2 (report, objectid, date, price, foo) values ('latest', 3, '2017-01-20', 300, 'F');
select objectid, date, price, foo from report2 where report='latest';
objectid | date | price | foo
----------+---------------------------------+-------+-----
1 | 2017-01-18 23:00:00.000000+0000 | 300 | A
1 | 2017-01-19 23:00:00.000000+0000 | 400 | B
2 | 2017-01-18 23:00:00.000000+0000 | 150 | C
2 | 2017-01-19 23:00:00.000000+0000 | 200 | D
3 | 2017-01-18 23:00:00.000000+0000 | 350 | E
3 | 2017-01-19 23:00:00.000000+0000 | 300 | F
If you don't have too many foo values you might get away with filtering on the client side, but most of the time that is an anti-pattern.
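The client-side filtering mentioned here could look roughly like the following Python sketch. The rows are hard-coded copies of the report2 output shown above rather than the result of a real driver call, and `filter_by_price` is a hypothetical helper name.

```python
# Rows as returned by `select ... from report2 where report='latest'`,
# copied by hand as (objectid, price, foo). No Cassandra driver is used here.
REPORT2_ROWS = [
    (1, 300, "A"),
    (1, 400, "B"),
    (2, 150, "C"),
    (2, 200, "D"),
    (3, 350, "E"),
    (3, 300, "F"),
]


def filter_by_price(rows, low, high):
    """Keep rows whose price lies strictly between low and high."""
    return [row for row in rows if low < row[1] < high]


matches = filter_by_price(REPORT2_ROWS, 300, 500)
print(matches)  # [(1, 400, 'B'), (3, 350, 'E')]
```

The whole partition still has to travel to the client before anything is discarded, which is why this only scales to a small number of foo values.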
You can also use the query:
select objectid, date, price, foo from report2 where report='latest' and price > 300 and price < 500 allow filtering;
objectid | date | price | foo
----------+---------------------------------+-------+-----
1 | 2017-01-19 23:00:00.000000+0000 | 400 | B
3 | 2017-01-18 23:00:00.000000+0000 | 350 | E
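This is not ideal, but it kind of works.
The reason I created the 'latest' partition is that a partition stays on the same host. Depending on the workload you get, this may turn into a hot spot for you.
This is more or less the relational side of the story...
If you are really going with Cassandra, you have to prepare your views up front. So you would keep report2, but insert the data once for every statistics group you want to read out, i.e.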
insert into report2 (report, objectid, date, price, foo) values ('300-500', 1, '2017-01-19', 300, 'A');
... and so on
Then you would simply do:
select objectid, date, price, foo from report2 where report='300-500'
But I guess you want to set the range dynamically, so this is not what you want. This is more or less what basic Cassandra can do.
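As a rough sketch of the pre-bucketing idea, the Python below works out which report partitions a row belongs to and builds one insert per partition. The list of ranges, the helper names and the `%s` placeholder style are all assumptions made for this illustration; the bounds are treated as inclusive because the example above puts price 300 into the '300-500' report.

```python
# Hypothetical, fixed list of price ranges that each get their own report partition.
RANGES = [(0, 300), (300, 500), (500, 1000)]


def report_keys(price, ranges=RANGES):
    """Return the report partition keys (e.g. '300-500') a price falls into.

    Bounds are inclusive, so a price on a boundary lands in both neighbours.
    """
    return [f"{low}-{high}" for low, high in ranges if low <= price <= high]


def report_inserts(objectid, day, price, foo):
    """Build one (statement, parameters) pair per matching report partition."""
    cql = (
        "insert into report2 (report, objectid, date, price, foo) "
        "values (%s, %s, %s, %s, %s)"
    )
    return [(cql, (key, objectid, day, price, foo)) for key in report_keys(price)]


inserts = report_inserts(1, "2017-01-19", 300, "A")
for statement, params in inserts:
    print(params)
```

Every new range multiplies the number of writes per row, which is the cost of answering only the ranges you planned for.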
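Then there are always materialized views (they currently have some problems); personally I wouldn't use them for any super-important reports.
If the access pattern is unknown, there is always Apache Spark or some scripted solution to go over the data and create the views you need.
Answer 1 (score: 0)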
create table report (
report text,
price int,
objectid int,
date timestamp,
foo text,
primary key ((report, price), foo)
);
You can run a query like:
select * from report where token(report,price) > token('latest',200) and token(report,price) < token('latest',300);
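This is meant to give you the price range from 200 to 300. Note that token() compares the partitioner's hash values, so with the default Murmur3Partitioner the token order does not follow the price order.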
Edit after the question was changed -
create table product_history (
description text,
createdOn bigint,
warehouse_location text,
color text,
price int ,
primary key (description,createdOn, warehouse_location,price)
);
Use the table above to keep the history.
Listing your queries -
1 - Price of all objects with warehouse_location 23A from 2017-01-12 to 2017-01-18
Use this table -
create table product_by_latest_price (
description text, -- I believe description of the product is not going to change, otherwise , use some UUID per product.
createdOn bigint,
warehouse_location text,
color text,
price int static, -- this is shared value, per description.
primary key (description,createdOn, warehouse_location)
);
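For this table you never run update queries, you just keep inserting. For example -
-- createdOn is a bigint, so the dates are written as epoch milliseconds (UTC); this encoding is an assumption
insert into product_by_latest_price (description, createdOn, warehouse_location, color, price) values ('Cucumber', 1484179200000, '23A', 'green', 100); -- 2017-01-12
insert into product_by_latest_price (description, createdOn, warehouse_location, color, price) values ('Cucumber', 1484697600000, '23A', 'green', 200); -- 2017-01-18
Select query:
-- the partition key (description) is not restricted, so Cassandra requires ALLOW FILTERING here
select * from product_by_latest_price where warehouse_location = '23A' and createdOn >= 1484179200000 and createdOn <= 1484697600000 allow filtering;
The result would be the row with price 200.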
2 - Color of all objects with price > 60 and price < 90 and date > 2017-01-11 and date < 2017-01-22
create table product_by_date_price (
description text,
createdOn bigint,
warehouse_location text,
color text,
price int ,
primary key (createdOn,price)
);
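-- createdOn is a bigint; the bounds are 2017-01-12 and 2017-01-18 as epoch milliseconds (UTC), which is an assumed encoding
select * from product_by_date_price where token(createdOn) > token(1484179200000) and token(createdOn) < token(1484697600000) and price > 60 and price < 80;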
This will return duplicate rows, which you need to filter out at the application level.
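Filtering out the duplicates in the application could be as small as the following Python sketch. The result rows are invented for the illustration (one row per matching history entry, so each product repeats), and `distinct_colors` is a hypothetical helper name.

```python
def distinct_colors(rows):
    """Collapse duplicate (description, color) pairs, keeping first-seen order."""
    seen = set()
    result = []
    for description, color in rows:
        if (description, color) not in seen:
            seen.add((description, color))
            result.append((description, color))
    return result


# Hypothetical result set with one row per matching history entry.
rows = [
    ("Pineapple", "yellow"),
    ("Pineapple", "yellow"),
    ("Lemon", "yellow"),
    ("Lemon", "yellow"),
]
unique = distinct_colors(rows)
print(unique)  # [('Pineapple', 'yellow'), ('Lemon', 'yellow')]
```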
Use a batch for the inserts:
BEGIN BATCH
insert into product_history .....
insert into product_by_latest_price .....
insert into product_by_date_price .....
APPLY BATCH;
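I haven't tested any of these queries, but this should be useful. Cassandra 3.0 has materialized views, and you can take advantage of them. Remember to design your tables according to your queries. Don't hesitate to duplicate data.
Good luck.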