Cassandra: get the latest entry for each group

Time: 2017-01-20 16:20:00

Tags: database cassandra

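As requested by Gunwant, I would like to provide more information about my problem.

I have a database with > 10^7 rows. Each row is a product with many different attributes (columns), e.g. title, description, price, weight, color, volume, warehouse location and so on. However, all of these attributes can change: the price may go up or down, the description may change, the product may be moved to a different location in the warehouse, etc. All data is stored historically, for example: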

description |       date | price | warehouse_location |  color
   Cucumber | 2017-01-14 |    50 |                23A |  green
   Cucumber | 2017-01-16 |    55 |                23A |  green
   Cucumber | 2017-01-19 |    52 |                14B |  green
  Pineapple | 2017-01-12 |    80 |                23A | yellow
  Pineapple | 2017-01-17 |    75 |                23A | yellow
  Pineapple | 2017-01-22 |    80 |                23A | yellow
      Lemon | 2017-01-18 |    60 |                 9C | yellow
      Lemon | 2017-01-19 |    70 |                33E | yellow
      Lemon | 2017-01-20 |    80 |                 9A | yellow

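I now want to create arbitrary reports, and I need to be able to filter on every column.

For example: the price of all objects with warehouse location 23A between 2017-01-12 and 2017-01-18. If the same object has multiple matches for a given query, only the latest entry within that time span should be returned. In this case, "Cucumber" should return "55" and "Pineapple" should return "75".

I need to be able to filter on multiple columns at once. Another example would be "color of all items with price > 60, price < 90, date > 2017-01-11, date < 2017-01-22", which should return {yellow; yellow} for the dataset above.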
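To make the intended semantics concrete, here is a minimal in-memory sketch in Python (not Cassandra code). It assumes the sample rows above, treats the description as the object identity, and treats the date range in the first example as inclusive; the helper name `latest_per_object` is made up for illustration.

```python
from datetime import date

# Sample history rows copied from the table above:
# (description, date, price, warehouse_location, color)
ROWS = [
    ("Cucumber", date(2017, 1, 14), 50, "23A", "green"),
    ("Cucumber", date(2017, 1, 16), 55, "23A", "green"),
    ("Cucumber", date(2017, 1, 19), 52, "14B", "green"),
    ("Pineapple", date(2017, 1, 12), 80, "23A", "yellow"),
    ("Pineapple", date(2017, 1, 17), 75, "23A", "yellow"),
    ("Pineapple", date(2017, 1, 22), 80, "23A", "yellow"),
    ("Lemon", date(2017, 1, 18), 60, "9C", "yellow"),
    ("Lemon", date(2017, 1, 19), 70, "33E", "yellow"),
    ("Lemon", date(2017, 1, 20), 80, "9A", "yellow"),
]


def latest_per_object(rows, predicate):
    """Return {description: row} with the newest matching row per object."""
    latest = {}
    for row in rows:
        if not predicate(row):
            continue
        name, day = row[0], row[1]
        # Keep the row only if it is newer than what we have for this object.
        if name not in latest or day > latest[name][1]:
            latest[name] = row
    return latest


# Example 1: price of all objects in 23A from 2017-01-12 to 2017-01-18.
in_23a = latest_per_object(
    ROWS,
    lambda r: r[3] == "23A" and date(2017, 1, 12) <= r[1] <= date(2017, 1, 18),
)
prices = {name: row[2] for name, row in in_23a.items()}

# Example 2: color of all items with 60 < price < 90 and
# 2017-01-11 < date < 2017-01-22.
mid_priced = latest_per_object(
    ROWS,
    lambda r: 60 < r[2] < 90 and date(2017, 1, 11) < r[1] < date(2017, 1, 22),
)
colors = [row[4] for row in mid_priced.values()]

print(prices)
print(colors)
```

The point of the sketch is only the "filter first, then keep the newest row per object" rule; doing this over > 10^7 rows on the client is exactly what I would like the database to handle.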

Original question:

I want to store historical data in a Cassandra database:

objectid |       date | price | foo
       1 | 2017-01-18 |   200 |   A
       1 | 2017-01-19 |   300 |   A
       1 | 2017-01-20 |   400 |   B
       2 | 2017-01-18 |   100 |   C
       2 | 2017-01-19 |   150 |   C
       2 | 2017-01-20 |   200 |   D
       3 | 2017-01-18 |   400 |   E
       3 | 2017-01-19 |   350 |   E
       3 | 2017-01-20 |   300 |   F

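I now want to select the latest entry of the "foo" column for every object that satisfies a condition. For example, for a query on price between 300 and 500, I would like to get the following: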

objectid |       date | price | foo
       1 | 2017-01-20 |   400 |   B
       3 | 2017-01-18 |   400 |   E

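Are such queries possible in Cassandra?

Edit: Thank you all for your efforts. Marko Švaljek's answer seems to work if you only want to get the distinct values of foo. In my use case I have dozens of different "foo columns" and > 10^7 rows. I would apparently have to create hundreds of different "report" tables to allow arbitrary filtering. I am not sure Cassandra is the right solution for this use case.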

2 answers:

Answer 0: (score: 3)

As always with Cassandra, you need to denormalize for this. I will assume that your base table looks like this:

create table base (
    objectid int,
    date timestamp,
    price int,
    foo text,
    primary key (objectid, date)
);

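Be careful with this create statement, because historical data will often grow past 100k entries.

Then I created the following insert statements: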

 insert into base (objectid, date, price, foo) values (1, '2017-01-18', 200, 'A');
 insert into base (objectid, date, price, foo) values (1, '2017-01-19', 300, 'A');
 insert into base (objectid, date, price, foo) values (1, '2017-01-20', 400, 'B');
 insert into base (objectid, date, price, foo) values (2, '2017-01-18', 100, 'C');
 insert into base (objectid, date, price, foo) values (2, '2017-01-19', 150, 'C');
 insert into base (objectid, date, price, foo) values (2, '2017-01-20', 200, 'D');
 insert into base (objectid, date, price, foo) values (3, '2017-01-18', 400, 'E');
 insert into base (objectid, date, price, foo) values (3, '2017-01-19', 350, 'E');
 insert into base (objectid, date, price, foo) values (3, '2017-01-20', 300, 'F');

The query you want is not possible out of the box. But you can work around it.

You need to create another table:

create table report (
    report text,
    price int,
    objectid int,
    date timestamp,
    foo text,
    primary key (report, price, foo)
);

-- in cassandra if you want to search for something it has to go into clustering columns
-- and price is your first goal ... foo is there just for uniqueness 
-- now you do inserts with data that you have above
-- perfectly o.k. to create multiple inserts in cassandra 
insert into report (report, objectid, date, price, foo) values ('latest', 1, '2017-01-18', 200, 'A');
insert into report (report, objectid, date, price, foo) values ('latest', 1, '2017-01-19', 300, 'A');
insert into report (report, objectid, date, price, foo) values ('latest', 1, '2017-01-20', 400, 'B');
insert into report (report, objectid, date, price, foo) values ('latest', 2, '2017-01-18', 100, 'C');
insert into report (report, objectid, date, price, foo) values ('latest', 2, '2017-01-19', 150, 'C');
insert into report (report, objectid, date, price, foo) values ('latest', 2, '2017-01-20', 200, 'D');
insert into report (report, objectid, date, price, foo) values ('latest', 3, '2017-01-18', 400, 'E');
insert into report (report, objectid, date, price, foo) values ('latest', 3, '2017-01-19', 350, 'E');
insert into report (report, objectid, date, price, foo) values ('latest', 3, '2017-01-20', 300, 'F');

This would give you back:

select objectid, date, price, foo from report where report='latest' and price > 300 and price < 500;

 objectid | date                            | price | foo
----------+---------------------------------+-------+-----
        3 | 2017-01-18 23:00:00.000000+0000 |   350 |   E
        1 | 2017-01-19 23:00:00.000000+0000 |   400 |   B
        3 | 2017-01-17 23:00:00.000000+0000 |   400 |   E

This is not what you wanted. You now have a few options.

Basically, if you exclude price from the primary key, you get:

create table report2 (
    report text,
    price int,
    objectid int,
    date timestamp,
    foo text,
    primary key (report, foo)
 );

insert into report2 (report, objectid, date, price, foo) values ('latest', 1, '2017-01-18', 200, 'A');
insert into report2 (report, objectid, date, price, foo) values ('latest', 1, '2017-01-19', 300, 'A');
insert into report2 (report, objectid, date, price, foo) values ('latest', 1, '2017-01-20', 400, 'B');
insert into report2 (report, objectid, date, price, foo) values ('latest', 2, '2017-01-18', 100, 'C');
insert into report2 (report, objectid, date, price, foo) values ('latest', 2, '2017-01-19', 150, 'C');
insert into report2 (report, objectid, date, price, foo) values ('latest', 2, '2017-01-20', 200, 'D');
insert into report2 (report, objectid, date, price, foo) values ('latest', 3, '2017-01-18', 400, 'E');
insert into report2 (report, objectid, date, price, foo) values ('latest', 3, '2017-01-19', 350, 'E');
insert into report2 (report, objectid, date, price, foo) values ('latest', 3, '2017-01-20', 300, 'F');

select objectid, date, price, foo from report2 where report='latest';

 objectid | date                            | price | foo
----------+---------------------------------+-------+-----
        1 | 2017-01-18 23:00:00.000000+0000 |   300 |   A
        1 | 2017-01-19 23:00:00.000000+0000 |   400 |   B
        2 | 2017-01-18 23:00:00.000000+0000 |   150 |   C
        2 | 2017-01-19 23:00:00.000000+0000 |   200 |   D
        3 | 2017-01-18 23:00:00.000000+0000 |   350 |   E
        3 | 2017-01-19 23:00:00.000000+0000 |   300 |   F

If you don't have too many foos, you could get away with filtering this on the client side, but most of the time that is an antipattern.
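As a rough illustration of why report2 ends up with six rows and what client-side filtering would look like, here is a small Python model (not driver code). It assumes that an insert with an existing primary key (report, foo) overwrites the earlier row, which is how Cassandra inserts behave; the function names are made up, and the time zone shift visible in the cqlsh output is not modeled.

```python
# The nine inserts from above, in order: (report, objectid, date, price, foo)
INSERTS = [
    ("latest", 1, "2017-01-18", 200, "A"),
    ("latest", 1, "2017-01-19", 300, "A"),
    ("latest", 1, "2017-01-20", 400, "B"),
    ("latest", 2, "2017-01-18", 100, "C"),
    ("latest", 2, "2017-01-19", 150, "C"),
    ("latest", 2, "2017-01-20", 200, "D"),
    ("latest", 3, "2017-01-18", 400, "E"),
    ("latest", 3, "2017-01-19", 350, "E"),
    ("latest", 3, "2017-01-20", 300, "F"),
]


def apply_upserts(inserts):
    """Model a table with primary key (report, foo): last write per key wins."""
    table = {}
    for report, objectid, day, price, foo in inserts:
        table[(report, foo)] = {
            "objectid": objectid,
            "date": day,
            "price": price,
            "foo": foo,
        }
    return table


def filter_by_price(table, report, low, high):
    """Client-side filter: rows of one report with low < price < high."""
    ordered = sorted(table.items(), key=lambda item: item[0])  # order by foo
    return [
        row
        for (rep, _foo), row in ordered
        if rep == report and low < row["price"] < high
    ]


table = apply_upserts(INSERTS)
matches = filter_by_price(table, "latest", 300, 500)

for row in matches:
    print(row)
```

Every row of the partition has to be pulled to the client before the filter runs, which is why this only scales to a small number of foos.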

You can also use the query:

select objectid, date, price, foo from report2 where report='latest' and price > 300 and price < 500 allow filtering;


 objectid | date                            | price | foo
----------+---------------------------------+-------+-----
        1 | 2017-01-19 23:00:00.000000+0000 |   400 |   B
        3 | 2017-01-18 23:00:00.000000+0000 |   350 |   E

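Which is not ideal, but it kind of works.

The reason I created the partition 'latest' is that a partition stays on the same host. Depending on the workload you have, this might become a hot spot for you.

This is more or less the relational side of the story...

If you are really going with Cassandra, you have to prepare the views up front. So you would have report2, but you would insert the data for every stat group that you want to get out, i.e.: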

insert into report2 (report, objectid, date, price, foo) values ('300-500', 1, '2017-01-19', 300, 'A');
... and so on

Then you would just do:

select objectid, date, price, foo from report2 where report='300-500';

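But I guess you want to set the ranges dynamically, so this is not what you want. This is more or less what basic Cassandra offers.

Then there are always materialized views (currently they have some issues), and I personally would not use them for any super important reports.

If the access pattern is not known, there is always Apache Spark or some scripting solution that goes over the data and creates the views you need.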
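A scripting solution of that kind could look roughly like the following Python sketch. It assumes the base rows are already available in memory and that the report names and ranges (the `REPORT_RANGES` dictionary, which is hypothetical) are known up front; bounds are treated as inclusive so that the price 300 row lands in '300-500', as in the insert shown earlier. It only generates the CQL text and does not talk to a cluster.

```python
# Base rows as in the question: (objectid, date, price, foo)
BASE_ROWS = [
    (1, "2017-01-18", 200, "A"),
    (1, "2017-01-19", 300, "A"),
    (1, "2017-01-20", 400, "B"),
    (2, "2017-01-18", 100, "C"),
    (2, "2017-01-19", 150, "C"),
    (2, "2017-01-20", 200, "D"),
    (3, "2017-01-18", 400, "E"),
    (3, "2017-01-19", 350, "E"),
    (3, "2017-01-20", 300, "F"),
]

# Hypothetical report definitions: report name -> (low, high), inclusive.
REPORT_RANGES = {
    "100-200": (100, 200),
    "300-500": (300, 500),
}


def build_report_inserts(rows, ranges):
    """Generate one report2 insert per (row, matching report range)."""
    statements = []
    for objectid, day, price, foo in rows:
        for name, (low, high) in ranges.items():
            if low <= price <= high:
                statements.append(
                    "insert into report2 (report, objectid, date, price, foo) "
                    f"values ('{name}', {objectid}, '{day}', {price}, '{foo}');"
                )
    return statements


statements = build_report_inserts(BASE_ROWS, REPORT_RANGES)
statements_300_500 = [s for s in statements if "'300-500'" in s]

for statement in statements:
    print(statement)
```

With a real driver you would bind the values to a prepared statement instead of formatting strings. The sketch also shows the cost of this approach: every new range means another pass over the data and another set of rows to keep in sync.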

Answer 1: (score: 0)

create table report (
report text,
price int,
objectid int,
date timestamp,
foo text,
primary key ((report, price), foo)
);

You can run a query like this:

select * from report where token (report,price) > token('latest',200) and  token (report,price) < token('latest',300);

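This should give you the price range from 200 to 300. Note that this only holds with an order-preserving partitioner; with the default Murmur3Partitioner, tokens are hashes and do not follow price order.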

Edit after the question was changed -

create table product_history (
description  text,
createdOn bigint,
warehouse_location  text,
color text,
price int ,
primary key (description,createdOn, warehouse_location,price)
);

Use the above table to keep the history.

List of your queries -

1 -  Price of all objects with warehouse_location 23A from 2017-01-12 to 2017-01-18

Use this table -

create table product_by_latest_price (
description  text, -- I believe description of the product is not going to change, otherwise , use some UUID per product. 
createdOn bigint,
warehouse_location  text,
color text,
price int static,   -- this is shared value, per description.
primary key (description,createdOn, warehouse_location)
);

For this table you do not use update queries, just keep inserting. For example -

  -- createdOn is a bigint; the values below assume epoch milliseconds (UTC) for 2017-01-12 and 2017-01-18
  insert into product_by_latest_price (description, createdOn, warehouse_location, color, price) values ('Cucumber', 1484179200000, '23A', 'green', 100);
  insert into product_by_latest_price (description, createdOn, warehouse_location, color, price) values ('Cucumber', 1484697600000, '23A', 'green', 200);

Select query

select * from product_by_latest_price where warehouse_location = '23A' and createdOn >= 1484179200000 and createdOn <= 1484697600000;

The result would be the row with price 200.

2 -  Color of all objects with price > 60 and price < 90 and date > 2017-01-11 and date < 2017-01-22

create table product_by_date_price (
description  text,
createdOn bigint,
warehouse_location  text,
color text,
price int ,
primary key (createdOn,price)
);

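-- createdOn is a bigint; the bounds assume epoch milliseconds (UTC) for 2017-01-11 and 2017-01-22
select * from product_by_date_price where token(createdOn) > token(1484092800000) and token(createdOn) < token(1485043200000) and price > 60 and price < 90;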

This will return duplicate rows, which you need to filter out at the application level.

Use batches for the inserts

BEGIN BATCH 

 insert into product_history .....
 insert into product_by_latest_price .....
 insert into product_by_date_price .....

APPLY BATCH;

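I have not tested any of these queries, but this should be helpful. Cassandra 3.0 has views, and you can make use of them. Remember to design your tables according to your queries. Do not hesitate to duplicate data.

Good luck.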