Question

So, my original problem was using the token() function to page through a large data set in Cassandra 1.2.9, as explained and answered here: Paging large resultsets in Cassandra with CQL3 with varchar keys

The accepted answer got the select working with tokens and a chunk size, but another problem showed up.

My table looks like this in cqlsh:

key           | column1               | value
---------------+-----------------------+-------
  85.166.4.140 |       county_finnmark |     4
  85.166.4.140 |       county_id_20020 |     4
  85.166.4.140 |     municipality_alta |     2
  85.166.4.140 | municipality_id_20441 |     2
 93.89.124.241 |        county_hedmark |    24
 93.89.124.241 |       county_id_20005 |    24

The primary key is a combination of key and column1. In the CLI, the same data looks like this:

get ip['85.166.4.140'];
=> (counter=county_finnmark, value=4)
=> (counter=county_id_20020, value=4)
=> (counter=municipality_alta, value=2)
=> (counter=municipality_id_20441, value=2)
Returned 4 results.

The problem

When using cql with a limit of e.g. 100, the returned results may stop in the middle of a record, like this:

key           | column1               | value
---------------+-----------------------+-------
  85.166.4.140 |       county_finnmark |     4
  85.166.4.140 |       county_id_20020 |     4

leaving these "rows" (columns) out:

  85.166.4.140 |     municipality_alta |     2
  85.166.4.140 | municipality_id_20441 |     2

Now, when I use the token() function for the next page, these two rows are skipped:

select * from ip where token(key) > token('85.166.4.140') limit 10;

Result:

key           | column1                | value
---------------+------------------------+-------
 93.89.124.241 |         county_hedmark |    24
 93.89.124.241 |        county_id_20005 |    24
 95.169.53.204 |        county_id_20006 |     2
 95.169.53.204 |         county_oppland |     2

So, there is no trace of the last two results from the previous IP address (85.166.4.140).
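To make the skipping concrete, here is a minimal in-memory sketch of the two queries. Nothing in it talks to Cassandra: `TOKENS` is a made-up stand-in for the partitioner hash, `ROWS` copies the sample data above, and the helper names `first_page` and `page_after_token` are hypothetical. The limit is shrunk to 2 so the cut lands inside the first IP.

```python
# Toy model only: TOKENS is an invented stand-in for Cassandra's partitioner
# hash; real token values look nothing like this.
TOKENS = {"85.166.4.140": 10, "93.89.124.241": 20, "95.169.53.204": 30}

# Rows in the order CQL returns them: by token(key), then by column1.
ROWS = [
    ("85.166.4.140", "county_finnmark", 4),
    ("85.166.4.140", "county_id_20020", 4),
    ("85.166.4.140", "municipality_alta", 2),
    ("85.166.4.140", "municipality_id_20441", 2),
    ("93.89.124.241", "county_hedmark", 24),
    ("93.89.124.241", "county_id_20005", 24),
    ("95.169.53.204", "county_id_20006", 2),
    ("95.169.53.204", "county_oppland", 2),
]


def first_page(limit):
    """Models: select * from ip limit <limit>;"""
    return ROWS[:limit]


def page_after_token(last_key, limit):
    """Models: select * from ip where token(key) > token(<last_key>) limit <limit>;"""
    matching = [row for row in ROWS if TOKENS[row[0]] > TOKENS[last_key]]
    return matching[:limit]


page1 = first_page(2)                       # stops in the middle of 85.166.4.140
page2 = page_after_token(page1[-1][0], 10)  # starts at the next partition key
seen = set(page1) | set(page2)
skipped = [row for row in ROWS if row not in seen]
print(skipped)  # the two municipality_* rows of 85.166.4.140
```

The limit counts cql rows, while token(key) > token(...) can only resume at the next partition key, so whatever is left of the last partition on a page is never read.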

Question

How can I use token() for paging without skipping over cql rows? Something like:

select * from ip where token(key) > token(key:column1) limit 10;

Answer 1

OK, so I used the info in this post to work out a solution: http://www.datastax.com/dev/blog/cql3-table-support-in-hadoop-pig-and-hive (section "CQL3 pagination").

First, I execute this cql:

select * from ip limit 5000;

From the last row in the resultset, I get the key (i.e. '85.166.4.140') and the value of column1 (i.e. 'county_id_20020').

Then I create a prepared statement evaluating to

select * from ip where token(key) = token('85.166.4.140') and column1 > 'county_id_20020' ALLOW FILTERING;

(I'm guessing it would also work without the token() function, since the check is now for equality:)

select * from ip where key = '85.166.4.140' and column1 > 'county_id_20020' ALLOW FILTERING;

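The resultset now contains the remaining X rows (columns) for this IP. The method then returns all the rows, and the next call to the method includes the last used key ('85.166.4.140'). With this key, I can execute the following select: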

select * from ip where token(key) > token('85.166.4.140') limit 5000;

which gives me the next 5000 rows from (and including) the first IP after '85.166.4.140'.

Now, no columns are lost in the paging.
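For anyone who wants to see the whole loop in one place, here is a rough sketch of the three steps against an in-memory stand-in for the table. It is not driver code: `TOKENS` is an invented replacement for the partitioner hash, `ROWS` is the sample data from the question, and `select_limit`, `rest_of_partition`, `select_after_token` and `iter_all_rows` are hypothetical names that only mimic the three selects described above.

```python
# Toy model only: TOKENS is an invented stand-in for Cassandra's partitioner hash.
TOKENS = {"85.166.4.140": 10, "93.89.124.241": 20, "95.169.53.204": 30}

# Rows in the order CQL returns them: by token(key), then by column1.
ROWS = [
    ("85.166.4.140", "county_finnmark", 4),
    ("85.166.4.140", "county_id_20020", 4),
    ("85.166.4.140", "municipality_alta", 2),
    ("85.166.4.140", "municipality_id_20441", 2),
    ("93.89.124.241", "county_hedmark", 24),
    ("93.89.124.241", "county_id_20005", 24),
    ("95.169.53.204", "county_id_20006", 2),
    ("95.169.53.204", "county_oppland", 2),
]


def select_limit(limit):
    """Models: select * from ip limit <limit>;"""
    return ROWS[:limit]


def rest_of_partition(key, last_column1):
    """Models: select * from ip where key = <key> and column1 > <last_column1>;"""
    return [row for row in ROWS if row[0] == key and row[1] > last_column1]


def select_after_token(last_key, limit):
    """Models: select * from ip where token(key) > token(<last_key>) limit <limit>;"""
    matching = [row for row in ROWS if TOKENS[row[0]] > TOKENS[last_key]]
    return matching[:limit]


def iter_all_rows(chunk):
    """Yield every row exactly once, fetching at most `chunk` rows per page."""
    page = select_limit(chunk)
    while page:
        yield from page
        last_key, last_column1, _ = page[-1]
        # Finish the partition that the limit may have cut in half.
        yield from rest_of_partition(last_key, last_column1)
        # Then move on to the partitions after the last key.
        page = select_after_token(last_key, chunk)


all_rows = list(iter_all_rows(chunk=2))
print(len(all_rows), all_rows == ROWS)
```

The extra "rest of the partition" select after each page is what keeps the columns from getting lost; it simply returns nothing when the page happened to end on a partition boundary.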

UPDATE

Cassandra 2.0 introduced automatic paging, handled by the client. More info here: http://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0

(Note that setFetchSize is optional and not necessary for paging to work.)

Result set paging in Cassandra with a composite primary key - missing out on rows
