Searching records by value or by conditional values in a NoSQL database with 100 million records

Asked: 2018-04-16 16:16:36

Tags: cassandra nosql bigdata aerospike

We are looking for a NoSQL database in which we can store more than 100 million records, each with multiple fields.

The database should be searchable by value. We looked at Redis, but it does not offer any option to search by value. Since we have millions of records, we update some record fields and then need to retrieve the batch of records that have not been updated since a specific point in time.

Running a query over all records and then checking which ones have not been updated since that point in time therefore takes too long, because in this workflow we update 100-200 records per minute and then fetch records based on their values.

So Redis will not work here. We could store the data in MongoDB, but we are looking for a key-value database that supports searching by value.

{ 
    "_id" : ObjectId("5ac72e522188c962d024d0cd"), 
    "itemId" : 11.0, 
    "url" : "http://www.testurl.com", 
    "failed" : 0.0, 
    "proxyProvider" : "Test", 
    "isLocked" : false, 
    "syncDurationInMinute" : 60.0, 
    "lastUpdatedTimeUTC" : "", 
    "nextUpdateTimeUTC" : "", 
    "targetCountry" : "US", 
    "requestContentType" : "JSON", 
    "group" : "US"
}

1 Answer:

Answer 0 (score: 2)

In Aerospike, you can use predicate filtering to find records that have not been updated since a point in time, and return only the metadata of those records, which includes the record digest (its unique identifier). You can process the matched digests and do whatever update you need to do. This type of predicate filter is very fast because it only has to look at the primary index entry, which is kept in memory. See the examples in the Java client's repo.
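A minimal sketch of this with the Aerospike Java client is below. It uses the PredExp API available in 2018-era clients (newer client versions replace it with expression filters) to match records whose last-update-time metadata is older than a cutoff, and sets includeBinData = false so the server returns only record metadata. The namespace "test", set "items", and the 60-minute cutoff are placeholders, not anything from the question.

import java.util.ArrayList;
import java.util.Calendar;
import java.util.List;

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.policy.QueryPolicy;
import com.aerospike.client.query.PredExp;
import com.aerospike.client.query.RecordSet;
import com.aerospike.client.query.Statement;

public class StaleRecordScan {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);

        // Cutoff: records whose last-update-time is older than 60 minutes ago.
        Calendar cutoff = Calendar.getInstance();
        cutoff.add(Calendar.MINUTE, -60);

        Statement stmt = new Statement();
        stmt.setNamespace("test");   // placeholder namespace
        stmt.setSetName("items");    // placeholder set

        // Predicate filter on record metadata: last-update-time < cutoff.
        stmt.setPredExp(
            PredExp.recLastUpdate(),
            PredExp.integerValue(cutoff),
            PredExp.integerLess()
        );

        // Return metadata only (digest, generation, TTL) - no bin data is read.
        QueryPolicy policy = new QueryPolicy();
        policy.includeBinData = false;

        List<Key> staleKeys = new ArrayList<>();
        RecordSet rs = client.query(policy, stmt);
        try {
            while (rs.next()) {
                // The returned key carries the record digest.
                staleKeys.add(rs.getKey());
            }
        } finally {
            rs.close();
        }

        System.out.println("Records not updated in the last hour: " + staleKeys.size());
        client.close();
    }
}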

You would not need to use a secondary index here, because you want to scan all the records in a namespace (or set of that namespace) and just check the 'last-update-time' piece of metadata of each record. Since you'll be returning just the record's digest (unique ID) and not any of its actual data, this scan will never need to read anything from SSD. It'll be very fast and lightweight on the results (again, only metadata is sent back). In the client you'll iterate the result set, build a list of IDs and then act on those with a subsequent write.
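Once the digests are collected, the subsequent write mentioned above can address each record directly through the returned Key. A hedged follow-up sketch, reusing the staleKeys list from the previous example and the lastUpdatedTimeUTC bin name from the question's sample document (what you actually write is up to your application):

import java.util.List;

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import com.aerospike.client.policy.WritePolicy;

public class StaleRecordRefresh {
    // Update each matched record; the scan-returned Key contains the digest,
    // so the write reaches the record without knowing its original user key.
    static void refreshStaleRecords(AerospikeClient client, List<Key> staleKeys) {
        WritePolicy wp = new WritePolicy();
        for (Key key : staleKeys) {
            client.put(wp, key, new Bin("lastUpdatedTimeUTC", System.currentTimeMillis()));
        }
    }
}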