We are building a simplified search engine on top of MongoDB.
Sample data set
{ "_id" : 1, "dept" : "tech", "updDate": ISODate("2014-08-27T09:45:35Z"), "description" : "lime green computer" }
{ "_id" : 2, "dept" : "tech", "updDate": ISODate("2014-07-27T09:45:35Z"), "description" : "wireless red mouse" }
{ "_id" : 3, "dept" : "kitchen", "updDate": ISODate("2014-04-27T09:45:35Z"), "description" : "green placemat" }
{ "_id" : 4, "dept" : "kitchen", "updDate": ISODate("2014-05-27T09:45:35Z"), "description" : "red peeler" }
{ "_id" : 5, "dept" : "food", "updDate": ISODate("2014-04-27T09:45:35Z"), "description" : "green apple" }
{ "_id" : 6, "dept" : "food", "updDate": ISODate("2014-01-27T09:45:35Z"), "description" : "red potato" }
{ "_id" : 7, "dept" : "food", "updDate": ISODate("2014-08-28T09:45:35Z"), "description" : "lime green computer" }
{ "_id" : 8, "dept" : "food", "updDate": ISODate("2014-08-27T09:45:35Z"), "description" : "lime green computer" }
{ "_id" : 9, "dept" : "food", "updDate": ISODate("2014-08-27T09:45:35Z"), "description" : "lime green computer" }
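We want to avoid using "offset-limit" to paginate the results. To do so, we use the "seek method", which basically modifies the "where / match" clause of the query so that the required results can be fetched through an index instead of by iterating over the collection. For more information about the seek method, I strongly recommend reading http://use-the-index-luke.com/blog/2013-07/pagination-done-the-postgresql-way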
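To make the difference concrete, here is a minimal in-memory sketch in plain JavaScript (no database involved). The helper names `compareDesc`, `keyOf`, `offsetPage` and `seekPage` are made up for this illustration, and the sort key is simplified to (updDate, _id), both descending. Dates are kept as ISO 8601 strings so that plain string comparison orders them correctly.

```javascript
// Compare two sort keys field by field, all fields in descending order.
// Returns a negative number when `a` sorts before `b`.
function compareDesc(a, b) {
  for (let i = 0; i < a.length; i++) {
    if (a[i] > b[i]) return -1;
    if (a[i] < b[i]) return 1;
  }
  return 0;
}

// Hypothetical sort key for this sketch: [updDate, _id].
function keyOf(doc) {
  return [doc.updDate, doc._id];
}

// Offset-limit: the first `offset` rows are produced only to be thrown away.
function offsetPage(sortedDocs, offset, limit) {
  return sortedDocs.slice(offset, offset + limit);
}

// Seek: keep only the rows that sort strictly after the last row already
// shown. In a database this predicate is what an index can serve directly;
// here it is just an array filter.
function seekPage(sortedDocs, lastKey, limit) {
  const rest = lastKey === null
    ? sortedDocs
    : sortedDocs.filter(function (d) {
        return compareDesc(keyOf(d), lastKey) > 0;
      });
  return rest.slice(0, limit);
}

// A subset of the sample data set, sorted by updDate desc, _id desc.
const docs = [
  { _id: 3, updDate: "2014-04-27T09:45:35Z" },
  { _id: 5, updDate: "2014-04-27T09:45:35Z" },
  { _id: 7, updDate: "2014-08-28T09:45:35Z" },
  { _id: 8, updDate: "2014-08-27T09:45:35Z" },
  { _id: 9, updDate: "2014-08-27T09:45:35Z" },
].sort(function (a, b) {
  return compareDesc(keyOf(a), keyOf(b));
});

const page1 = seekPage(docs, null, 2);
const page2 = seekPage(docs, keyOf(page1[page1.length - 1]), 2);
```

Both approaches return the same rows for the second page here; the seek version only needs the key of the last row already shown, not a count of rows to skip.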
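Search engines usually sort results by score and update date in descending order. To achieve this, we use the text search feature inside an aggregation pipeline, as shown below.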
db.inventory.createIndex({description:"text", dept: -1, updDate: -1, _id: -1})
First page
db.inventory.aggregate( [ { $match: { dept : {$in : ["food","kitchen"]},"$text" : { "$language" : "en", "$search" : "green"} } },{ $project: {score: { $meta: "textScore" }, description : 1, updDate : 1, _id: 1 } }, { $sort: { "score" : -1, "updDate" : -1, _id: -1 } }, {$limit: 2 }] )
{ "_id" : 5, "updDate" : ISODate("2014-04-27T09:45:35Z"), "description" : "green apple", "score" : 0.75 }
{ "_id" : 3, "updDate" : ISODate("2014-04-27T09:45:35Z"), "description" : "green placemat", "score" : 0.75 }
Second page
db.inventory.aggregate( [ { $match: { dept : {$in : ["food","kitchen"]},"$text" : { "$language" : "en", "$search" : "green"} } },{ $project: {score: { $meta: "textScore" }, description : 1, updDate : 1, _id: 1 } }, { $sort: { "score" : -1, "updDate" : -1, _id: -1 } }, { "$match" : { "$or" : [ { "score" : { "$lt" : 0.75}} , { "$and" : [ { "score" : { "$eq" : 0.75}} , { "$or" : [ { "updDate" : { "$lt" : ISODate("2014-04-27T09:45:35Z")}},{ "$and" : [ { "updDate": { "$eq" : ISODate("2014-04-27T09:45:35Z")}} , { "_id" : { "$lt" : 3}}]}]}]}]}},{$limit: 2 }] )
{ "_id" : 7, "updDate" : ISODate("2014-08-28T09:45:35Z"), "description" : "lime green computer", "score" : 0.6666666666666666 }
{ "_id" : 9, "updDate" : ISODate("2014-08-27T09:45:35Z"), "description" : "lime green computer", "score" : 0.6666666666666666 }
Last page
db.inventory.aggregate( [ { $match: { dept : {$in : ["food","kitchen"]} , "$text" : { "$language" : "en", "$search" : "green"} } }, { $project: {score: { $meta: "textScore" }, description : 1, updDate : 1, _id: 1 } }, { $sort: { "score" : -1, "updDate" : -1, _id: -1 } }, { "$match" : { "$or" : [ { "score" : { "$lt" : 0.6666666666666666}} , { "$and" : [ { "score" : { "$eq" : 0.6666666666666666}} , { "$or" : [ { "updDate" : { "$lt" : ISODate("2014-08-27T09:45:35Z")}} , { "$and" : [ { "updDate" : { "$eq" : ISODate("2014-08-27T09:45:35Z")}} , { "_id" : { "$lt" : 9}}]}]}]}]}}, {$limit: 2 }] )
{ "_id" : 8, "updDate" : ISODate("2014-08-27T09:45:35Z"), "description" : "lime green computer", "score" : 0.6666666666666666 }
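Notice how we sort the results by score, updDate and _id, and how the second match stage tries to paginate them using the score, the update date and finally the _id of the last document on the previous page.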
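The pagination stage has the same shape on every page, so it can be generated from the last document of the previous page. The following is only a sketch: `buildSeekMatch` is a hypothetical helper name, and it assumes the sort order is exactly score, updDate, _id, all descending. Like the queries above, it compares the floating-point score with `$eq`, which relies on the score being passed back unchanged.

```javascript
// Hypothetical helper: builds the second $match stage from the last document
// of the previous page. Assumes the sort { score: -1, updDate: -1, _id: -1 }.
function buildSeekMatch(last) {
  return {
    $match: {
      $or: [
        // Strictly lower score: always on a later page.
        { score: { $lt: last.score } },
        {
          $and: [
            // Same score: fall back to the update date.
            { score: { $eq: last.score } },
            {
              $or: [
                { updDate: { $lt: last.updDate } },
                {
                  $and: [
                    // Same score and date: _id is the final tie-breaker.
                    { updDate: { $eq: last.updDate } },
                    { _id: { $lt: last._id } }
                  ]
                }
              ]
            }
          ]
        }
      ]
    }
  };
}

// Last document of the first page shown above.
var lastOfFirstPage = {
  _id: 3,
  updDate: new Date("2014-04-27T09:45:35Z"),
  score: 0.75
};
var secondPageMatch = buildSeekMatch(lastOfFirstPage);
```

The resulting stage can be placed between the `$sort` and `$limit` stages, where the hand-written version sits in the second-page query.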
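The index was created taking into account that a text query cannot cover the prefix fields of a text index (see issue https://jira.mongodb.org/browse/SERVER-13018), although I am not sure whether that applies to our case.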
The "executionStats" and "allPlansExecution" explain modes do not work with the aggregation framework (see https://jira.mongodb.org/browse/SERVER-19758), so I do not know how MongoDB tries to resolve the query.
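For what it is worth, the `aggregate` helper does accept an `explain: true` option (assuming MongoDB 2.6 or later). As far as I can tell it only reports the plan chosen for the leading stages, without execution statistics, so treat the following as a sketch rather than a full answer. The variable name `firstPagePipeline` is introduced here for illustration, and the guard lets the snippet load outside the mongo shell without failing.

```javascript
// The first-page pipeline from above, kept as a plain value for reuse.
var firstPagePipeline = [
  { $match: { dept: { $in: ["food", "kitchen"] },
              $text: { $language: "en", $search: "green" } } },
  { $project: { score: { $meta: "textScore" }, description: 1, updDate: 1, _id: 1 } },
  { $sort: { score: -1, updDate: -1, _id: -1 } },
  { $limit: 2 }
];

// Runs only inside the mongo shell, where `db` and `printjson` exist.
if (typeof db !== "undefined") {
  printjson(db.inventory.aggregate(firstPagePipeline, { explain: true }));
}
```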
Index intersection does not work with text search either; see https://jira.mongodb.org/browse/SERVER-3071 (resolved in 2.5.5) and http://blog.mongodb.org/post/87790974798/efficient-indexing-in-mongodb-26, where the author says:
As of version 2.6.0, you cannot intersect with geo or text indices and you can intersect at most 2 separate indices with each other. These limitations are likely to change in a future release.
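After reading sections 3.4 (Text Search Tutorials) and 3.5 (Indexing Strategies) of https://docs.mongodb.org/manual/MongoDB-indexes-guide-master.pdf, I did not reach any clear conclusion.
So what is the best indexing strategy for this collection from a text search perspective?
One index for the first match stage and another one for the second (pagination) match stage?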
db.inventory.createIndex({description:"text", dept: -1})
db.inventory.createIndex({updDate: -1, _id: -1})
A compound index that takes into account the fields of both match stages?
db.inventory.createIndex({description:"text", dept: -1, updDate: -1, _id: -1})
None of the above?