I have a set of documents annotated with hierarchical taxonomy labels, e.g.:
[
{
"id": 1,
"title": "a funny book",
"authors": ["Jean Bon", "Alex Terieur"],
"book_category": "/novel/comedy/new"
},
{
"id": 2,
"title": "a dramatic book",
"authors": ["Alex Terieur"],
"book_category": "/novel/drama"
},
{
"id": 3,
"title": "A hilarious book",
"authors": ["Marc Assin", "Harry Covert"],
"book_category": "/novel/comedy"
},
{
"id": 4,
"title": "A sad story",
"authors": ["Gerard Menvusa", "Alex Terieur"],
"book_category": "/novel/drama"
},
{
"id": 5,
"title": "A very sad story",
"authors": ["Gerard Menvusa", "Alain Terieur"],
"book_category": "/novel"
}]
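I need to search for books by "book_category". The search must return books whose category matches the query category exactly or partially (within a defined depth threshold), and it must score them differently according to how closely they match.
For example, the query "book_category=/novel/comedy" with "depth_threshold=1" must return books with book_category=/novel/comedy (score = 100%), as well as /novel and /novel/comedy/new (score < 100%).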
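To make the requirement concrete, here is a minimal plain-Java sketch of the matching rule, independent of Lucene. The class and method names (DepthMatch, levelDistance, matches) are hypothetical, and it assumes normalized paths such as "/novel/comedy" (leading slash, no trailing slash, no spaces).

```java
import java.util.List;
import java.util.OptionalInt;
import java.util.stream.Collectors;

final class DepthMatch {
    // Number of levels between two categories on the same branch.
    // Empty if neither path is an ancestor of the other.
    static OptionalInt levelDistance(String query, String doc) {
        if (doc.equals(query)) {
            return OptionalInt.of(0);
        }
        String shorter = doc.length() < query.length() ? doc : query;
        String longer = doc.length() < query.length() ? query : doc;
        if (!longer.startsWith(shorter + "/")) {
            return OptionalInt.empty();
        }
        // Each extra '/' in the longer path is one more level.
        int extra = 0;
        for (int i = shorter.length(); i < longer.length(); i++) {
            if (longer.charAt(i) == '/') {
                extra++;
            }
        }
        return OptionalInt.of(extra);
    }

    // True if doc is the query category, or an ancestor/descendant within the threshold.
    static boolean matches(String query, String doc, int depthThreshold) {
        OptionalInt distance = levelDistance(query, doc);
        return distance.isPresent() && distance.getAsInt() <= depthThreshold;
    }

    public static void main(String[] args) {
        List<String> categories = List.of(
                "/novel/comedy/new", "/novel/drama", "/novel/comedy", "/novel/drama", "/novel");
        List<String> hits = categories.stream()
                .filter(c -> matches("/novel/comedy", c, 1))
                .collect(Collectors.toList());
        System.out.println(hits);
    }
}
```

With a threshold of 1, the sibling category /novel/drama is rejected because it is neither an ancestor nor a descendant of /novel/comedy.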
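I tried TopScoreDocCollector in my search, but it returns the books whose book_category at least contains the query category, and it gives them all the same score.
How can I get a search that also returns the more general categories and assigns different match scores to the results?
P.S.: I don't need faceted search.
Thanks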
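Answer 0 (score: 1)
There is no built-in query that supports this requirement, but you can use a DisjunctionMaxQuery with multiple ConstantScoreQuery clauses. The exact category and the more general category can each be searched with a simple TermQuery. For the sub-categories, if you don't know them up front, you can use a MultiTermQuery such as RegexpQuery to match them all. For example: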
// the exact category
Query directQuery = new TermQuery(new Term("book_category", "/novel/comedy"));
// regex that matches exactly one level below your exact category
Query narrowerQuery = new RegexpQuery(new Term("book_category", "/novel/comedy/[^/]+"));
// the more general category
Query broaderQuery = new TermQuery(new Term("book_category", "/novel"));
directQuery = new ConstantScoreQuery(directQuery);
narrowerQuery = new ConstantScoreQuery(narrowerQuery);
broaderQuery = new ConstantScoreQuery(broaderQuery);
// 100% for the exact category
directQuery.setBoost(1.0F);
// 80% for the more specific category
narrowerQuery.setBoost(0.8F);
// 50% for the more general category
broaderQuery.setBoost(0.5F);
DisjunctionMaxQuery query = new DisjunctionMaxQuery(0.0F);
query.add(directQuery);
query.add(narrowerQuery);
query.add(broaderQuery);
This produces results like the following:
id=3 title=a hilarious book book_category=/novel/comedy score=1.000000
id=1 title=a funny book book_category=/novel/comedy/new score=0.800000
id=5 title=A very sad story book_category=/novel score=0.500000
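The snippet above hard-codes one level in each direction. As a rough sketch of how it might generalize to an arbitrary depth threshold, the plain-Java helper below only computes the term or regex strings and boosts. Each entry would then be wrapped in a TermQuery or RegexpQuery as shown above. The names (CategoryClauses, Clause, build) and the per-level decay factors are hypothetical, and it assumes category names contain no regex metacharacters. The `{n}` repetition is, to my understanding, also part of Lucene's regexp syntax.

```java
import java.util.ArrayList;
import java.util.List;

final class CategoryClauses {
    // One clause: a plain term (regex == false) or a regular expression (regex == true).
    static final class Clause {
        final String value;
        final boolean regex;
        final float boost;

        Clause(String value, boolean regex, float boost) {
            this.value = value;
            this.regex = regex;
            this.boost = boost;
        }

        @Override
        public String toString() {
            return (regex ? "regexp " : "term ") + value + " boost=" + boost;
        }
    }

    // Builds the exact clause plus one narrower and one broader clause per level.
    // The boost is multiplied by the matching decay factor at each level (an assumed scheme).
    static List<Clause> build(String category, int depthThreshold,
                              float narrowerDecay, float broaderDecay) {
        List<Clause> clauses = new ArrayList<>();
        clauses.add(new Clause(category, false, 1.0f));
        String ancestor = category;
        float narrowerBoost = 1.0f;
        float broaderBoost = 1.0f;
        for (int level = 1; level <= depthThreshold; level++) {
            narrowerBoost *= narrowerDecay;
            broaderBoost *= broaderDecay;
            // Descendants exactly `level` levels below the category.
            clauses.add(new Clause(category + "(/[^/]+){" + level + "}", true, narrowerBoost));
            // The ancestor `level` levels above, if there is one.
            int cut = ancestor.lastIndexOf('/');
            if (cut > 0) {
                ancestor = ancestor.substring(0, cut);
                clauses.add(new Clause(ancestor, false, broaderBoost));
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        for (Clause clause : build("/novel/comedy", 1, 0.8f, 0.5f)) {
            System.out.println(clause);
        }
    }
}
```

With a threshold of 1 and decay factors 0.8 and 0.5, this yields the same three clauses and boosts as the hand-written example.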
For a complete test case, see this gist: https://gist.github.com/knutwalker/7959819
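Answer 1 (score: 0)
This could work as a solution. However, I have more than one hierarchical field to query, and I would like to use the CategoryPath of the taxonomy index. I am using a DrillDown query: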
DrillDownQuery luceneQuery = new DrillDownQuery(searchParams.indexingParams);
luceneQuery.add(new CategoryPath("book_category/novel/comedy", '/'));
luceneQuery.add(new CategoryPath("subject/sub1/sub2",'/'));
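This way the search returns the books that match both category paths and their descendants. To retrieve the ancestors as well, I can drill down starting from the ancestors of the requested categoryPath (retrieved from the taxonomy).
The problem is that all results get the same score. I would like to override the similarity/scoring function so that the score is computed from the categoryPath length, by comparing the query categoryPath with the CategoryPath (book_category) of each returned document.
For example: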
if(queryCategoryPath.compareTo(bookCategoryPath)==0){
document.score = 1
}else if(queryCategoryPath.compareTo(bookCategoryPath)==1){
document.score = 0.9
}else if(queryCategoryPath.compareTo(bookCategoryPath)==2){
document.score = 0.8
} // and so on
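As a runnable sketch of that idea, the plain-Java snippet below derives the score from the difference in path depth. It only illustrates the arithmetic. Wiring it into Lucene's scoring is not shown. The names (DistanceScore, depth, score) are hypothetical, the paths are plain strings rather than CategoryPath objects, and it assumes both paths are already known to lie on the same branch, which should hold for results of the drill-down.

```java
import java.util.List;

final class DistanceScore {
    // Number of non-empty components in a '/'-delimited path.
    static int depth(String path) {
        int count = 0;
        for (String part : path.split("/")) {
            if (!part.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    // 1.0 for the same depth, minus 0.1 per level of difference, never below 0.
    // Assumes docPath is an ancestor or descendant of queryPath (or equal to it).
    static double score(String queryPath, String docPath) {
        int levels = Math.abs(depth(queryPath) - depth(docPath));
        return Math.max(0, 10 - levels) / 10.0;
    }

    public static void main(String[] args) {
        String query = "book_category/novel/comedy";
        List<String> docs = List.of(
                "book_category/novel/comedy/new",
                "book_category/novel/comedy",
                "book_category/novel");
        for (String doc : docs) {
            System.out.println(doc + " score=" + score(query, doc));
        }
    }
}
```

The integer arithmetic before the division keeps the steps at clean tenths instead of accumulating floating-point error.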