Question

I have a set of documents annotated with hierarchical taxonomy labels, e.g.:

[
{
    "id": 1,
    "title": "a funny book",
    "authors": ["Jean Bon", "Alex Terieur"],
    "book_category": "/novel/comedy/new"
},
{
    "id": 2,
    "title": "a dramatic book",
    "authors": ["Alex Terieur"],
    "book_category": "/novel/drama"
},
{
    "id": 3,
    "title": "A hilarious book",
    "authors": ["Marc Assin", "Harry Covert"],
    "book_category": "/novel/comedy"
},
{
    "id": 4,
    "title": "A sad story",
    "authors": ["Gerard Menvusa", "Alex Terieur"],
    "book_category": "/novel/drama"
},
{
    "id": 5,
    "title": "A very sad story",
    "authors": ["Gerard Menvusa", "Alain Terieur"],
    "book_category": "/novel"
}]

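I need to search books by "book_category". The search must return books that match the query category exactly or partially (within a defined depth threshold), and score them differently according to how closely they match.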

For example, the query "book_category=/novel/comedy" with "depth_threshold=1" must return books with book_category=/novel/comedy (score = 100%), as well as /novel and /novel/comedy/new (score < 100%).

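I tried using TopScoreDocCollector in the search, but it only returns the books whose book_category contains at least the query category, and it gives them all the same score.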

How can I get a search that also returns the more general categories and assigns different match scores to the results?

P.S.: I don't need faceted search.

Thanks

Answer 1

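There is no built-in query that supports this requirement, but you can use a DisjunctionMaxQuery over several ConstantScoreQuery instances. The exact category and the more general categories can be searched with a simple TermQuery. For the subcategories, if you don't know them in advance, you can use a MultiTermQuery such as RegexpQuery to match all of them. For example: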

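import org.apache.lucene.index.Term;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.RegexpQuery;
import org.apache.lucene.search.TermQuery;

// the exact category
Query directQuery = new TermQuery(new Term("book_category", "/novel/comedy"));
// regex that matches exactly one level more than your exact category
Query narrowerQuery = new RegexpQuery(new Term("book_category", "/novel/comedy/[^/]+"));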
// the more general category
Query broaderQuery = new TermQuery(new Term("book_category", "/novel"));

directQuery = new ConstantScoreQuery(directQuery);
narrowerQuery = new ConstantScoreQuery(narrowerQuery);
broaderQuery = new ConstantScoreQuery(broaderQuery);

// 100% for the exact category
directQuery.setBoost(1.0F);
// 80% for the more specific category
narrowerQuery.setBoost(0.8F);
// 50% for the more general category
broaderQuery.setBoost(0.5F);

DisjunctionMaxQuery query = new DisjunctionMaxQuery(0.0F);

query.add(directQuery);
query.add(narrowerQuery);
query.add(broaderQuery);

This produces results like the following:

id=3 title=a hilarious book book_category=/novel/comedy score=1.000000
id=1 title=a funny book book_category=/novel/comedy/new score=0.800000
id=5 title=A very sad story book_category=/novel score=0.500000

For a complete test case, see this gist: https://gist.github.com/knutwalker/7959819
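The three hard-coded queries above could be derived from the query category and a depth threshold. The following is only a minimal sketch in plain Java with no Lucene dependency: `CategoryClauses`, `Clause` and `build` are hypothetical names, and compounding the boosts per level (0.5 per broader level, 0.8 per narrower level) is an assumption extrapolated from the single-level values used above. Each resulting clause would still have to be turned into a TermQuery or RegexpQuery, wrapped in a ConstantScoreQuery and added to the DisjunctionMaxQuery as shown before. The generated patterns are only checked here against java.util.regex, so they should be verified against Lucene's own regexp syntax, and category names are assumed to contain no regex metacharacters.

```java
import java.util.ArrayList;
import java.util.List;

class CategoryClauses {

    /** One clause of the disjunction: a term or regex pattern plus its boost. */
    static final class Clause {
        final String pattern;
        final boolean regex;
        final float boost;

        Clause(String pattern, boolean regex, float boost) {
            this.pattern = pattern;
            this.regex = regex;
            this.boost = boost;
        }
    }

    /** Builds the exact, broader and narrower clauses for a category such as "/novel/comedy". */
    static List<Clause> build(String category, int depthThreshold) {
        List<Clause> clauses = new ArrayList<>();

        // the exact category always scores 100%
        clauses.add(new Clause(category, false, 1.0f));

        // broader categories: strip one trailing segment per level (assumed decay 0.5 per level)
        String ancestor = category;
        for (int level = 1; level <= depthThreshold; level++) {
            int cut = ancestor.lastIndexOf('/');
            if (cut <= 0) {
                break; // already at a top-level category
            }
            ancestor = ancestor.substring(0, cut);
            clauses.add(new Clause(ancestor, false, (float) Math.pow(0.5, level)));
        }

        // narrower categories: exactly `level` extra segments (assumed decay 0.8 per level)
        for (int level = 1; level <= depthThreshold; level++) {
            String pattern = category + "(/[^/]+){" + level + "}";
            clauses.add(new Clause(pattern, true, (float) Math.pow(0.8, level)));
        }
        return clauses;
    }

    public static void main(String[] args) {
        for (Clause c : build("/novel/comedy", 1)) {
            System.out.println((c.regex ? "regex " : "term  ") + c.pattern + " boost=" + c.boost);
        }
    }
}
```

Using one regex per level, rather than a single pattern covering all deeper levels, is what lets each depth receive its own constant score.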

Answer 2

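This could be a solution. However, I have more than one hierarchical field to query, and I would like to use the CategoryPath of the taxonomy index. I am using a DrillDown query: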

DrillDownQuery luceneQuery = new DrillDownQuery(searchParams.indexingParams); 
luceneQuery.add(new CategoryPath("book_category/novel/comedy", '/'));
luceneQuery.add(new CategoryPath("subject/sub1/sub2",'/'));

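This way, the search returns the books that match both category paths and their descendants. To retrieve the ancestors as well, I can start the drill-down from an ancestor of the requested categoryPath (retrieved from the taxonomy).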

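The problem is that the score is the same for all results. I would like to override the similarity/scoring function so that the score is computed from the categoryPath length, by comparing the query categoryPath with the CategoryPath (book_category) of each returned document.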

For example:

// pseudocode: compareTo is used here to mean the depth difference between the two paths
if (queryCategoryPath.compareTo(bookCategoryPath) == 0) {
    document.score = 1
} else if (queryCategoryPath.compareTo(bookCategoryPath) == 1) {
    document.score = 0.9
} else if (queryCategoryPath.compareTo(bookCategoryPath) == 2) {
    document.score = 0.8
} // and so on
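The scoring rule described above can be written down as a plain function, independent of how it would be plugged into Lucene. This is only a sketch under assumptions: `DepthScore`, `segments` and `score` are hypothetical names, the 0.1 penalty per level is read off the 1 / 0.9 / 0.8 values in the pseudocode, and paths on a different branch or beyond the depth threshold are assumed to score 0. Wiring this into the actual ranking (for instance through a custom collector or scoring hook) is not shown.

```java
class DepthScore {

    /** Splits "/novel/comedy" into ["novel", "comedy"]; a leading slash is ignored. */
    static String[] segments(String path) {
        String trimmed = path.startsWith("/") ? path.substring(1) : path;
        return trimmed.isEmpty() ? new String[0] : trimmed.split("/");
    }

    /**
     * Returns 1.0 for an exact match and 0.1 less for every level separating the two paths
     * when one is an ancestor of the other. Returns 0.0 when the paths are on different
     * branches or further apart than depthThreshold.
     */
    static double score(String queryPath, String docPath, int depthThreshold) {
        String[] query = segments(queryPath);
        String[] doc = segments(docPath);

        // the shorter path must be a prefix of the longer one
        int common = Math.min(query.length, doc.length);
        for (int i = 0; i < common; i++) {
            if (!query[i].equals(doc[i])) {
                return 0.0;
            }
        }

        int distance = Math.abs(query.length - doc.length);
        if (distance > depthThreshold) {
            return 0.0;
        }
        return Math.max(0.0, 1.0 - 0.1 * distance);
    }

    public static void main(String[] args) {
        String[] categories = {"/novel/comedy/new", "/novel/drama", "/novel/comedy", "/novel"};
        for (String category : categories) {
            System.out.println(category + " -> " + score("/novel/comedy", category, 1));
        }
    }
}
```

With the query /novel/comedy and a threshold of 1, the books in /novel and /novel/comedy/new get the same reduced score, and /novel/drama is excluded because it sits on a different branch.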

Lucene Hierarchical Taxonomy Search
