Lucene Hierarchical Taxonomy Search

Asked: 2013-12-13 12:56:36

Tags: search lucene taxonomy

I have a set of documents annotated with hierarchical taxonomy tags, e.g.:

[
{
    "id": 1,
    "title": "a funny book",
    "authors": ["Jean Bon", "Alex Terieur"],
    "book_category": "/novel/comedy/new"
},
{
    "id": 2,
    "title": "a dramatic book",
    "authors": ["Alex Terieur"],
    "book_category": "/novel/drama"
},
{
    "id": 3,
    "title": "A hilarious book",
    "authors": ["Marc Assin", "Harry Covert"],
    "book_category": "/novel/comedy"
},
{
    "id": 4,
    "title": "A sad story",
    "authors": ["Gerard Menvusa", "Alex Terieur"],
    "book_category": "/novel/drama"
},
{
    "id": 5,
    "title": "A very sad story",
    "authors": ["Gerard Menvusa", "Alain Terieur"],
    "book_category": "/novel"
}]

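I need to search for books by "book_category". The search must return books whose category matches the query category exactly or partially (within a defined depth threshold), and score them differently according to how closely they match.

For example: the query "book_category=/novel/comedy" with "depth_threshold=1" must return books with book_category=/novel/comedy (score = 100%), as well as /novel and /novel/comedy/new (score < 100%).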
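To make the rule concrete, here is a minimal plain-Java sketch (no Lucene involved) of the match I have in mind. The class and method names (CategoryMatch, depthDistance, matches) are made up for illustration, and it assumes categories are '/'-separated strings like the ones above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, only meant to illustrate the matching rule.
class CategoryMatch {

    // Splits "/novel/comedy" into ["novel", "comedy"]; empty segments are dropped.
    static List<String> segments(String path) {
        List<String> out = new ArrayList<>();
        for (String s : path.split("/")) {
            if (!s.isEmpty()) {
                out.add(s);
            }
        }
        return out;
    }

    // Number of levels between the two paths when one is a prefix of the
    // other (0 = same category), or -1 when they lie on different branches.
    static int depthDistance(String query, String candidate) {
        List<String> q = segments(query);
        List<String> c = segments(candidate);
        int shared = Math.min(q.size(), c.size());
        if (!q.subList(0, shared).equals(c.subList(0, shared))) {
            return -1;
        }
        return Math.abs(q.size() - c.size());
    }

    // True when the candidate is the query category, or an ancestor or
    // descendant of it that is at most depthThreshold levels away.
    static boolean matches(String query, String candidate, int depthThreshold) {
        int d = depthDistance(query, candidate);
        return d >= 0 && d <= depthThreshold;
    }

    public static void main(String[] args) {
        String[][] books = {
            {"1", "/novel/comedy/new"},
            {"2", "/novel/drama"},
            {"3", "/novel/comedy"},
            {"4", "/novel/drama"},
            {"5", "/novel"},
        };
        for (String[] book : books) {
            if (matches("/novel/comedy", book[1], 1)) {
                System.out.println("id=" + book[0] + " book_category=" + book[1]
                        + " distance=" + depthDistance("/novel/comedy", book[1]));
            }
        }
    }
}
```

With the five sample books, this selects ids 1, 3 and 5; id 3 is the exact match (distance 0) and the other two are one level away.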

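I tried using TopScoreDocCollector for the search, but it returns every book whose book_category at least contains the query category, and gives them all the same score.

How can I get a search that also returns the more general categories and assigns the results different match scores?

P.S.: I don't need faceted search.

Thanks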

2 Answers:

Answer 0 (score: 1)

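There is no built-in query that supports this requirement, but you can use a DisjunctionMaxQuery over several ConstantScoreQuery clauses. The exact category and the more general category can each be searched with a simple TermQuery. For the subcategories, if you don't know them in advance, you can use a MultiTermQuery such as RegexpQuery to match all of them. For example: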

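// imports for the snippet below (Lucene 4.x packages)
import org.apache.lucene.index.Term;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.RegexpQuery;
import org.apache.lucene.search.TermQuery;

// the exact category
Query directQuery = new TermQuery(new Term("book_category", "/novel/comedy"));
// regex that matches exactly one level below the exact category
Query narrowerQuery = new RegexpQuery(new Term("book_category", "/novel/comedy/[^/]+"));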
// the more general category
Query broaderQuery = new TermQuery(new Term("book_category", "/novel"));

directQuery = new ConstantScoreQuery(directQuery);
narrowerQuery = new ConstantScoreQuery(narrowerQuery);
broaderQuery = new ConstantScoreQuery(broaderQuery);

// 100% for the exact category
directQuery.setBoost(1.0F);
// 80% for the more specific category
narrowerQuery.setBoost(0.8F);
// 50% for the more general category
broaderQuery.setBoost(0.5F);

DisjunctionMaxQuery query = new DisjunctionMaxQuery(0.0F);

query.add(directQuery);
query.add(narrowerQuery);
query.add(broaderQuery);

This produces results like the following:

id=3 title=a hilarious book book_category=/novel/comedy score=1.000000
id=1 title=a funny book book_category=/novel/comedy/new score=0.800000
id=5 title=A very sad story book_category=/novel score=0.500000

For a complete test case, see this gist: https://gist.github.com/knutwalker/7959819
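The three hand-written clauses generalize to any depth threshold. Below is a rough plain-Java sketch that only computes the clause texts and boosts; it is not part of the gist, and the names (CategoryClauses, Clause, build) as well as the per-level decay factors are assumptions made for illustration. Each resulting clause would then be turned into a TermQuery or RegexpQuery, wrapped in a ConstantScoreQuery and added to the DisjunctionMaxQuery as shown above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: derives the clauses of the disjunction from a category.
class CategoryClauses {

    // One clause: the term text (or regular expression), whether it is a
    // regular expression, and the boost to apply to it.
    static final class Clause {
        final String text;
        final boolean regexp;
        final float boost;

        Clause(String text, boolean regexp, float boost) {
            this.text = text;
            this.regexp = regexp;
            this.boost = boost;
        }
    }

    // Assumption: the boost shrinks geometrically per level, with separate
    // factors for narrower (descendant) and broader (ancestor) categories.
    // The category text is used verbatim, so regexp metacharacters in
    // category names would need escaping first.
    static List<Clause> build(String category, int depthThreshold,
                              float narrowerFactor, float broaderFactor) {
        List<Clause> clauses = new ArrayList<>();
        clauses.add(new Clause(category, false, 1.0f));

        String pattern = category;
        String ancestor = category;
        for (int d = 1; d <= depthThreshold; d++) {
            // Descendants exactly d levels below the category.
            pattern = pattern + "/[^/]+";
            clauses.add(new Clause(pattern, true, (float) Math.pow(narrowerFactor, d)));

            // Ancestor d levels above, if there is one left.
            int cut = ancestor.lastIndexOf('/');
            if (cut > 0) {
                ancestor = ancestor.substring(0, cut);
                clauses.add(new Clause(ancestor, false, (float) Math.pow(broaderFactor, d)));
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        for (Clause c : build("/novel/comedy", 1, 0.8f, 0.5f)) {
            System.out.println((c.regexp ? "regexp " : "term   ") + c.text + " boost=" + c.boost);
        }
    }
}
```

For "/novel/comedy" with a threshold of 1 and factors 0.8 and 0.5, this yields the same three clauses as the example: the exact term, the one-level regexp, and the term "/novel".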

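Answer 1 (score: 0)

This could be a solution. However, I have more than one hierarchical field to query, and I would like to use the CategoryPath of the taxonomy index. I am using a DrillDown query: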

DrillDownQuery luceneQuery = new DrillDownQuery(searchParams.indexingParams);
luceneQuery.add(new CategoryPath("book_category/novel/comedy", '/'));
luceneQuery.add(new CategoryPath("subject/sub1/sub2", '/'));

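This way, the search returns the books that match both category paths, along with their descendants. To retrieve the ancestors as well, I can start the drill-down from an ancestor of the requested categoryPath (retrieved from the taxonomy).

The problem is that all results get the same score. I would like to override the similarity/scoring function so that the score is computed from the categoryPath length, by comparing the query categoryPath with the CategoryPath (book_category) of each returned document.

For example: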

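// Pseudocode: compareTo stands here for the difference in depth between
// the two paths, not for an ordering comparison.
if (queryCategoryPath.compareTo(bookCategoryPath) == 0) {
    document.score = 1.0;
} else if (queryCategoryPath.compareTo(bookCategoryPath) == 1) {
    document.score = 0.9;
} else if (queryCategoryPath.compareTo(bookCategoryPath) == 2) {
    document.score = 0.8;
} // and so on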
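As a stopgap, the same idea can be sketched outside Lucene by re-scoring the hits after the search, without overriding the similarity. The following is only an illustration under assumptions: the book_category of each hit is available as a normalized '/'-separated string, and all names (DepthRescorer, Hit, rescore) and the 0.1 step per level are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical post-processing step: re-scores hits by depth difference.
class DepthRescorer {

    static final class Hit {
        final int id;
        final String category;
        double score;

        Hit(int id, String category) {
            this.id = id;
            this.category = category;
        }
    }

    // Number of non-empty '/'-separated segments in a path.
    static int depth(String path) {
        int n = 0;
        for (String s : path.split("/")) {
            if (!s.isEmpty()) {
                n++;
            }
        }
        return n;
    }

    // True when ancestor equals path or is one of its ancestors.
    static boolean isSameOrAncestor(String ancestor, String path) {
        return path.equals(ancestor) || path.startsWith(ancestor + "/");
    }

    // 1.0 for an exact match, minus `step` per level of difference,
    // and 0.0 for a category on a different branch.
    static double score(String query, String category, double step) {
        if (!isSameOrAncestor(query, category) && !isSameOrAncestor(category, query)) {
            return 0.0;
        }
        int distance = Math.abs(depth(query) - depth(category));
        return Math.max(0.0, 1.0 - step * distance);
    }

    // Returns a copy of the hits, re-scored and sorted by descending score.
    static List<Hit> rescore(String query, List<Hit> hits, double step) {
        List<Hit> out = new ArrayList<>(hits);
        for (Hit h : out) {
            h.score = score(query, h.category, step);
        }
        out.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return out;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit(1, "/novel/comedy/new"));
        hits.add(new Hit(3, "/novel/comedy"));
        hits.add(new Hit(5, "/novel"));
        for (Hit h : rescore("/novel/comedy", hits, 0.1)) {
            System.out.println("id=" + h.id + " book_category=" + h.category + " score=" + h.score);
        }
    }
}
```

This keeps the scoring rule in one place, at the cost of ignoring whatever score Lucene computed for the hit; whether that trade-off is acceptable depends on how the other query clauses should contribute.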