Lucene BlockJoin query that matches when all terms appear in either the parent or child documents

Date: 2019-06-04 05:49:37

Tags: lucene

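I have populated an index with parent and child documents using "blocks", i.e. the documents are added with the IndexWriter.addDocuments() method, with the parent as the last document in each block.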

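At the moment I have only managed to search for "blocks" where any of the query terms appears in the parent or a child. This gives me skewed results: for example, the top hits are "blocks" in which one term appears many times while the other terms do not appear at all.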

I want to search for "blocks" where all of the query terms must appear in either the parent or its children.

However, I am not sure how to construct the query.

My current query code is as follows:

Analyzer analyzer = new EnglishAnalyzer();
//Note, both parent and child docs have a 'textContent' field
QueryParser queryParser = new QueryParser("textContent", analyzer);
Directory index = FSDirectory.open(Paths.get("${indexParentDir}/${name}.lucene"));
BitSetProducer parentsFilter = new QueryBitSetProducer(new TermQuery(new Term("child", "N")));

Query textQuery = queryParser.parse("foo bar");

//Construct child query
BooleanQuery.Builder childQueryBuilder = new BooleanQuery.Builder();
childQueryBuilder.add(new BooleanClause(textQuery, BooleanClause.Occur.MUST));
childQueryBuilder.add(new BooleanClause(new TermQuery(new Term("child", "Y")), BooleanClause.Occur.MUST));
Query childQuery = new ToParentBlockJoinQuery(childQueryBuilder.build(), parentsFilter, ScoreMode.Avg);

//Construct parent query
BooleanQuery.Builder parentQueryBuilder = new BooleanQuery.Builder();
parentQueryBuilder.add(new BooleanClause(textQuery, BooleanClause.Occur.MUST));
parentQueryBuilder.add(new BooleanClause(new TermQuery(new Term("child", "N")), BooleanClause.Occur.MUST));

//Construct join of child and parent query
BooleanQuery.Builder childAndParentQueryBuilder = new BooleanQuery.Builder();
childAndParentQueryBuilder.add(new BooleanClause(childQuery, BooleanClause.Occur.SHOULD));
childAndParentQueryBuilder.add(new BooleanClause(parentQueryBuilder.build(), BooleanClause.Occur.SHOULD));
Query childAndParentQuery = childAndParentQueryBuilder.build();

//Run the query
DirectoryReader reader = DirectoryReader.open(index);
CheckJoinIndex.check(reader, parentsFilter);
IndexSearcher searcher = new IndexSearcher(reader);
searcher.search(childAndParentQuery, 10);

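The code above returns, as its best results, blocks where one of the terms appears many times, e.g. where "foo" appears 100 times in the parent or child documents but "bar" does not appear at all.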

I only want results where all of the terms (e.g. 'foo' and 'bar') appear in the parent or its children.
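To pin down the matching rule being asked for, here is a minimal plain-Java sketch of the block-level condition. It is not Lucene code and does not show how to build the query: the class name, method names, and the naive whitespace/punctuation tokenizer are all hypothetical and stand in for whatever the analyzer would produce.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class BlockAllTermsSketch {

    // Hypothetical stand-in for an analyzer: lowercase and split on non-word characters.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\W+"));
    }

    // True when every query term occurs in the parent text or in at least one child text.
    static boolean blockContainsAllTerms(String parentText,
                                         List<String> childTexts,
                                         List<String> queryTerms) {
        Set<String> seen = new HashSet<>(tokenize(parentText));
        for (String child : childTexts) {
            seen.addAll(tokenize(child));
        }
        for (String term : queryTerms) {
            if (!seen.contains(term.toLowerCase(Locale.ROOT))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> query = List.of("foo", "bar");
        // "foo" repeated many times but no "bar": should be rejected.
        System.out.println(blockContainsAllTerms("foo foo foo", List.of("foo again"), query));
        // "foo" in the parent and "bar" in a child: should be accepted.
        System.out.println(blockContainsAllTerms("about foo", List.of("nothing", "mentions bar"), query));
    }
}
```

The point of the model is that the requirement applies per term across the whole block, so a term may be satisfied by the parent while another is satisfied by a child.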

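One option is to create a field on the parent document that aggregates all the textContent fields of the parent and its children, and to search only that new aggregate field. However, these indexes are already large (e.g. 50GB), and I would still need textContent kept separate on the parent and children for display purposes, so adding an aggregate field would almost double the index size.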

Any help would be greatly appreciated.

1 Answer:

Answer 0: (score: 0)

I solved this by using a DisjunctionMaxQuery instead of a BooleanQuery to join the parent and child queries together.

From the documentation:

  

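...We want the primary score to be the one associated with the highest boost, not the sum of the field scores (as BooleanQuery would give). If the query is "albino elephant" this ensures that "albino" matching one field and "elephant" matching another gets a higher score than "albino" matching both fields...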

Updated code:

Analyzer analyzer = new EnglishAnalyzer();

//Note, both parent and child docs have a 'textContent' field
QueryParser queryParser = new QueryParser("textContent", analyzer);
Directory index = FSDirectory.open(Paths.get("${indexParentDir}/${name}.lucene"));
BitSetProducer parentsFilter = new QueryBitSetProducer(new TermQuery(new Term("child", "N")));

Query textQuery = queryParser.parse("foo bar");

//Construct child query
BooleanQuery.Builder childQueryBuilder = new BooleanQuery.Builder();
childQueryBuilder.add(new BooleanClause(textQuery, BooleanClause.Occur.MUST));
childQueryBuilder.add(new BooleanClause(new TermQuery(new Term("child", "Y")), BooleanClause.Occur.MUST));
Query childQuery = new ToParentBlockJoinQuery(childQueryBuilder.build(), parentsFilter, ScoreMode.Avg);

//Construct parent query
BooleanQuery.Builder parentQueryBuilder = new BooleanQuery.Builder();
parentQueryBuilder.add(new BooleanClause(textQuery, BooleanClause.Occur.MUST));
parentQueryBuilder.add(new BooleanClause(new TermQuery(new Term("child", "N")), BooleanClause.Occur.MUST));
Query parentQuery = parentQueryBuilder.build();

//Construct join of child and parent query
Query childAndParentQuery = new DisjunctionMaxQuery(Arrays.asList(childQuery, parentQuery), 0.5f);

//Run the query
DirectoryReader reader = DirectoryReader.open(index);
CheckJoinIndex.check(reader, parentsFilter);
IndexSearcher searcher = new IndexSearcher(reader);
searcher.search(childAndParentQuery, 10);
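
To make the effect of the 0.5f tie-breaker concrete, the following plain-Java sketch compares a simplified sum of clause scores (roughly what a BooleanQuery with SHOULD clauses produces) with the DisjunctionMaxQuery rule of taking the maximum clause score plus the tie-breaker times the remaining scores. It does not call Lucene; the class name, method names, and the example scores 4.0 and 1.0 are hypothetical and chosen only for illustration.

```java
import java.util.List;

public class DisMaxScoreSketch {

    // Simplified BooleanQuery-style combination: add up the scores of all matching clauses.
    static double sumScore(List<Double> clauseScores) {
        double sum = 0.0;
        for (double s : clauseScores) {
            sum += s;
        }
        return sum;
    }

    // DisjunctionMaxQuery-style combination: best clause score plus
    // tieBreaker times the sum of the other matching clause scores.
    static double disMaxScore(List<Double> clauseScores, double tieBreaker) {
        double max = 0.0;
        double sum = 0.0;
        for (double s : clauseScores) {
            sum += s;
            max = Math.max(max, s);
        }
        return max + tieBreaker * (sum - max);
    }

    public static void main(String[] args) {
        // Hypothetical scores for the child join clause and the parent clause.
        List<Double> scores = List.of(4.0, 1.0);
        System.out.println(sumScore(scores));         // 5.0
        System.out.println(disMaxScore(scores, 0.5)); // 4.5
    }
}
```

With a tie-breaker of 0 only the best clause counts, and with a tie-breaker of 1 the result equals the plain sum, so 0.5f sits halfway between the two behaviours.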