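We are running into some problems with SpanNotQuery in Elasticsearch. It looks like the exclude part of the query is being ignored.
To reproduce the problem, I created a set of documents (they are listed in the unit test below):
A SpanTermQuery for harrie results in (3,4,5)
A SpanTermQuery for kopen results in (1,3,6)
Now I want to combine these in a SpanNotQuery where the include is 'harrie' and the exclude is 'kopen'.
I expected the result to be (4,5), but it is (3,4,5).
We have to use SpanQueries; this is just a small part of the trouble we are running into.
I created a unit test with plain Lucene to show our problem: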
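import java.io.IOException;

import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.spans.SpanNotQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.junit.Test;

public class LuceneTest {

    @Test
    public void test() throws Exception {
        RAMDirectory ram = new RAMDirectory();
        createAndFillIndex(ram);
        DirectoryReader directoryReader = DirectoryReader.open(ram);
        IndexSearcher searcher = new IndexSearcher(directoryReader);

        // Include spans matching "harrie", exclude spans matching "kopen".
        SpanQuery include = new SpanTermQuery(new Term("dummy", "harrie"));
        SpanQuery exclude = new SpanTermQuery(new Term("dummy", "kopen"));
        Query spanNot = new SpanNotQuery(include, exclude);

        // Print the ID and stored text of every hit.
        TopDocs search = searcher.search(spanNot, 100);
        for (ScoreDoc scoreDoc : search.scoreDocs) {
            Document result = searcher.doc(scoreDoc.doc);
            String dummy = result.get("dummy");
            System.out.println(scoreDoc.doc + ": " + dummy);
        }
    }

    private void createAndFillIndex(RAMDirectory ram) throws IOException {
        IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_47, new SimpleAnalyzer(Version.LUCENE_47));
        IndexWriter writer = new IndexWriter(ram, conf);
        add(writer, "nul");          // 0
        add(writer, "fiets kopen");  // 1
        add(writer, "fiets lopen");  // 2
        add(writer, "harrie kopen"); // 3
        add(writer, "harrie lopen"); // 4
        add(writer, "harrie fiets"); // 5
        add(writer, "kopen lopen");  // 6
        writer.close();
    }

    private void add(IndexWriter writer, String value) throws IOException {
        Document doc = new Document();
        IndexableField f = new TextField("dummy", value, Field.Store.YES);
        doc.add(f);
        writer.addDocument(doc);
    }
}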
Does anyone know what we are doing wrong?
Thanks!
Answer 0 (score: 3)
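The documentation provides a hint. SpanNotQuery matches:
"spans from include which have no overlap with spans from exclude"
We are dealing with spans here, not whole documents, and the span matched by a simple term query is just that single term. In each of the three matching documents in your example, the matched span is harrie, which does not overlap the term kopen in any of them. That is why document 3 ("harrie kopen") is still returned.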
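To make the overlap rule concrete, here is a minimal plain-Java sketch that models it on the documents from the question. It does not use Lucene at all: the class TermSpanNotSketch and its methods are hypothetical names made up for illustration, tokenization is a plain whitespace split, and a term match at token position p is treated as the span [p, p + 1). Under those assumptions, two single-term spans can only overlap when they sit at the same position.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of span-level exclusion; NOT Lucene's implementation.
final class TermSpanNotSketch {

    // Documents from the question, indexed by document ID.
    static final String[] DOCS = {
        "nul", "fiets kopen", "fiets lopen", "harrie kopen",
        "harrie lopen", "harrie fiets", "kopen lopen"
    };

    /** Token positions at which the term occurs; each one stands for the span [p, p + 1). */
    static List<Integer> positions(String text, String term) {
        String[] tokens = text.toLowerCase().split("\\s+");
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(term)) {
                result.add(i);
            }
        }
        return result;
    }

    /** True if at least one include span overlaps no exclude span. */
    static boolean spanNotMatches(String text, String include, String exclude) {
        List<Integer> excluded = positions(text, exclude);
        for (int p : positions(text, include)) {
            // Single-token spans [p, p + 1) and [q, q + 1) overlap only when p == q.
            if (!excluded.contains(p)) {
                return true;
            }
        }
        return false;
    }

    /** IDs of the documents that survive the span-level exclusion. */
    static List<Integer> matchingDocs(String include, String exclude) {
        List<Integer> ids = new ArrayList<>();
        for (int id = 0; id < DOCS.length; id++) {
            if (spanNotMatches(DOCS[id], include, exclude)) {
                ids.add(id);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(matchingDocs("harrie", "kopen")); // prints [3, 4, 5]
    }
}
```

In document 3, harrie sits at position 0 and kopen at position 1, so the two spans never overlap and the document stays in the result, which is exactly the behaviour you observed.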
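It may be more helpful to look at an example that shows how it works. You should be able to copy and paste the following snippets into your example (thanks for the MCVE, by the way!). Let's try this query: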
SpanQuery include = new SpanTermQuery(new Term("dummy", "harrie"));
SpanQuery exclude = new SpanTermQuery(new Term("dummy", "kopen"));
SpanQuery matchterm = new SpanTermQuery(new Term("dummy", "match"));
SpanQuery[] clauses = {include, matchterm};
SpanQuery nearQuery = new SpanNearQuery(clauses, 2, true);
Query spanNot = new SpanNotQuery(nearQuery, exclude);
against these documents:
add(writer, "harrie kopen match"); //1
add(writer, "harrie match kopen"); //2
add(writer, "harrie other stuff match kopen"); //3
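You should see 2 hits.
Doc 1: matches nearQuery with the span "harrie kopen match". That span contains "kopen" (that is, it overlaps a span matching exclude), so the SpanNotQuery rejects it.
Doc 2: matches nearQuery with the span "harrie match". The document contains "kopen", but not inside the matched span, so the document remains a match.
Doc 3: matches nearQuery with the span "harrie other stuff match". Again, the document contains "kopen", but not inside the matched span, so it gets through.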
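The walk-through of the three documents can be replayed with a small plain-Java model. This is only a sketch under stated assumptions, not how Lucene evaluates the query: SpanNearNotSketch, orderedNear and survivesSpanNot are hypothetical names, tokens are split on whitespace, and "slop" is simplified to the number of tokens allowed between harrie and match in an in-order match.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of SpanNot over an ordered "near" span; NOT Lucene's implementation.
final class SpanNearNotSketch {

    /** Half-open range [start, end) of token positions. */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }

        boolean overlaps(Span other) {
            return start < other.end && other.start < end;
        }
    }

    /** One span [i, i + 1) for every occurrence of the term. */
    static List<Span> termSpans(String[] tokens, String term) {
        List<Span> spans = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(term)) {
                spans.add(new Span(i, i + 1));
            }
        }
        return spans;
    }

    /** Spans where 'first' is followed by 'second' with at most 'slop' tokens in between. */
    static List<Span> orderedNear(String[] tokens, String first, String second, int slop) {
        List<Span> spans = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].equals(first)) {
                continue;
            }
            for (int j = i + 1; j < tokens.length && j - i - 1 <= slop; j++) {
                if (tokens[j].equals(second)) {
                    spans.add(new Span(i, j + 1)); // covers both terms and everything between
                    break;
                }
            }
        }
        return spans;
    }

    /** True if some "harrie ... match" span overlaps no "kopen" span. */
    static boolean survivesSpanNot(String text) {
        String[] tokens = text.toLowerCase().split("\\s+");
        List<Span> excluded = termSpans(tokens, "kopen");
        for (Span near : orderedNear(tokens, "harrie", "match", 2)) {
            boolean overlapsExcluded = false;
            for (Span ex : excluded) {
                if (near.overlaps(ex)) {
                    overlapsExcluded = true;
                }
            }
            if (!overlapsExcluded) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(survivesSpanNot("harrie kopen match"));             // false
        System.out.println(survivesSpanNot("harrie match kopen"));             // true
        System.out.println(survivesSpanNot("harrie other stuff match kopen")); // true
    }
}
```

The only document that drops out is the one where kopen falls between harrie and match, because only there does the exclude span lie inside the include span.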
If you want the negation to apply to the whole document, rather than only to the matched span, use a BooleanQuery instead.
SpanQuery include = new SpanTermQuery(new Term("dummy", "harrie"));
SpanQuery exclude = new SpanTermQuery(new Term("dummy", "kopen"));
// Declare the variable as BooleanQuery: the base Query class has no add() method.
BooleanQuery query = new BooleanQuery();
query.add(new BooleanClause(include, BooleanClause.Occur.MUST));
query.add(new BooleanClause(exclude, BooleanClause.Occur.MUST_NOT));