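SpanNotQuery gives unexpected results (ignores the exclude)

Time: 2014-06-17 09:23:17

Tags: lucene elasticsearch

We are having some trouble with SpanNotQuery in elasticsearch. It looks as if the exclude part of the query is being ignored.

To reproduce the problem, I created a set of documents: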

  1. fiets kopen
  2. fiets lopen
  3. harrie kopen
  4. harrie lopen
  5. harrie fiets
  6. kopen lopen
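
A SpanTermQuery for harrie returns (3,4,5).

A SpanTermQuery for kopen returns (1,3,6).

Now I want to combine these in a SpanNotQuery where the include is 'harrie' and the exclude is 'kopen'.

I expected the result to be (4,5), but it is (3,4,5).

We have to use SpanQueries; this is only a small part of the trouble we are having.

I created a unit test with Lucene to demonstrate our problem: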

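    import java.io.IOException;

    import org.apache.lucene.analysis.core.SimpleAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.IndexableField;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.search.spans.SpanNotQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;
    import org.junit.Test;

    public class LuceneTest {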
    
        @Test
        public void test() throws Exception {
            RAMDirectory ram = new RAMDirectory();
            createAndFillIndex(ram);
    
            DirectoryReader directoryReader = DirectoryReader.open(ram);
            IndexSearcher searcher = new IndexSearcher(directoryReader);
    
            SpanQuery include = new SpanTermQuery(new Term("dummy", "harrie"));
            SpanQuery exclude = new SpanTermQuery(new Term("dummy", "kopen"));
            Query spanNot = new SpanNotQuery(include, exclude);
    
            TopDocs search = searcher.search(spanNot, 100);
            for (ScoreDoc scoreDoc : search.scoreDocs) {
                Document result = searcher.doc(scoreDoc.doc);
                String dummy = result.get("dummy");
                System.out.println(scoreDoc.doc + ": " + dummy);
            }
    
        }
    
        private void createAndFillIndex(RAMDirectory ram) throws IOException {
            IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_47, new SimpleAnalyzer(Version.LUCENE_47));
            IndexWriter writer = new IndexWriter(ram, conf);
    
            add(writer, "nul"); //0
            add(writer, "fiets kopen"); //1
            add(writer, "fiets lopen"); //2
            add(writer, "harrie kopen"); //3
            add(writer, "harrie lopen"); //4
            add(writer, "harrie fiets"); //5
            add(writer, "kopen lopen"); //6
    
            writer.close();
        }
    
        private void add(IndexWriter writer, String value) throws IOException {
            Document doc = new Document();
            IndexableField f = new TextField("dummy", value, Field.Store.YES);
            doc.add(f);
            writer.addDocument(doc);
        }
    
    }
    

Does anyone know what we are doing wrong?

Thanks!

1 answer:

Answer 0: (score: 3)

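The documentation provides a hint. SpanNotQuery matches:

"spans from include which have no overlap with spans from exclude"

We are dealing with spans here, not whole documents. The span matched by a simple term query is just that single term. In each of the three matching documents in your example, the matched span is "harrie", which does not overlap the term "kopen" in any of them.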
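To make the overlap rule concrete, here is a minimal toy model in plain Java. It is not Lucene code: the class `TermSpanSketch` and its methods are hypothetical names, a span is assumed to be a half-open range of token positions, and only the first occurrence of each term is considered.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only, not Lucene's implementation.
// A span covers the token positions [start, end).
final class TermSpanSketch {

    // Position of the first occurrence of term in a whitespace-tokenized text, or -1.
    static int positionOf(String text, String term) {
        List<String> tokens = Arrays.asList(text.split("\\s+"));
        return tokens.indexOf(term);
    }

    // Two half-open ranges overlap when each one starts before the other ends.
    static boolean overlaps(int startA, int endA, int startB, int endB) {
        return startA < endB && startB < endA;
    }

    // Span-level negation: keep the document when the include term's span
    // does not overlap the exclude term's span.
    static boolean spanNotMatches(String text, String include, String exclude) {
        int inc = positionOf(text, include);
        if (inc < 0) {
            return false; // include term is absent, so there is no span to keep
        }
        int exc = positionOf(text, exclude);
        return exc < 0 || !overlaps(inc, inc + 1, exc, exc + 1);
    }

    public static void main(String[] args) {
        // "harrie" covers [0, 1) and "kopen" covers [1, 2): no overlap, so it still matches.
        System.out.println(spanNotMatches("harrie kopen", "harrie", "kopen")); // prints true
    }
}
```

Under this model, two different single terms can never occupy the same position, so an exclude made of one term never removes a match from an include made of another term.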

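It might be more helpful to see an example where the exclusion does take effect. You should be able to copy and paste the following snippets into your example (thanks for the MCVE, by the way!). Let's try this query: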

    SpanQuery include = new SpanTermQuery(new Term("dummy", "harrie"));
    SpanQuery exclude = new SpanTermQuery(new Term("dummy", "kopen"));
    SpanQuery matchterm = new SpanTermQuery(new Term("dummy", "match"));

    SpanQuery[] clauses = {include, matchterm};

    SpanQuery nearQuery = new SpanNearQuery(clauses, 2, true);

    Query spanNot = new SpanNotQuery(nearQuery, exclude);

Against these documents:

    add(writer, "harrie kopen match"); //1
    add(writer, "harrie match kopen"); //2
    add(writer, "harrie other stuff match kopen"); //3

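You should see 2 hits.

  • Document 1: matches nearQuery with the span "harrie kopen match". That span contains "kopen" (that is, it overlaps a span matched by exclude), so the SpanNotQuery eliminates it.

  • Document 2: matches nearQuery with the span "harrie match". The document contains "kopen", but not inside the matched span, so the document remains a match.

  • Document 3: matches nearQuery with the span "harrie other stuff match". Again, the document contains "kopen", but not inside the matched span, so it passes.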
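The three cases above can be traced with a small simulation in plain Java. This is only a sketch under stated assumptions, not how Lucene evaluates spans: `SpanNotSketch`, `nearSpan` and `spanNotMatches` are hypothetical names, slop is modeled as the maximum number of tokens between the two ordered terms, and only the first near match per document is checked.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy simulation only, not Lucene's implementation.
final class SpanNotSketch {

    // Returns {start, end} (end exclusive) of the first ordered match of
    // first ... second with at most slop tokens in between, or null if none.
    static int[] nearSpan(String[] tokens, String first, String second, int slop) {
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].equals(first)) {
                continue;
            }
            for (int j = i + 1; j < tokens.length && j - i - 1 <= slop; j++) {
                if (tokens[j].equals(second)) {
                    return new int[] {i, j + 1};
                }
            }
        }
        return null;
    }

    // Keeps the document when the near span exists and the exclude term
    // does not occur at any position inside that span.
    static boolean spanNotMatches(String text, String first, String second,
                                  int slop, String exclude) {
        String[] tokens = text.split("\\s+");
        int[] span = nearSpan(tokens, first, second, slop);
        if (span == null) {
            return false;
        }
        for (int pos = span[0]; pos < span[1]; pos++) {
            if (tokens[pos].equals(exclude)) {
                return false; // exclude overlaps the matched span
            }
        }
        return true;
    }

    // Filters a list of documents down to those that survive the span-level negation.
    static List<String> hits(List<String> docs) {
        List<String> result = new ArrayList<>();
        for (String doc : docs) {
            if (spanNotMatches(doc, "harrie", "match", 2, "kopen")) {
                result.add(doc);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "harrie kopen match",
                "harrie match kopen",
                "harrie other stuff match kopen");
        System.out.println(hits(docs).size()); // prints 2
    }
}
```

Only the first document is dropped, because it is the only one where "kopen" falls between "harrie" and "match".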

If you want the negation to apply to the whole document rather than just the matched span, use a BooleanQuery instead:

    SpanQuery include = new SpanTermQuery(new Term("dummy", "harrie"));
    SpanQuery exclude = new SpanTermQuery(new Term("dummy", "kopen"));
    // Declare the variable as BooleanQuery: the Query base class has no add() method.
    BooleanQuery query = new BooleanQuery();
    query.add(new BooleanClause(include, BooleanClause.Occur.MUST));
    query.add(new BooleanClause(exclude, BooleanClause.Occur.MUST_NOT));