Question

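I'm writing a program that currently uses Elasticsearch as its back-end database/search index. I would like to mimic the functionality of the /_search endpoint, which currently uses a match query: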

{
    "query": {
        "match" : {
            "message" : "Neural Disruptor"
        }
    }
}

Running some sample queries against a massive World of Warcraft database yielded the following results:

   Search Term          Search Result      
------------------ ----------------------- 
 Neural Disruptor   Neural Needler         
 Lovly bracelet     Ruby Bracelet          
 Lovely bracelet    Lovely Charm Bracelet

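After looking through Elasticsearch's documentation, I discovered that the match query is fairly complex. What would be the easiest way to simulate a match query in Java with just Lucene? (It appears to do some fuzzy matching, as well as looking for terms.)

Importing Elasticsearch's code for MatchQuery (I believe org.elasticsearch.index.search.MatchQuery) doesn't seem to be that easy. It's heavily embedded in Elasticsearch and doesn't look like something that can easily be pulled out.

I don't need a foolproof "must match exactly what Elasticsearch matches" solution. I just need something close, or something that can fuzzy match and find the best match.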

Answer 1

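Whatever is sent to the q= parameter of the _search endpoint is used as is by the query_string query (not org.elasticsearch.index.search.MatchQuery), which understands the Lucene expression syntax.

The query parser syntax is defined in the Lucene project using JavaCC, and the grammar can be found here if you wish to have a look. The end product is a class called QueryParser (see below).

The class in the ES source code that is responsible for parsing the query string is QueryStringQueryParser, which delegates to Lucene's QueryParser class (generated by JavaCC).

So basically, if you can produce a query string equivalent to what gets passed to _search?q=..., you can use that query string with QueryParser.parse("query-string-goes-here") (an instance method, so you need a QueryParser constructed with a default field and an analyzer) and run the resulting Query using just Lucene.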

Answer 2

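It's been a while since I've worked directly with Lucene, but what you want should, initially, be fairly straightforward. The base behavior of a Lucene query is very similar to the match query (query_string is exactly equivalent to Lucene, but match is very close). I put together a small example that uses just Lucene (7.2.1), if you want to try it out. The main code is as follows: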

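// Imports needed by the methods below (Lucene 7.2.1)
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;

public static void main(String[] args) throws Exception {
    // Create the in-memory Lucene index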
    RAMDirectory ramDir = new RAMDirectory();

    // Create the analyzer (has default stop words)
    Analyzer analyzer = new StandardAnalyzer();

    // Create a set of documents to work with
    createDocs(ramDir, analyzer);

    // Query the set of documents
    queryDocs(ramDir, analyzer);
}

private static void createDocs(RAMDirectory ramDir, Analyzer analyzer) 
        throws IOException {
    // Setup the configuration for the index
    IndexWriterConfig config = new IndexWriterConfig(analyzer);
    config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);

    // IndexWriter creates and maintains the index
    IndexWriter writer = new IndexWriter(ramDir, config);

    // Create the documents
    indexDoc(writer, "document-1", "hello planet mercury");
    indexDoc(writer, "document-2", "hi PLANET venus");
    indexDoc(writer, "document-3", "howdy Planet Earth");
    indexDoc(writer, "document-4", "hey planet MARS");
    indexDoc(writer, "document-5", "ayee Planet jupiter");

    // Close down the writer
    writer.close();
}

private static void indexDoc(IndexWriter writer, String name, String content) 
        throws IOException {
    Document document = new Document();
    document.add(new TextField("name", name, Field.Store.YES));
    document.add(new TextField("body", content, Field.Store.YES));

    writer.addDocument(document);
}

private static void queryDocs(RAMDirectory ramDir, Analyzer analyzer) 
        throws IOException, ParseException {
    // IndexReader maintains access to the index
    IndexReader reader = DirectoryReader.open(ramDir);

    // IndexSearcher handles searching of an IndexReader
    IndexSearcher searcher = new IndexSearcher(reader);

    // Setup a query
    QueryParser parser = new QueryParser("body", analyzer);
    Query query = parser.parse("hey earth");

    // Search the index
    TopDocs foundDocs = searcher.search(query, 10);
    System.out.println("Total Hits: " + foundDocs.totalHits);

    for (ScoreDoc scoreDoc : foundDocs.scoreDocs) {
        // Get the doc from the index by id
        Document document = searcher.doc(scoreDoc.doc);
        System.out.println("Name: " + document.get("name") 
                + " - Body: " + document.get("body") 
                + " - Score: " + scoreDoc.score);
    }

    // Close down the reader
    reader.close();
}

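The significant parts to extend here are the analyzer and an understanding of the lucene query parser syntax.

The Analyzer is used by both indexing and querying to tell each how to parse text, so that both treat the text in the same way. It sets up how to tokenize (what to split on, whether to toLower(), and so on). The StandardAnalyzer splits on spaces and a few other characters (I don't have the list handy) and also appears to apply toLower().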

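The QueryParser does some of the work for you. In my example above I do two things: I tell the parser what the default field is, and I pass it the string hey earth. The parser turns this into a query that looks like body:hey body:earth, which looks for documents that have either hey or earth in body. Two documents will be found.

If we pass hey AND earth instead, the query is parsed as +body:hey +body:earth, which requires documents to contain both terms. Zero documents will be found.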
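To see why the counts come out as two and zero, here is a rough plain-Java simulation of the OR and AND cases over the five sample bodies from createDocs(). It does not use Lucene at all: the MatchSketch class and its analyze and search methods are hypothetical names, and the lowercase-and-split step is only a crude stand-in for what StandardAnalyzer does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class MatchSketch {
    // The five sample bodies indexed by createDocs() in the example.
    static final Map<String, String> DOCS = new LinkedHashMap<>();
    static {
        DOCS.put("document-1", "hello planet mercury");
        DOCS.put("document-2", "hi PLANET venus");
        DOCS.put("document-3", "howdy Planet Earth");
        DOCS.put("document-4", "hey planet MARS");
        DOCS.put("document-5", "ayee Planet jupiter");
    }

    // Crude stand-in for an analyzer: lowercase, then split on whitespace.
    static List<String> analyze(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }

    // requireAll=false mimics OR  (body:hey body:earth);
    // requireAll=true  mimics AND (+body:hey +body:earth).
    static List<String> search(String queryText, boolean requireAll) {
        List<String> queryTerms = analyze(queryText);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> doc : DOCS.entrySet()) {
            List<String> docTerms = analyze(doc.getValue());
            boolean matches = requireAll
                    ? docTerms.containsAll(queryTerms)
                    : queryTerms.stream().anyMatch(docTerms::contains);
            if (matches) {
                hits.add(doc.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(search("HEY Earth", false)); // [document-3, document-4]
        System.out.println(search("HEY Earth", true));  // []
    }
}
```

Because the query text goes through the same analyze step as the documents, HEY Earth and hey earth behave identically, which is the point of sharing one analyzer between indexing and querying.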

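To apply fuzzy matching, add a ~ to each term you want to be fuzzy. So if the query is hey~ earth, fuzziness is applied to hey and the query looks like body:hey~2 body:earth. Three documents will be found.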
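The jump from two to three documents can be checked by hand with an edit-distance calculation. The sketch below is plain Java, not Lucene's implementation: it uses classic Levenshtein distance (Lucene's fuzzy matching also counts a transposition as one edit by default, which makes no difference for these terms), and FuzzySketch, editDistance and within are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

class FuzzySketch {
    // Classic Levenshtein distance (insert, delete, substitute) using two rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the indexed terms that lie within maxEdits of the query term.
    static List<String> within(String queryTerm, int maxEdits, String... indexedTerms) {
        List<String> close = new ArrayList<>();
        for (String term : indexedTerms) {
            if (editDistance(queryTerm, term) <= maxEdits) {
                close.add(term);
            }
        }
        return close;
    }

    public static void main(String[] args) {
        // First word of each sample body, already lowercased.
        System.out.println(within("hey", 2, "hello", "hi", "howdy", "hey", "ayee")); // [hi, hey]
    }
}
```

With two edits allowed, hey reaches hi (document-2) and hey itself (document-4), while earth still matches document-3 exactly, which accounts for the three hits.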

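You can also write queries more directly and the parser still handles things. If you pass it hey name:\"document-1\" (the tokenizer splits on -), it creates a query like body:hey name:"document 1". Two documents are returned, because it looks for the phrase document 1 (it still tokenizes on the -). If I use hey name:document-1 instead, it produces body:hey (name:document name:1), which returns all documents since they all have document as a term. There is some nuance to understand here.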

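I'll try to cover a little more of how they are similar, using the match query documentation as the reference. Elastic says the main difference is that "It doesn't support field name prefixes, wildcard characters, or other "advanced" features." Those differences would probably stand out more going in the other direction.

Both the match query and the Lucene query, when working with an analyzed field, take the query string and apply the analyzer to it (tokenize it, toLower, and so on). So both will turn HEY Earth into a query that looks for the terms hey or earth.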

A match query can set the operator by providing "operator" : "and". This changes our query to look for hey and earth. The analog in Lucene is to do something like parser.setDefaultOperator(QueryParser.Operator.AND);

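Next is fuzziness. Both work with the same settings. I believe Elastic's "fuzziness": "AUTO" is roughly equivalent to Lucene's automatic behavior when ~ is applied to a query (though I think you have to add it to each term yourself, which is a little cumbersome).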
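Adding the suffix to every term by hand can be automated before the string reaches the parser. The sketch below assumes the length thresholds that the Elasticsearch documentation gives for AUTO (terms of 0 to 2 characters must match exactly, 3 to 5 allow one edit, longer terms allow two), which is worth noting because a bare ~ in the classic parser showed up as ~2 in the earlier example. AutoFuzziness and its methods are hypothetical names, and the whitespace split is only a stand-in for real analysis; it also does not escape query syntax characters, for which Lucene's QueryParser.escape could be applied to each term first.

```java
import java.util.StringJoiner;

class AutoFuzziness {
    // Edit distance per term length, following the documented AUTO defaults (3 and 6).
    static int autoMaxEdits(String term) {
        int length = term.codePointCount(0, term.length());
        if (length < 3) {
            return 0;
        }
        if (length < 6) {
            return 1;
        }
        return 2;
    }

    // Appends ~N to every whitespace-separated term that is allowed at least one edit.
    static String toFuzzyQueryString(String text) {
        StringJoiner out = new StringJoiner(" ");
        for (String term : text.trim().split("\\s+")) {
            int edits = autoMaxEdits(term);
            out.add(edits == 0 ? term : term + "~" + edits);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toFuzzyQueryString("Neural Disruptor")); // Neural~2 Disruptor~2
        System.out.println(toFuzzyQueryString("hey earth"));        // hey~1 earth~1
    }
}
```

The resulting string can then be handed to parser.parse(...) in place of the raw user input.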

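Zero terms query appears to be an Elastic construct. If you want the ALL setting, you would have to replicate the match-all query yourself whenever the analyzer removes every token from the query.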
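One way to approximate the ALL setting is to check whether anything survives analysis before parsing, and to fall back to the classic parser's match-all syntax (*:*) when nothing does. This is only a sketch under assumptions: the stop-word set is a small illustrative subset rather than the analyzer's real list, the whitespace split stands in for the analyzer, and ZeroTermsSketch and withZeroTermsAll are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class ZeroTermsSketch {
    // Illustrative subset only; the real list depends on the analyzer in use.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "to", "is"));

    // Crude analysis: lowercase, split on whitespace, drop stop words.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    // Returns "*:*" (match all) when no terms survive, otherwise the text unchanged.
    static String withZeroTermsAll(String text) {
        return analyze(text).isEmpty() ? "*:*" : text;
    }

    public static void main(String[] args) {
        System.out.println(withZeroTermsAll("to the"));    // *:*
        System.out.println(withZeroTermsAll("hey earth")); // hey earth
    }
}
```

In a real setup the check would run the same Analyzer the index uses, so that the decision matches what the parser would actually drop.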

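Cutoff frequency looks to be related to CommonTermsQuery. I haven't used it, so you may have some digging to do if you want it.

Lucene has a synonym filter that can be applied to an analyzer, but you may need to build the map yourself.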

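The differences you find will probably be in scoring. When I run the query hey earth against Lucene, document-3 and document-4 are both returned with a score of 1.3862944. When I run the query against Elasticsearch in the form of: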

curl -XPOST http://localhost:9200/index/_search?pretty -d '{
  "query" : {
    "match" : {
      "body" : "hey earth"
    }
  }
}'

I get the same documents, but with a score of 1.219939. You can run an explain on both of them. In Lucene, by printing for each document

System.out.println(searcher.explain(query, scoreDoc.doc));

and in Elastic by querying each document like

curl -XPOST http://localhost:9200/index/docs/3/_explain?pretty -d '{
  "query" : {
    "match" : {
      "body" : "hey earth"
    }
  }
}'

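I get some differences, but I can't explain them exactly. I do get a value of 1.3862944 for the doc, but the fieldLength is different, and that affects the weight.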
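The effect of fieldLength can be reproduced by hand with the BM25 formula that Lucene 7's default similarity prints in its explain output, using the defaults k1 = 1.2 and b = 0.75. In the sketch below the first call uses the numbers from the Lucene example (5 documents, the term in 1 of them, a 3-term field with an average length of 3). The second call is purely an assumption: a field length to average length ratio of 4/3 happens to reproduce the Elasticsearch score, but the actual values in that index are not given here. Bm25Sketch and its methods are hypothetical names.

```java
import java.util.Locale;

class Bm25Sketch {
    static final double K1 = 1.2;  // BM25 default in Lucene
    static final double B = 0.75;  // BM25 default in Lucene

    // idf = ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
    static double idf(long docCount, long docFreq) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // tfNorm = freq * (k1 + 1) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength))
    static double tfNorm(double freq, double fieldLength, double avgFieldLength) {
        return (freq * (K1 + 1)) / (freq + K1 * (1 - B + B * fieldLength / avgFieldLength));
    }

    // Score contribution of a single matching term.
    static double score(long docCount, long docFreq, double freq,
                        double fieldLength, double avgFieldLength) {
        return idf(docCount, docFreq) * tfNorm(freq, fieldLength, avgFieldLength);
    }

    public static void main(String[] args) {
        // Lucene example: each hit matches exactly one of the two query terms.
        System.out.println(String.format(Locale.ROOT, "%.6f", score(5, 1, 1, 3, 3))); // 1.386294
        // Assumed ratio fieldLength / avgFieldLength = 4 / 3 (not confirmed by the post).
        System.out.println(String.format(Locale.ROOT, "%.6f", score(5, 1, 1, 4, 3))); // 1.219939
    }
}
```

When the field length equals the average, tfNorm is exactly 1 and the score is just the idf, ln(4). Any difference in how the two systems count or average field lengths therefore shows up directly in the score, even with identical documents.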

Mimicking Elasticsearch MatchQuery

2 answers: