Porting code from Lucene to Elasticsearch

Date: 2017-05-01 15:59:19

Tags: java elasticsearch lucene

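I have the following simple code that I want to port from Lucene 6.5.x to Elasticsearch 5.3.x.

However, the scores differ, and I would like to get the same scoring results as in Lucene.

Take the idf, for example:

Lucene's docFreq is 3 (3 documents contain the term "d") and its docCount is 4 (documents that have this field). Elasticsearch reports a docFreq of 1 and a docCount of 2 (or 1 and 1). I am not sure how these values relate to each other in Elasticsearch...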
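
To make the effect of these numbers concrete, they can be plugged into the idf formula used by BM25Similarity (the default similarity in Lucene 6.x and Elasticsearch 5.x): ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)). The sketch below only checks the arithmetic. The class name Bm25Idf is made up for illustration, and it assumes the default similarity has not been changed on either side.

```java
public class Bm25Idf {
    // idf as defined by BM25Similarity in Lucene 6.x:
    // ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
    // Lucene stores the result as a float, so the last digits may differ.
    public static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        // Values reported by Lucene in the question
        System.out.println("docFreq=3, docCount=4 -> " + idf(3, 4));
        // Values reported by Elasticsearch in the question
        System.out.println("docFreq=1, docCount=2 -> " + idf(1, 2));
        System.out.println("docFreq=1, docCount=1 -> " + idf(1, 1));
    }
}
```

The three cases come out at roughly 0.357, 0.693 and 0.288, so the idf part of the score alone already differs between the two systems.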

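Another difference in the scoring is avgFieldLength:

Lucene's value is correct: 14 / 4 = 3.5. In Elasticsearch it is different for each scored hit, although it should be the same for all documents...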
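
The 3.5 can be reproduced from the four titles: BM25Similarity derives avgFieldLength as the total number of tokens in the field divided by the number of documents that have the field. Below is a minimal sketch of that calculation. Splitting on whitespace is a stand-in for StandardAnalyzer (it yields the same tokens for these inputs), and the class name AvgFieldLength is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

public class AvgFieldLength {
    // Total token count divided by the number of documents with the field.
    // Assumes every value is a non-empty, whitespace-separated string.
    public static double avgFieldLength(List<String> fieldValues) {
        long sumTotalTermFreq = 0;
        for (String value : fieldValues) {
            sumTotalTermFreq += value.trim().split("\\s+").length;
        }
        return (double) sumTotalTermFreq / fieldValues.size();
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList("b c d d d", "b c d d", "b c d", "b c");
        // 5 + 4 + 3 + 2 = 14 tokens over 4 documents
        System.out.println(avgFieldLength(titles));
    }
}
```

Any value other than 3.5 therefore suggests that the statistics were computed over a subset of the four documents.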

So which settings or mappings am I missing in Elasticsearch to make it behave like Lucene?

IndexingExample.java:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.search.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.io.IOException;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class IndexingExample {
    private static final String INDEX_DIR = "/tmp/lucene6idx";

    private IndexWriter createWriter() throws IOException {
        FSDirectory dir = FSDirectory.open(Paths.get(INDEX_DIR));
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        return new IndexWriter(dir, config);
    }

    private List<Document> createDocs() {
        List<Document> docs = new ArrayList<>();
        FieldType summaryType = new FieldType();
        summaryType.setIndexOptions(IndexOptions.DOCS_AND_FREQS);
        summaryType.setStored(true);
        summaryType.setTokenized(true);

        Document doc1 = new Document();
        doc1.add(new Field("title", "b c d d d", summaryType));
        docs.add(doc1);
        Document doc2 = new Document();
        doc2.add(new Field("title", "b c d d", summaryType));
        docs.add(doc2);
        Document doc3 = new Document();
        doc3.add(new Field("title", "b c d", summaryType));
        docs.add(doc3);
        Document doc4 = new Document();
        doc4.add(new Field("title", "b c", summaryType));
        docs.add(doc4);

        return docs;
    }

    private IndexSearcher createSearcher() throws IOException {
        Directory dir = FSDirectory.open(Paths.get(INDEX_DIR));
        IndexReader reader = DirectoryReader.open(dir);
        return new IndexSearcher(reader);
    }

    public static void main(String[] args) throws IOException, ParseException {
        // indexing
        IndexingExample app = new IndexingExample();
        IndexWriter writer = app.createWriter();
        writer.deleteAll();
        List<Document> docs = app.createDocs();
        writer.addDocuments(docs);
        writer.commit();
        writer.close();

        // search
        IndexSearcher searcher = app.createSearcher();
        Query q1 = new TermQuery(new Term("title", "d"));
        TopDocs hits = searcher.search(q1, 20);
        System.out.println(hits.totalHits + " docs found for the query \"" + q1.toString() + "\"");
        for (ScoreDoc sd : hits.scoreDocs) {
            Explanation expl = searcher.explain(q1, sd.doc);
            System.out.println(expl);
        }
    }
}

Elasticsearch:

DELETE twitter

PUT twitter/tweet/1
{
    "title" : "b c d d d"
}

PUT twitter/tweet/2
{
    "title" : "b c d d"
}

PUT twitter/tweet/3
{
    "title" : "b c d"
}

PUT twitter/tweet/4
{
    "title" : "b c"
}

POST /twitter/tweet/_search
{
    "explain": true, 
    "query": {
        "term" : {
            "title" : "d"
        }
    }
}

1 answer:

Answer 0 (score: 0)

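Solved with the help of jimczy:

Don't forget that ES creates an index with 5 shards by default, and that docFreq and docCount are computed per shard. You can create an index with 1 shard or use the dfs mode to compute distributed stats: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-search-type.html#dfs-query-then-fetch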
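
To illustrate why per-shard statistics lead to the numbers in the question, the sketch below computes docFreq, docCount and avgFieldLength separately for each shard. The shard assignment is invented for illustration: Elasticsearch actually routes documents by a hash of their id, and this layout was chosen only because it reproduces the docFreq/docCount pairs reported above. The class and method names are hypothetical too.

```java
import java.util.Arrays;
import java.util.List;

public class ShardStats {
    // Returns {docFreq, docCount, sumTotalTermFreq} for the documents of one shard.
    public static long[] stats(List<String> titles, String term) {
        long docFreq = 0;
        long sumTotalTermFreq = 0;
        for (String title : titles) {
            List<String> tokens = Arrays.asList(title.split("\\s+"));
            sumTotalTermFreq += tokens.size();
            if (tokens.contains(term)) {
                docFreq++;
            }
        }
        return new long[] {docFreq, titles.size(), sumTotalTermFreq};
    }

    static void print(String name, List<String> titles) {
        long[] s = stats(titles, "d");
        System.out.println(name + ": docFreq=" + s[0] + " docCount=" + s[1]
                + " avgFieldLength=" + (double) s[2] / s[1]);
    }

    public static void main(String[] args) {
        // ASSUMPTION: hypothetical shard layout, not the real routing result
        print("shard A", Arrays.asList("b c d", "b c"));
        print("shard B", Arrays.asList("b c d d d"));
        print("shard C", Arrays.asList("b c d d"));
        // Statistics over all four documents, as a single-shard index would see them
        print("global ", Arrays.asList("b c d d d", "b c d d", "b c d", "b c"));
    }
}
```

Under this assumed layout, shard A sees docFreq=1 and docCount=2 with an avgFieldLength of 2.5, while shards B and C each see 1 and 1 with an avgFieldLength of 5.0 and 4.0. Only the global row matches Lucene's 3, 4 and 3.5, which is what a single-shard index or the dfs mode is meant to restore.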

This search query (using dfs_query_then_fetch) works as expected:

POST /twitter/tweet/_search?search_type=dfs_query_then_fetch
{
    "explain": true,
    "query": {
        "term" : {
            "title" : "d"
        }
    }
}