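I have the following simple code that I want to port from Lucene 6.5.x to Elasticsearch 5.3.x.
However, the scores differ, and I want to get the same scoring results as in Lucene.
For example, the idf:
In Lucene, docFreq is 3 (3 documents contain the term "d") and docCount is 4 (documents that have this field). Elasticsearch reports a docFreq of 1 and a docCount of 2 (or 1 and 1). I am not sure how these values relate to each other in Elasticsearch...
Another difference in the scoring is avgFieldLength:
Lucene is right: 14 / 4 = 3.5. In Elasticsearch the value differs for each scored result, but it should be the same for all documents...
Which settings or mappings am I missing in Elasticsearch to make it behave like Lucene?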
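For reference, here is a minimal sketch of how the numbers above turn into different scores. It assumes both sides use the default BM25Similarity and the idf formula that Lucene 6.x prints in its explain output. The class and method names are made up for illustration, and the sketch uses double where Lucene uses float.

```java
// Sketch only: recomputes the BM25 idf and the average field length
// from the statistics quoted in the question.
final class Bm25StatsSketch {

    // idf = ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
    static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // avgFieldLength = total number of tokens in the field / documents having the field
    static double avgFieldLength(long sumTotalTermFreq, long docCount) {
        return (double) sumTotalTermFreq / docCount;
    }

    public static void main(String[] args) {
        // Lucene: docFreq = 3, docCount = 4
        System.out.printf("Lucene idf(3, 4)        = %.4f%n", idf(3, 4));
        // Elasticsearch, as reported by explain: (1, 2) or (1, 1)
        System.out.printf("Elasticsearch idf(1, 2) = %.4f%n", idf(1, 2));
        System.out.printf("Elasticsearch idf(1, 1) = %.4f%n", idf(1, 1));
        // Field lengths of the four documents: 5 + 4 + 3 + 2 = 14 tokens
        System.out.printf("avgFieldLength(14, 4)   = %.1f%n", avgFieldLength(14, 4));
    }
}
```

Because the idf depends only on docFreq and docCount, any difference in those two inputs is enough to change the score of every hit.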
IndexingExample.java:
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.search.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.io.IOException;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class IndexingExample {

    private static final String INDEX_DIR = "/tmp/lucene6idx";

    private IndexWriter createWriter() throws IOException {
        FSDirectory dir = FSDirectory.open(Paths.get(INDEX_DIR));
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        return new IndexWriter(dir, config);
    }

    private List<Document> createDocs() {
        List<Document> docs = new ArrayList<>();

        // Tokenized, stored field that indexes documents and term frequencies
        FieldType summaryType = new FieldType();
        summaryType.setIndexOptions(IndexOptions.DOCS_AND_FREQS);
        summaryType.setStored(true);
        summaryType.setTokenized(true);

        Document doc1 = new Document();
        doc1.add(new Field("title", "b c d d d", summaryType));
        docs.add(doc1);

        Document doc2 = new Document();
        doc2.add(new Field("title", "b c d d", summaryType));
        docs.add(doc2);

        Document doc3 = new Document();
        doc3.add(new Field("title", "b c d", summaryType));
        docs.add(doc3);

        Document doc4 = new Document();
        doc4.add(new Field("title", "b c", summaryType));
        docs.add(doc4);

        return docs;
    }

    private IndexSearcher createSearcher() throws IOException {
        Directory dir = FSDirectory.open(Paths.get(INDEX_DIR));
        IndexReader reader = DirectoryReader.open(dir);
        return new IndexSearcher(reader);
    }

    public static void main(String[] args) throws IOException, ParseException {
        // indexing
        IndexingExample app = new IndexingExample();
        IndexWriter writer = app.createWriter();
        writer.deleteAll();
        List<Document> docs = app.createDocs();
        writer.addDocuments(docs);
        writer.commit();
        writer.close();

        // search
        IndexSearcher searcher = app.createSearcher();
        Query q1 = new TermQuery(new Term("title", "d"));
        TopDocs hits = searcher.search(q1, 20);
        System.out.println(hits.totalHits + " docs found for the query \"" + q1.toString() + "\"");

        // print the score explanation (idf, docFreq, docCount, avgFieldLength) for each hit
        for (ScoreDoc sd : hits.scoreDocs) {
            Explanation expl = searcher.explain(q1, sd.doc);
            System.out.println(expl);
        }
    }
}
Elasticsearch:
DELETE twitter

PUT twitter/tweet/1
{
  "title" : "b c d d d"
}

PUT twitter/tweet/2
{
  "title" : "b c d d"
}

PUT twitter/tweet/3
{
  "title" : "b c d"
}

PUT twitter/tweet/4
{
  "title" : "b c"
}

POST /twitter/tweet/_search
{
  "explain": true,
  "query": {
    "term" : {
      "title" : "d"
    }
  }
}
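Answer 0 (score: 0)
Solved the problem with jimczy's help:
"Don't forget that ES creates an index with 5 shards by default and that docFreq and docCount are computed per shard. You can create an index with 1 shard or use the dfs mode to compute distributed stats: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-search-type.html#dfs-query-then-fetch"
This search query (dfs_query_then_fetch) works as expected: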
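To illustrate the per-shard effect, here is a rough simulation in plain Java, with no Elasticsearch involved. The assignment of documents to shards below is hypothetical (real routing hashes the document id), and all class and method names are invented for this sketch. It only shows that statistics computed per shard differ from the merged ones, which is roughly what the dfs mode gathers before scoring.

```java
import java.util.Arrays;
import java.util.List;

// Sketch only: per-shard versus merged term statistics for the four example documents.
final class ShardStatsSketch {

    // Statistics for one term in one field, as seen by a single shard.
    static final class Stats {
        long docFreq;          // documents containing the term
        long docCount;         // documents that have the field
        long sumTotalTermFreq; // total number of tokens in the field

        double avgFieldLength() {
            return docCount == 0 ? 0.0 : (double) sumTotalTermFreq / docCount;
        }

        // BM25 idf as printed in the Lucene 6.x explain output
        double idf() {
            return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
        }
    }

    // Collect the statistics of one shard by whitespace-tokenizing its documents.
    static Stats collect(List<String> docs, String term) {
        Stats s = new Stats();
        for (String doc : docs) {
            List<String> tokens = Arrays.asList(doc.split(" "));
            s.docCount++;
            s.sumTotalTermFreq += tokens.size();
            if (tokens.contains(term)) {
                s.docFreq++;
            }
        }
        return s;
    }

    // Sum the per-shard statistics into index-wide statistics.
    static Stats merge(List<Stats> perShard) {
        Stats total = new Stats();
        for (Stats s : perShard) {
            total.docFreq += s.docFreq;
            total.docCount += s.docCount;
            total.sumTotalTermFreq += s.sumTotalTermFreq;
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical placement of the four documents on three of the five shards
        List<Stats> shards = Arrays.asList(
                collect(Arrays.asList("b c d d d", "b c"), "d"),
                collect(Arrays.asList("b c d d"), "d"),
                collect(Arrays.asList("b c d"), "d"));

        for (Stats s : shards) {
            System.out.printf("shard:  docFreq=%d docCount=%d avgFieldLength=%.1f idf=%.4f%n",
                    s.docFreq, s.docCount, s.avgFieldLength(), s.idf());
        }
        Stats all = merge(shards);
        System.out.printf("merged: docFreq=%d docCount=%d avgFieldLength=%.1f idf=%.4f%n",
                all.docFreq, all.docCount, all.avgFieldLength(), all.idf());
    }
}
```

With this placement the shards report (docFreq, docCount) pairs of (1, 2), (1, 1) and (1, 1), matching the values from the question, while the merged statistics are 3, 4 and an average field length of 3.5, as in Lucene. The other option from the quote, a single shard, corresponds to creating the index with the index.number_of_shards setting set to 1.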
POST /twitter/tweet/_search?search_type=dfs_query_then_fetch
{
  "explain": true,
  "query": {
    "term" : {
      "title" : "d"
    }
  }
}