I am new to Lucene. I am using it from Java with Lucene-3.6.0.jar, and I followed the tutorial at http://www.tutorialspoint.com/lucene/. My basic code is as follows:
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TopDocs;

// Indexer, Searcher, LuceneConstants and TextFileFilter are the helper
// classes from the tutorial, assumed to live in the same package.
public class LuceneTester {
    String indexDir = "Data/Indexdir";
    String dataDir = "Data/Datadir";
    Indexer indexer;
    Searcher searcher;

    public static void test() {
        LuceneTester tester;
        try {
            tester = new LuceneTester();
            tester.createIndex();
            tester.search("malformed");
        } catch (IOException e) {
            e.printStackTrace();
        } catch (ParseException e) {
            e.printStackTrace();
        }
    }

    // Index every file in dataDir accepted by TextFileFilter.
    private void createIndex() throws IOException {
        indexer = new Indexer(indexDir);
        int numIndexed;
        long startTime = System.currentTimeMillis();
        numIndexed = indexer.createIndex(dataDir, new TextFileFilter());
        long endTime = System.currentTimeMillis();
        indexer.close();
        System.out.println(numIndexed + " File indexed, time taken: "
                + (endTime - startTime) + " ms");
    }

    // Run a fuzzy query on the contents field and print the matching file paths.
    private void search(String searchQuery) throws IOException, ParseException {
        searcher = new Searcher(indexDir);
        long startTime = System.currentTimeMillis();
        Term term = new Term(LuceneConstants.CONTENTS, searchQuery);
        Query query = new FuzzyQuery(term);
        System.out.println("Query: " + query.toString());
        TopDocs hits = searcher.search(query, Sort.RELEVANCE);
        long endTime = System.currentTimeMillis();
        System.out.println(hits.totalHits + " documents found. Time :"
                + (endTime - startTime));
        for (ScoreDoc scoreDoc : hits.scoreDocs) {
            Document doc = searcher.getDocument(scoreDoc);
            System.out.println("File: " + doc.get(LuceneConstants.FILE_PATH));
        }
        searcher.close();
    }
}
Now I want to use BM25 similarity instead of the default scoring. How can I do that?
Answer 0 (score: 1)
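Lucene versions before 4.0 do not expose all the information BM25 needs, namely document-level IDF and the average field length, so BM25 cannot be implemented directly in such old versions. You could store the required information externally and/or approximate it; see http://www.slideshare.net/yuvalf/bm25-scoring-for-lucene-from-academia-to-industry for one idea.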
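To make it concrete why those statistics matter, here is a minimal sketch of the BM25 formula in plain Java. It is not Lucene code: the class name `Bm25Sketch`, its method names, and the sample numbers are made up for illustration, and the constants k1 = 1.2 and b = 0.75 are the commonly used defaults. It assumes you already have the document count, the term's document frequency, the field length, and the average field length at hand.

```java
/** Illustrative BM25 scoring sketch; all names here are hypothetical, not Lucene API. */
final class Bm25Sketch {
    // Commonly used BM25 defaults (assumed here, tune as needed).
    static final double K1 = 1.2;
    static final double B = 0.75;

    /** Inverse document frequency: rarer terms get a larger weight. */
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** BM25 score of a single term in a single document field. */
    static double score(double termFreq, double fieldLength, double avgFieldLength,
                        long docCount, long docFreq) {
        // Length normalization: fields longer than average are penalized.
        double lengthNorm = 1.0 - B + B * (fieldLength / avgFieldLength);
        // Saturating term-frequency component.
        double tfPart = termFreq * (K1 + 1.0) / (termFreq + K1 * lengthNorm);
        return idf(docCount, docFreq) * tfPart;
    }

    public static void main(String[] args) {
        // Same term frequency, but one field is shorter than average and one is longer.
        double shortDoc = score(2, 50, 100, 1000, 10);
        double longDoc = score(2, 200, 100, 1000, 10);
        System.out.println("short=" + shortDoc + " long=" + longDoc);
    }
}
```

Note that `score` cannot be evaluated without `avgFieldLength`, which is one of the statistics older Lucene versions do not provide.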
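Starting with 4.0, Lucene includes an (initially experimental) implementation of BM25: https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/BM25Similarity.html
As femtoRgon suggested, Lucene 6 or newer gives you BM25 out of the box. If that is not an option for you, you could at least move to Lucene 4+, where you can change the default similarity to BM25: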
IndexSearcher searcher = ...
searcher.setSimilarity(new BM25Similarity());