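I know how to boost a field at index time or at query time. But how can I increase the score of a match that occurs near the beginning of the title?
Example: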
Query = "lucene"
Doc1 title = "Lucene: Homepage"
Doc2 title = "I have a question about lucene?"
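I would like the first document to score higher, because "lucene" is closer to the beginning (ignoring term frequency for now).
I can see how to use a SpanQuery to specify the proximity between terms, but I'm not sure how to make use of a term's position within the field.
I am using Lucene 4.1 in Java.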
Answer 0 (score: 10)
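I would use a SpanFirstQuery, which matches terms near the beginning of a field. Like all span queries, it relies on positions, which are enabled by default when indexing in Lucene.
Let's test it in isolation: you only need to provide a SpanTermQuery and the maximum end position at which the term may be found (1 in my example, meaning the term must be the very first token of the field).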
SpanTermQuery spanTermQuery = new SpanTermQuery(new Term("title", "lucene"));
SpanFirstQuery spanFirstQuery = new SpanFirstQuery(spanTermQuery, 1);
If both documents were analyzed with StandardAnalyzer, this query finds only the first document, the one titled "Lucene: Homepage".
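To make the meaning of the end argument concrete, here is a small plain-Java model of the check, with no Lucene dependency. It is only a sketch under stated assumptions: the tokenizer is a crude stand-in for StandardAnalyzer (lowercase, split on anything that is not a letter or digit), and the class and method names (SpanFirstModel, tokenize, matchesFirst) are hypothetical, not part of Lucene.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SpanFirstModel {
    // Crude stand-in for an analyzer (assumption): lowercase the text and
    // split on every run of characters that are not letters or digits.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // True if `term` sits at some 0-based position p with p + 1 <= end,
    // i.e. the single-term span [p, p + 1) ends at or before `end`.
    static boolean matchesFirst(String title, String term, int end) {
        List<String> tokens = tokenize(title);
        int limit = Math.min(end, tokens.size());
        for (int p = 0; p < limit; p++) {
            if (tokens.get(p).equals(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(matchesFirst("Lucene: Homepage", "lucene", 1));
        System.out.println(matchesFirst("I have a question about lucene?", "lucene", 1));
        System.out.println(matchesFirst("I have a question about lucene?", "lucene", 6));
    }
}
```

Running main prints true, false, true: with end = 1 only a title whose first token is "lucene" matches, while end = 6 is wide enough to also accept the second title, where "lucene" is the sixth token.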
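Now we can combine the SpanFirstQuery above with a normal text query so that the span query only affects the score. You can do this easily with a BooleanQuery, adding the term query as a MUST clause and the span query as a SHOULD clause, like this: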
Term term = new Term("title", "lucene");
TermQuery termQuery = new TermQuery(term);
SpanFirstQuery spanFirstQuery = new SpanFirstQuery(new SpanTermQuery(term), 1);
BooleanQuery booleanQuery = new BooleanQuery();
booleanQuery.add(new BooleanClause(termQuery, BooleanClause.Occur.MUST));
booleanQuery.add(new BooleanClause(spanFirstQuery, BooleanClause.Occur.SHOULD));
There may be other ways to achieve the same thing, perhaps with a CustomScoreQuery or custom scoring code, but this seems to me the simplest.
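The effect of the MUST + SHOULD combination can be pictured with a toy model in plain Java. This is a rough sketch, not Lucene's scoring: it ignores TF-IDF, length normalization, coord and query normalization, and simply gives every hit a fixed base score plus a fixed bonus when the term also falls within the first end positions. The names MustShouldModel, Hit and search, and the score constants, are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

class MustShouldModel {
    static final class Hit {
        final String title;
        final float score;

        Hit(String title, float score) {
            this.title = title;
            this.score = score;
        }
    }

    // Crude stand-in for an analyzer (assumption): lowercase, split on non-alphanumerics.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    static List<Hit> search(List<String> titles, String term, int end, float base, float bonus) {
        List<Hit> hits = new ArrayList<>();
        for (String title : titles) {
            int pos = tokenize(title).indexOf(term); // first occurrence, or -1
            if (pos < 0) {
                continue;                            // MUST clause failed: not a hit
            }
            float score = base;
            if (pos < end) {
                score += bonus;                      // SHOULD clause matched: score only
            }
            hits.add(new Hit(title, score));
        }
        // Highest score first.
        hits.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return hits;
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList(
                "I have a question about lucene",
                "Lucene: I have a really hard question about it");
        for (Hit hit : search(titles, "lucene", 1, 1.0f, 0.5f)) {
            System.out.println(hit.title + " - score: " + hit.score);
        }
    }
}
```

The point of the sketch is that the optional clause never changes which documents are returned; it only moves the ones that match near the start to the top.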
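The code I used to test this prints the following output (scores included): first for the TermQuery alone, then for the SpanFirstQuery alone, and finally for the BooleanQuery that combines them: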
------ TermQuery --------
Total hits: 2
title: I have a question about lucene - score: 0.26010898
title: Lucene: I have a really hard question about it - score: 0.22295055
------ SpanFirstQuery --------
Total hits: 1
title: Lucene: I have a really hard question about it - score: 0.15764984
------ BooleanQuery: TermQuery (MUST) + SpanFirstQuery (SHOULD) --------
Total hits: 2
title: Lucene: I have a really hard question about it - score: 0.26912516
title: I have a question about lucene - score: 0.09196242
Here is the full code:
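import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.spans.SpanFirstQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// The class name is arbitrary; the original snippet showed only the methods.
public class SpanFirstDemo {

    public static void main(String[] args) throws Exception {
        Directory directory = FSDirectory.open(new File("data"));
        index(directory);
        IndexReader indexReader = DirectoryReader.open(directory);
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        Term term = new Term("title", "lucene");
        System.out.println("------ TermQuery --------");
        TermQuery termQuery = new TermQuery(term);
        search(indexSearcher, termQuery);
        System.out.println("------ SpanFirstQuery --------");
        SpanFirstQuery spanFirstQuery = new SpanFirstQuery(new SpanTermQuery(term), 1);
        search(indexSearcher, spanFirstQuery);
        System.out.println("------ BooleanQuery: TermQuery (MUST) + SpanFirstQuery (SHOULD) --------");
        BooleanQuery booleanQuery = new BooleanQuery();
        booleanQuery.add(new BooleanClause(termQuery, BooleanClause.Occur.MUST));
        booleanQuery.add(new BooleanClause(spanFirstQuery, BooleanClause.Occur.SHOULD));
        search(indexSearcher, booleanQuery);
    }

    // Indexes the two test titles with positions enabled (required by span queries).
    private static void index(Directory directory) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_41, new StandardAnalyzer(Version.LUCENE_41));
        IndexWriter writer = new IndexWriter(directory, config);
        FieldType titleFieldType = new FieldType();
        titleFieldType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
        titleFieldType.setIndexed(true);
        titleFieldType.setStored(true);
        Document document = new Document();
        document.add(new Field("title", "I have a question about lucene", titleFieldType));
        writer.addDocument(document);
        document = new Document();
        document.add(new Field("title", "Lucene: I have a really hard question about it", titleFieldType));
        writer.addDocument(document);
        writer.close();
    }

    // Runs the query and prints every stored field of each hit together with its score.
    private static void search(IndexSearcher indexSearcher, Query query) throws Exception {
        TopDocs topDocs = indexSearcher.search(query, 10);
        System.out.println("Total hits: " + topDocs.totalHits);
        for (ScoreDoc hit : topDocs.scoreDocs) {
            Document result = indexSearcher.doc(hit.doc);
            for (IndexableField field : result) {
                System.out.println(field.name() + ": " + field.stringValue() + " - score: " + hit.score);
            }
        }
    }
}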
Answer 1 (score: 0)
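From the book "Lucene in Action", 2nd edition:
"Lucene provides a built-in query, PayloadTermQuery, in the package org.apache.lucene.search.payloads. This query is just like SpanTermQuery in that it matches all documents containing the specified term and keeps track of the actual occurrences (spans) of the matches.
But then it goes further by enabling you to contribute a scoring factor based on the payloads that appear at each term's occurrence. To do this, you'll have to create your own Similarity class that defines the scorePayload method, like this:"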
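public class BoostingSimilarity extends DefaultSimilarity {
    // Signature as printed in the book (Lucene 3.x). In Lucene 4.x the method
    // is scorePayload(int doc, int start, int end, BytesRef payload).
    public float scorePayload(int docID, String fieldName,
                              int start, int end, byte[] payload,
                              int offset, int length) {
        ....
    }
}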
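The "start" in the code above is simply the start position of the payload. The payload is associated with the term, so the start position applies to the term as well (at least that is what I believe...).
By using the code above but ignoring the payload, you get access to the "start" position at scoring time, and you can then boost the score based on that value.
For example: new score = original score * (1.0f / (start + 1)). The + 1 is needed because positions are 0-based, so a bare 1.0f / start would divide by zero for a match on the first token.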
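As a quick illustration of that position-based decay, here is the formula on its own in plain Java. It is a minimal sketch and is not wired into Similarity.scorePayload; the class and method names (PositionDecay, boost) are hypothetical, and it assumes 0-based positions, hence the + 1 in the denominator.

```java
class PositionDecay {
    // Scales a score down the further the match starts from the beginning of
    // the field. `start` is assumed to be a 0-based token position, so the
    // + 1 keeps the factor finite when the match is on the very first token.
    static float boost(float originalScore, int start) {
        if (start < 0) {
            throw new IllegalArgumentException("start must be >= 0");
        }
        return originalScore * (1.0f / (start + 1));
    }

    public static void main(String[] args) {
        // "lucene" as the first token versus the sixth token of a title.
        System.out.println("position 0: " + boost(1.0f, 0));
        System.out.println("position 5: " + boost(1.0f, 5));
    }
}
```

A match at position 0 keeps its full score, while a match at position 5 is scaled to one sixth. A gentler curve (for example a logarithmic one) may be preferable if this decay turns out to be too steep.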
I hope the above works. If you find another solution that works, please post it here.