I use StandardAnalyzer to index my text. At query time, however, I run both term queries and phrase queries. I don't think Lucene has any problem computing the term frequency and the phrase frequency for these, and that is enough for a model like Dirichlet similarity. But BM25Similarity or TFIDFSimilarity also need IDF(term) and IDF(phrase). How does Lucene handle this?
Answer (score: 1):
For TFIDFSimilarity, the phrase IDF is computed as the sum of the IDFs of its constituent terms. That is: idf("ab cd") = idf(ab) + idf(cd).
That value is then multiplied by the phrase frequency and used for scoring much like a single term.
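As a rough sketch of that arithmetic (not the actual Lucene source), the per-term IDF under DefaultSimilarity is 1 + ln(maxDocs / (docFreq + 1)), and the phrase IDF is just those values added together. The helper below (phraseIdf, field, and words are illustrative names) reproduces the numbers that show up in the explain output further down:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;

    // Conceptual sketch only: sum DefaultSimilarity-style idf values over the
    // terms of a phrase, mirroring the "idf(), sum of:" node in explain output.
    static float phraseIdf(IndexReader reader, String field, String... words) throws IOException {
        float idf = 0.0f;
        for (String word : words) {
            int docFreq = reader.docFreq(new Term(field, word));
            // DefaultSimilarity idf: 1 + ln(maxDocs / (docFreq + 1))
            idf += 1.0f + (float) Math.log(reader.maxDoc() / (double) (docFreq + 1));
        }
        return idf; // e.g. 0.7768564 + 1.287682 = 2.0645385 for the example below
    }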
To see the whole story, I think it makes the most sense to look at an example. IndexSearcher.explain is a very useful tool for understanding scoring.
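For reference, here is a minimal sketch of how such an explanation can be produced, assuming a Lucene 4.x-style API (which matches the DefaultSimilarity shown in the output) and an already-built IndexSearcher; the field name content is taken from the output below, the rest is illustrative:

    import java.io.IOException;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.*;

    // Build the query ["text ab" unique] as a phrase clause plus a term clause,
    // both optional (SHOULD), then print the scoring explanation for each hit.
    static void explainHits(IndexSearcher searcher) throws IOException {
        PhraseQuery phrase = new PhraseQuery();
        phrase.add(new Term("content", "text"));
        phrase.add(new Term("content", "ab"));

        BooleanQuery query = new BooleanQuery();
        query.add(phrase, BooleanClause.Occur.SHOULD);
        query.add(new TermQuery(new Term("content", "unique")), BooleanClause.Occur.SHOULD);

        TopDocs hits = searcher.search(query, 10);
        for (ScoreDoc hit : hits.scoreDocs) {
            // Explanation.toString() prints the scoring tree shown below
            System.out.println(searcher.explain(query, hit.doc));
        }
    }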
Index:
Query: "text ab" unique
Explain output for the first (top-scoring) hit (doc 0):
1.3350155 = (MATCH) sum of:
  0.7981777 = (MATCH) weight(content:"text ab" in 0) [DefaultSimilarity], result of:
    0.7981777 = score(doc=0,freq=1.0 = phraseFreq=1.0
), product of:
      0.7732263 = queryWeight, product of:
        2.0645385 = idf(), sum of:
          0.7768564 = idf(docFreq=4, maxDocs=4)
          1.287682 = idf(docFreq=2, maxDocs=4)
        0.37452745 = queryNorm
      1.0322692 = fieldWeight in 0, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = phraseFreq=1.0
        2.0645385 = idf(), sum of:
          0.7768564 = idf(docFreq=4, maxDocs=4)
          1.287682 = idf(docFreq=2, maxDocs=4)
        0.5 = fieldNorm(doc=0)
  0.5368378 = (MATCH) weight(content:unique in 0) [DefaultSimilarity], result of:
    0.5368378 = score(doc=0,freq=1.0 = termFreq=1.0
), product of:
      0.6341301 = queryWeight, product of:
        1.6931472 = idf(docFreq=1, maxDocs=4)
        0.37452745 = queryNorm
      0.8465736 = fieldWeight in 0, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        1.6931472 = idf(docFreq=1, maxDocs=4)
        0.5 = fieldNorm(doc=0)
Note that the top half, which handles the "text ab" part of the query, is scored in very much the same way as the bottom half (which scores unique), except that the phrase idf is computed as a sum over the phrase's terms.
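Concretely, reading the two idf leaves under the phrase's idf() node in the output above:

    idf("text ab") = idf(docFreq=4, maxDocs=4) + idf(docFreq=2, maxDocs=4)
                   = 0.7768564 + 1.287682
                   = 2.0645385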
Explain output for another hit, for good measure (doc 2):
0.49384725 = (MATCH) product of:
  0.9876945 = (MATCH) sum of:
    0.9876945 = (MATCH) weight(content:"text ab" in 2) [DefaultSimilarity], result of:
      0.9876945 = score(doc=2,freq=2.0 = phraseFreq=2.0
), product of:
        0.7732263 = queryWeight, product of:
          2.0645385 = idf(), sum of:
            0.7768564 = idf(docFreq=4, maxDocs=4)
            1.287682 = idf(docFreq=2, maxDocs=4)
          0.37452745 = queryNorm
        1.277368 = fieldWeight in 2, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = phraseFreq=2.0
          2.0645385 = idf(), sum of:
            0.7768564 = idf(docFreq=4, maxDocs=4)
            1.287682 = idf(docFreq=2, maxDocs=4)
          0.4375 = fieldNorm(doc=2)
  0.5 = coord(1/2)
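The trailing 0.5 = coord(1/2) in this second explanation is DefaultSimilarity's coordination factor: doc 2 matches only one of the query's two clauses (the phrase, but not unique), so its summed score is multiplied by 1/2.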