How to use query-term TF-IDF as a factor in document similarity scoring in Lucene

Date: 2014-05-18 13:36:45

Tags: lucene information-retrieval

I am trying to implement Explicit Semantic Analysis (ESA) with Lucene.

When matching documents, how can I take the TF-IDF of the query terms into account?

For example:

  • Query: "a b c a d a"
  • Doc1: "a b a"
  • Doc2: "a b c"

The query should preferentially match Doc1.

I would like this to work without hurting performance.

Right now I do this with query boosting: I boost each query term in proportion to its TF-IDF.
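
Concretely, what I do looks roughly like the sketch below (the field name "text" and the precomputed weight map are just placeholders for illustration; the API shown is the Lucene 4.x style):

    import java.util.Map;

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    // One SHOULD clause per query term, boosted by a weight computed beforehand
    // (for example, the term's TF-IDF within the query).
    static BooleanQuery buildWeightedQuery(Map<String, Float> termWeights) {
        BooleanQuery query = new BooleanQuery();
        for (Map.Entry<String, Float> entry : termWeights.entrySet()) {
            TermQuery clause = new TermQuery(new Term("text", entry.getKey()));
            clause.setBoost(entry.getValue());   // hypothetical weight, e.g. the term's TF-IDF
            query.add(clause, BooleanClause.Occur.SHOULD);
        }
        return query;
    }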

Is there a better way?

1 Answer:

Answer 0 (score: 1):

Lucene certainly supports TF/IDF scoring by default, so I'm not quite sure I understand what you are looking for.

It actually sounds a bit like you want to weight the query terms according to the TF/IDF of the query itself. So let's consider those two components:

  • TF: Lucene sums the scores of the individual query terms. If the same query term appears twice in the query (as in field:(a a b)), the doubled term carries twice the weight, equivalent to giving it a boost of 2 (see the sketch after this list).

  • IDF: idf refers to statistics across a multi-document corpus. Since there is only one query, it does not apply here; or, if you want to get technical, the idf of every term is 1.
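
To make the TF point concrete, here is a minimal sketch (Lucene 4.x-style API; the field name "text" is an assumption) of the two ways of emphasizing a term that the first bullet describes:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    // Sketch: two ways to emphasize "a" in a query over a hypothetical "text" field.
    static void buildQueries() {
        // field:(a a b) -- the repeated term contributes two identical clauses to the sum
        BooleanQuery repeated = new BooleanQuery();
        repeated.add(new TermQuery(new Term("text", "a")), BooleanClause.Occur.SHOULD);
        repeated.add(new TermQuery(new Term("text", "a")), BooleanClause.Occur.SHOULD);
        repeated.add(new TermQuery(new Term("text", "b")), BooleanClause.Occur.SHOULD);

        // field:(a^2 b) -- roughly the same emphasis expressed as an explicit boost
        BooleanQuery boosted = new BooleanQuery();
        TermQuery a = new TermQuery(new Term("text", "a"));
        a.setBoost(2.0f);
        boosted.add(a, BooleanClause.Occur.SHOULD);
        boosted.add(new TermQuery(new Term("text", "b")), BooleanClause.Occur.SHOULD);
    }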

So IDF does not really make sense in this context, and TF is already handled for you. You really don't need to do anything.

Bear in mind, though, that there are other scoring elements at play! The coord factor matters here:

  • a b a matches four of the query terms (a b a a), but not c or d
  • a b c matches five of the query terms (a b a c a), but not d

So that particular scoring element will score the second document more strongly.
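
For reference, under DefaultSimilarity the coord factor is simply the fraction of query clauses that matched, which you can check directly (a sketch, assuming the Lucene 4.x similarities API):

    import org.apache.lucene.search.similarities.DefaultSimilarity;

    static void showCoord() {
        DefaultSimilarity similarity = new DefaultSimilarity();
        float coordDoc1 = similarity.coord(4, 6);   // "a b a" matched 4 of 6 clauses -> 0.6666667
        float coordDoc2 = similarity.coord(5, 6);   // "a b c" matched 5 of 6 clauses -> 0.8333333
        System.out.println(coordDoc1 + " vs " + coordDoc2);
    }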


Here is the explain output for each document (see IndexSearcher.explain):

For document a b a:

    0.26880693 = (MATCH) product of:
      0.40321037 = (MATCH) sum of:
        0.10876686 = (MATCH) weight(text:a in 0) [DefaultSimilarity], result of:
          0.10876686 = score(doc=0,freq=2.0 = termFreq=2.0), product of:
            0.25872254 = queryWeight, product of:
              0.5945349 = idf(docFreq=2, maxDocs=2)
              0.435168 = queryNorm
            0.42039964 = fieldWeight in 0, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              0.5945349 = idf(docFreq=2, maxDocs=2)
              0.5 = fieldNorm(doc=0)
        0.07690979 = (MATCH) weight(text:b in 0) [DefaultSimilarity], result of:
          0.07690979 = score(doc=0,freq=1.0 = termFreq=1.0), product of:
            0.25872254 = queryWeight, product of:
              0.5945349 = idf(docFreq=2, maxDocs=2)
              0.435168 = queryNorm
            0.29726744 = fieldWeight in 0, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              0.5945349 = idf(docFreq=2, maxDocs=2)
              0.5 = fieldNorm(doc=0)
        0.10876686 = (MATCH) weight(text:a in 0) [DefaultSimilarity], result of:
          0.10876686 = score(doc=0,freq=2.0 = termFreq=2.0), product of:
            0.25872254 = queryWeight, product of:
              0.5945349 = idf(docFreq=2, maxDocs=2)
              0.435168 = queryNorm
            0.42039964 = fieldWeight in 0, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              0.5945349 = idf(docFreq=2, maxDocs=2)
              0.5 = fieldNorm(doc=0)
        0.10876686 = (MATCH) weight(text:a in 0) [DefaultSimilarity], result of:
          0.10876686 = score(doc=0,freq=2.0 = termFreq=2.0), product of:
            0.25872254 = queryWeight, product of:
              0.5945349 = idf(docFreq=2, maxDocs=2)
              0.435168 = queryNorm
            0.42039964 = fieldWeight in 0, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              0.5945349 = idf(docFreq=2, maxDocs=2)
              0.5 = fieldNorm(doc=0)
      0.6666667 = coord(4/6)

For document a b c:

    0.43768594 = (MATCH) product of:
      0.52522314 = (MATCH) sum of:
        0.07690979 = (MATCH) weight(text:a in 1) [DefaultSimilarity], result of:
          0.07690979 = score(doc=1,freq=1.0 = termFreq=1.0), product of:
            0.25872254 = queryWeight, product of:
              0.5945349 = idf(docFreq=2, maxDocs=2)
              0.435168 = queryNorm
            0.29726744 = fieldWeight in 1, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              0.5945349 = idf(docFreq=2, maxDocs=2)
              0.5 = fieldNorm(doc=1)
        0.07690979 = (MATCH) weight(text:b in 1) [DefaultSimilarity], result of:
          0.07690979 = score(doc=1,freq=1.0 = termFreq=1.0), product of:
            0.25872254 = queryWeight, product of:
              0.5945349 = idf(docFreq=2, maxDocs=2)
              0.435168 = queryNorm
            0.29726744 = fieldWeight in 1, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              0.5945349 = idf(docFreq=2, maxDocs=2)
              0.5 = fieldNorm(doc=1)
        0.07690979 = (MATCH) weight(text:a in 1) [DefaultSimilarity], result of:
          0.07690979 = score(doc=1,freq=1.0 = termFreq=1.0), product of:
            0.25872254 = queryWeight, product of:
              0.5945349 = idf(docFreq=2, maxDocs=2)
              0.435168 = queryNorm
            0.29726744 = fieldWeight in 1, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              0.5945349 = idf(docFreq=2, maxDocs=2)
              0.5 = fieldNorm(doc=1)
        0.217584 = (MATCH) weight(text:c in 1) [DefaultSimilarity], result of:
          0.217584 = score(doc=1,freq=1.0 = termFreq=1.0), product of:
            0.435168 = queryWeight, product of:
              1.0 = idf(docFreq=1, maxDocs=2)
              0.435168 = queryNorm
            0.5 = fieldWeight in 1, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              1.0 = idf(docFreq=1, maxDocs=2)
              0.5 = fieldNorm(doc=1)
        0.07690979 = (MATCH) weight(text:a in 1) [DefaultSimilarity], result of:
          0.07690979 = score(doc=1,freq=1.0 = termFreq=1.0), product of:
            0.25872254 = queryWeight, product of:
              0.5945349 = idf(docFreq=2, maxDocs=2)
              0.435168 = queryNorm
            0.29726744 = fieldWeight in 1, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              0.5945349 = idf(docFreq=2, maxDocs=2)
              0.5 = fieldNorm(doc=1)
      0.8333333 = coord(5/6)

Note that, as desired, the match on the term a is weighted more heavily in the first document, and you can also see each separate a in the query evaluated individually and added into the score.
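
If you want to reproduce output like the above, something along these lines should work (a sketch only; the searcher and query are assumed to already be set up):

    import java.io.IOException;

    import org.apache.lucene.search.Explanation;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;

    // Prints the full scoring breakdown for each hit via IndexSearcher.explain.
    static void printExplanations(IndexSearcher searcher, Query query) throws IOException {
        TopDocs hits = searcher.search(query, 10);
        for (ScoreDoc hit : hits.scoreDocs) {
            Explanation explanation = searcher.explain(query, hit.doc);
            System.out.println("doc " + hit.doc + ":");
            System.out.println(explanation);
        }
    }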

Also note the difference in coord, as well as the idf of the term "c" in the second doc. Those scoring effects are washing out the boost you get from adding multiple copies of the same term. If you added enough a's to the query, the two documents would eventually swap positions. A match on a is simply not being evaluated as a particularly meaningful result.