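When I run the requests below, the scores I get back are exactly the same as the term frequencies. I expected tf-idf scores like the ones shown in the last section of the documentation page. ...
How can I get the correct score for a given term?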
DELETE stack/
PUT stack/mydata/1
{
"body": "The sun in the sky is bright."
}
PUT stack/mydata/2
{
"body": "The sun in the sky is bright."
}
PUT stack/mydata/3
{
"body": "The sun in the sky is bright."
}
GET /stack/mydata/3/_termvectors?fields=body
{
"term_statistics" : false,
"field_statistics" : true,
"positions": false,
"offsets": false,
"filter" : {
"max_num_terms" : 8,
"min_term_freq" : 1,
"min_doc_freq" : 1
}
}
Update:
Even when the documents differ, the scores do not seem to change.
DELETE stack/
PUT stack/mydata/1
{
"body": "The sea is blue."
}
PUT stack/mydata/2
{
"body": "The sun in the sky is bright."
}
PUT stack/mydata/3
{
"body": "The sun is away and sun is powerful."
}
GET /stack/mydata/3/_termvectors?fields=body
{
"term_statistics" : false,
"field_statistics" : true,
"positions": false,
"offsets": false,
"filter" : {
"max_num_terms" : 8,
"min_term_freq" : 1,
"min_doc_freq" : 1
}
}
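For example, in Kibana I see a score of 0.258 for the term "powerful". I would like the termvectors API to return that score.
Update: I am trying to match the idf scores returned by the sklearn module.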
import nltk
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
train_set = ["sky is blue", "sun is bright", "sun in the sky is bright"]
stop_words = stopwords.words('english')
transformer = TfidfVectorizer(stop_words=stop_words)
transformer.fit_transform(train_set).todense()
The code above returns the following tf-idf values, which do not match the Elasticsearch values.
matrix([[0.79596054, 0. , 0.60534851, 0. ],
[0. , 0.70710678, 0. , 0.70710678],
[0. , 0.57735027, 0.57735027, 0.57735027]])
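For reference, here is a minimal sketch of how I understand these numbers to be produced, assuming the `TfidfVectorizer` defaults (`smooth_idf=True`, `norm='l2'`, raw term counts). The function name `tfidf_matrix` and the three-word stop list are my own stand-ins for illustration (the stop list replaces the NLTK one), and plain whitespace splitting replaces sklearn's tokenizer, which only behaves the same on simple input like this.

```python
import math


def tfidf_matrix(docs, stop_words):
    """Hypothetical helper: smoothed idf plus L2 row normalization."""
    # Lowercase, split on whitespace, and drop stop words.
    tokenized = [
        [tok for tok in doc.lower().split() if tok not in stop_words]
        for doc in docs
    ]
    # Columns are the vocabulary in alphabetical order.
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(docs)

    # Document frequency: number of documents containing each term.
    df = {term: sum(1 for doc in tokenized if term in doc) for term in vocab}
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

    rows = []
    for doc in tokenized:
        weights = [doc.count(term) * idf[term] for term in vocab]
        # Scale each row to unit Euclidean length (guard against empty rows).
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows


train_set = ["sky is blue", "sun is bright", "sun in the sky is bright"]
# Assumed stand-in for stopwords.words('english'); only these three matter here.
vocab, rows = tfidf_matrix(train_set, {"is", "in", "the"})

print(vocab)
for row in rows:
    print([round(value, 8) for value in row])
```

If this reading is right, every value in the matrix depends on the other terms in the same document because of the row normalization, so it may not be directly comparable to an unnormalized per-term score.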