IDF scores using the termvectors API

Time: 2018-08-08 04:50:02

Tags: elasticsearch scikit-learn nlp tf-idf tfidfvectorizer

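When I run the following code, the scores I get are exactly the same as the term frequencies. I expected tf-idf scores like the ones shown in the last section of the documentation page. ...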

https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html#docs-termvectors-terms-filtering

How do I get the correct score for a given term?

DELETE stack/

PUT stack/mydata/1
{
  "body": "The sun in the sky is bright."
}

PUT stack/mydata/2
{
  "body": "The sun in the sky is bright."
}

PUT stack/mydata/3
{
  "body": "The sun in the sky is bright."
}

GET /stack/mydata/3/_termvectors?fields=body
{
    "term_statistics" : false,
    "field_statistics" : true,
    "positions": false,
    "offsets": false,
    "filter" : {
      "max_num_terms" : 8,
      "min_term_freq" : 1,
      "min_doc_freq" : 1
    }
}
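
One possible explanation for scores that equal the term frequency, sketched under an assumption: if the filter scores each term as term frequency times a Lucene-classic-style idf, `1 + ln((N + 1) / (df + 1))`, then any term that occurs in every document the shard knows about gets an idf of exactly 1. The formula and the helper name `classic_tfidf` are assumptions for illustration, not something confirmed from the Elasticsearch source. The termvectors documentation does state that statistics are retrieved only from the shard on which the requested document resides.

```python
import math

def classic_tfidf(term_freq, doc_freq, num_docs):
    """Assumed score: tf * (1 + ln((N + 1) / (df + 1)))."""
    idf = 1.0 + math.log((num_docs + 1) / (doc_freq + 1))
    return term_freq * idf

# Three identical documents: every term has df == N == 3, so idf == 1.
same_docs = classic_tfidf(term_freq=1, doc_freq=3, num_docs=3)

# A shard holding a single document: df == N == 1, so idf == 1 again.
lone_shard = classic_tfidf(term_freq=2, doc_freq=1, num_docs=1)

# A term found in 1 of 3 documents on the same shard gets idf > 1.
rare_term = classic_tfidf(term_freq=1, doc_freq=1, num_docs=3)

print(same_docs, lone_shard, rare_term)
```

If that assumption holds, the scores would only diverge from the raw term frequencies once documents with different terms are visible to the same shard.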

Update:

Even with different documents, the scores do not seem to change.

DELETE stack/

PUT stack/mydata/1
{
  "body": "The sea is blue."
}

PUT stack/mydata/2
{
  "body": "The sun in the sky is bright."
}

PUT stack/mydata/3
{
  "body": "The sun is away and sun is powerful."
}

GET /stack/mydata/3/_termvectors?fields=body
{
    "term_statistics" : false,
    "field_statistics" : true,
    "positions": false,
    "offsets": false,
    "filter" : {
      "max_num_terms" : 8,
      "min_term_freq" : 1,
      "min_doc_freq" : 1
    }
}
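
To see which document frequencies the API is actually working with, `term_statistics` can be set to `true`, which the termvectors documentation describes as adding a per-term `doc_freq` and `ttf` to the response. The following is a minimal sketch that only builds the request. The helper name `termvectors_request` is hypothetical, and sending the request is left to whichever HTTP client is in use.

```python
import json

def termvectors_request(index, doc_type, doc_id, field):
    """Return (path, body) for a termvectors call with statistics enabled."""
    path = f"/{index}/{doc_type}/{doc_id}/_termvectors?fields={field}"
    body = {
        "term_statistics": True,   # adds doc_freq and ttf for each term
        "field_statistics": True,  # adds doc_count, sum_doc_freq, sum_ttf
        "positions": False,
        "offsets": False,
        "filter": {"max_num_terms": 8, "min_term_freq": 1, "min_doc_freq": 1},
    }
    return path, body

path, body = termvectors_request("stack", "mydata", 3, "body")
print("GET", path)
print(json.dumps(body, indent=2))
```

Comparing the returned `doc_freq` values with the number of indexed documents should show whether the statistics cover all three documents or only part of the index.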

For example, in Kibana I see a score of 0.258 for the term "powerful". I would like this API (termvectors) to return that score.


Update: I am trying to match the idf scores returned by the sklearn module.

import nltk
import string

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords

train_set = ["sky is blue", "sun is bright", "sun in the sky is bright"]
stop_words = stopwords.words('english')

transformer = TfidfVectorizer(stop_words=stop_words)

transformer.fit_transform(train_set).todense()

The code above returns tf-idf values that do not match the Elasticsearch values.

matrix([[0.79596054, 0.        , 0.60534851, 0.        ],
        [0.        , 0.70710678, 0.        , 0.70710678],
        [0.        , 0.57735027, 0.57735027, 0.57735027]])
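
For comparison, the matrix above can be reproduced by hand. With its default settings (`smooth_idf=True`, `norm='l2'`), `TfidfVectorizer` computes idf as `ln((1 + n) / (1 + df)) + 1` and then scales each row to unit length. The sketch below assumes the English stop-word list removes "is", "in", and "the", which leaves four terms; the variable names are illustrative.

```python
import math

# Documents after stop-word removal (assumed: "is", "in", "the" dropped).
docs = [["sky", "blue"], ["sun", "bright"], ["sun", "sky", "bright"]]
vocab = sorted({t for d in docs for t in d})  # ['blue', 'bright', 'sky', 'sun']
n = len(docs)

# Smoothed idf: ln((1 + n) / (1 + df)) + 1
df = {t: sum(t in d for d in docs) for t in vocab}
idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}

rows = []
for d in docs:
    raw = [d.count(t) * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw))  # L2 length of the row
    rows.append([x / norm for x in raw])

for row in rows:
    print([round(x, 8) for x in row])
```

Because of the row normalization, these values are not raw tf times idf, so they would not be expected to equal an unnormalized per-term score even when the same idf formula is used.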

0 answers:

No answers yet