Computing IDF with TfidfVectorizer from sklearn.feature_extraction.text

Time: 2016-04-20 22:28:04

Tags: python scikit-learn

I think TfidfVectorizer is not computing the IDF factor correctly. For example, copying the code from "tf-idf feature weights using sklearn.feature_extraction.text.TfidfVectorizer":

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["This is very strange",
          "This is very nice"]
vectorizer = TfidfVectorizer(
    use_idf=True,             # use idf as the weight, computing tf*idf
    norm=None,                # vector normalization (None disables it)
    smooth_idf=False,         # if True, adds 1 to N and to ni => idf = ln((N+1) / (ni+1))
    sublinear_tf=False,       # if True, tf = 1 + ln(tf)
    binary=False,
    min_df=1, max_df=1.0, max_features=None,
    strip_accents='unicode',  # strip accents
    ngram_range=(1, 1), preprocessor=None, stop_words=None,
    tokenizer=None, vocabulary=None
)
X = vectorizer.fit_transform(corpus)
idf = vectorizer.idf_
# Newer scikit-learn releases replace get_feature_names() with get_feature_names_out().
print(dict(zip(vectorizer.get_feature_names(), idf)))

The output is:

{u'is': 1.0,
 u'nice': 1.6931471805599454,
 u'strange': 1.6931471805599454,
 u'this': 1.0,
 u'very': 1.0}

But it should be:

{u'is': 0.0,
 u'nice': 0.6931471805599454,
 u'strange': 0.6931471805599454,
 u'this': 0.0,
 u'very': 0.0}

Shouldn't it? What am I doing wrong?

According to http://www.tfidf.com/, IDF is computed as:

IDF(t) = log_e(Total number of documents / Number of documents with term t in it)

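So, since the terms 'this', 'is' and 'very' appear in both sentences, their IDF = log_e(2/2) = 0.

The words 'strange' and 'nice' each appear in only one of the two documents, so their IDF = log_e(2/1) = 0.69314.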
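The textbook calculation described above can be checked by hand. The following is a minimal sketch using only the standard library; the lowercase whitespace split is an assumed stand-in for scikit-learn's tokenizer, and the helper name textbook_idf is made up for illustration.

```python
import math

corpus = ["This is very strange",
          "This is very nice"]


def textbook_idf(docs):
    """Return {term: ln(N / df)}, where df is the number of documents containing the term."""
    n_docs = len(docs)
    # Naive tokenization (an assumption, not scikit-learn's tokenizer):
    # lowercase each document and split it on whitespace.
    doc_terms = [set(doc.lower().split()) for doc in docs]
    vocabulary = sorted(set().union(*doc_terms))
    return {
        term: math.log(float(n_docs) / sum(term in terms for terms in doc_terms))
        for term in vocabulary
    }


idf = textbook_idf(corpus)
for term in sorted(idf):
    print(term, round(idf[term], 5))
```

Under this definition, terms that occur in every document get an IDF of exactly zero, which is the result the question expects for 'this', 'is' and 'very'.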

2 Answers:

Answer 0: (score: 5)

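There are two things you might not have noticed in sklearn's implementation:

  1. TfidfTransformer has smooth_idf=True as its default parameter
  2. It always adds 1 to the weight

So, with the default settings, it is using:

    idf = log((1 + samples) / (1 + documents)) + 1

Here is the source:

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py#L987-L992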
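To see how the two points above produce the numbers in the question, here is a small sketch that redoes the arithmetic by hand in plain Python, without scikit-learn. The helper name sklearn_style_idf is made up for illustration. Note that the question passes smooth_idf=False, so only the trailing + 1 applies there.

```python
import math


def sklearn_style_idf(n_samples, df, smooth_idf=True):
    """IDF with optional add-one smoothing, followed by log and a constant + 1."""
    # Smoothing adds one to both the document count and the document frequency.
    n_samples += int(smooth_idf)
    df += int(smooth_idf)
    return math.log(float(n_samples) / df) + 1.0


# Two documents: 'nice' occurs in one of them, 'this' in both.
unsmoothed_nice = sklearn_style_idf(2, 1, smooth_idf=False)  # ln(2/1) + 1
unsmoothed_this = sklearn_style_idf(2, 2, smooth_idf=False)  # ln(2/2) + 1
smoothed_nice = sklearn_style_idf(2, 1, smooth_idf=True)     # ln(3/2) + 1

print(unsmoothed_nice, unsmoothed_this, smoothed_nice)
```

The unsmoothed values correspond to the 1.693... and 1.0 reported in the question, so the gap from the expected 0.693... and 0.0 is exactly the constant 1.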

Edit: You can subclass the standard TfidfTransformer class like this (it relies on private scikit-learn internals such as _document_frequency and _idf_diag, so it may not work unchanged in other releases):

    import numpy as np
    import scipy.sparse as sp
    from sklearn.feature_extraction.text import (TfidfTransformer,
                                                 _document_frequency)


    class PriscillasTfidfTransformer(TfidfTransformer):

        def fit(self, X, y=None):
            """Learn the idf vector (global term weights)

            Parameters
            ----------
            X : sparse matrix, [n_samples, n_features]
                a matrix of term/token counts
            """
            if not sp.issparse(X):
                X = sp.csc_matrix(X)
            if self.use_idf:
                n_samples, n_features = X.shape
                df = _document_frequency(X)

                # perform idf smoothing if required
                df += int(self.smooth_idf)
                n_samples += int(self.smooth_idf)

                # log+1 instead of log makes sure terms with zero idf don't get
                # suppressed entirely.
                ####### + 1 is commented out ##########################
                idf = np.log(float(n_samples) / df) #+ 1.0
                #######################################################
                self._idf_diag = sp.spdiags(idf,
                                            diags=0, m=n_features, n=n_features)

            return self
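If you would rather not depend on private helpers, an alternative sketch is to fit with smooth_idf=False and subtract the constant 1 from the public idf_ attribute afterwards. This is a different route from the subclass, not what it does internally, and the variable names are only illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["This is very strange",
          "This is very nice"]

# With smooth_idf=False, idf_ holds ln(N / df) + 1,
# so subtracting 1 recovers the textbook ln(N / df).
vectorizer = TfidfVectorizer(smooth_idf=False, norm=None)
vectorizer.fit(corpus)
textbook_idf = vectorizer.idf_ - 1.0

# vocabulary_ maps each term to its column index in idf_.
idf_by_term = {term: float(textbook_idf[index])
               for term, index in vectorizer.vocabulary_.items()}
print(idf_by_term)
```

Keep in mind that with the textbook definition, terms occurring in every document get weight zero, which is what the + 1 in scikit-learn is there to avoid.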
    

Answer 1: (score: 1)

The actual formula they use when computing idf (when smooth_idf is True) is

idf = log( (1 + samples)/(documents + 1)) + 1

It comes from the source code, but I think the web documentation is a bit ambiguous.

https://github.com/scikit-learn/scikit-learn/blob/14031f6/sklearn/feature_extraction/text.py#L966-L969