将新文本添加到Sklearn TFIDIF Vectorizer(Python)

时间:2016-08-23 20:00:40

标签: python scikit-learn tf-idf

是否有添加到现有语料库的功能?我已经生成了我的矩阵,我希望定期添加到表中而不需要重新整理整个sha-bang

e.g;

articleList = ['here is some text blah blah','another text object', 'more foo for your bar right now']
tfidf_vectorizer = TfidfVectorizer(
                        max_df=.8,
                        max_features=2000,
                        min_df=.05,
                        preprocessor=prep_text,
                        use_idf=True,
                        tokenizer=tokenize_text
                    )
tfidf_matrix = tfidf_vectorizer.fit_transform(articleList)

#### ADDING A NEW ARTICLE TO EXISTING SET?
bigger_tfidf_matrix = tfidf_vectorizer.fit_transform(['the last article I wanted to add'])

1 个答案:

答案 0 :(得分:10)

您可以直接访问矢量化工具的vocabulary_属性,并且可以通过idf_访问_tfidf._idf_diag向量,因此可以对此进行修补:< / p>

import re 
import numpy as np
from scipy.sparse.dia import dia_matrix
from sklearn.feature_extraction.text import TfidfVectorizer

def partial_fit(self, X):
    max_idx = max(self.vocabulary_.values())
    for a in X:
        #update vocabulary_
        if self.lowercase: a = a.lower()
        tokens = re.findall(self.token_pattern, a)
        for w in tokens:
            if w not in self.vocabulary_:
                max_idx += 1
                self.vocabulary_[w] = max_idx

        #update idf_
        df = (self.n_docs + self.smooth_idf)/np.exp(self.idf_ - 1) - self.smooth_idf
        self.n_docs += 1
        df.resize(len(self.vocabulary_))
        for w in tokens:
            df[self.vocabulary_[w]] += 1
        idf = np.log((self.n_docs + self.smooth_idf)/(df + self.smooth_idf)) + 1
        self._tfidf._idf_diag = dia_matrix((idf, 0), shape=(len(idf), len(idf)))

TfidfVectorizer.partial_fit = partial_fit
articleList = ['here is some text blah blah','another text object', 'more foo for your bar right now']
vec = TfidfVectorizer()
vec.fit(articleList)
vec.n_docs = len(articleList)
vec.partial_fit(['the last text I wanted to add'])
vec.transform(['the last text I wanted to add']).toarray()

# array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
#          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
#          0.        ,  0.        ,  0.27448674,  0.        ,  0.43003652,
#          0.43003652,  0.43003652,  0.43003652,  0.43003652]])