Using self. in a class with Gensim TfIDF

Time: 2017-12-19 08:16:31

Tags: python class gensim tf-idf

I have a class, TfidfRecommendations, with several methods and inputs. Some of the inputs are the trained objects produced by Gensim's TfIDF model (the function train_tfidf below):

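import gensim
from gensim import models, corpora, similarities

def train_tfidf(data):
    # Map each token to an integer id
    dictionary = corpora.Dictionary(data)
    # Convert each document to a bag-of-words vector
    corpus = [dictionary.doc2bow(doc) for doc in data]
    # Fit the TF-IDF model on the bag-of-words corpus
    tfidf = models.TfidfModel(corpus)
    # Build a cosine-similarity index over the TF-IDF-weighted corpus
    index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=len(dictionary))
    # Return the objects created above, in the order the caller unpacks them
    return tfidf, dictionary, index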

where data is a pandas DataFrame of text documents (one document per row).
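
As a rough illustration of what the objects returned by train_tfidf represent, here is a plain-Python sketch that builds a term-to-id mapping and one sparse TF-IDF vector per document. It assumes gensim's default weighting (raw term count times log2(N/df), followed by L2 normalization). The name sketch_tfidf and the toy documents are hypothetical and are not part of gensim or of the code above.

```python
import math
from collections import Counter


def sketch_tfidf(docs):
    """Return (vocab, vectors) for a list of tokenized documents.

    Hypothetical helper for illustration only; it assumes the weighting
    count * log2(N / df) followed by L2 normalization.
    """
    # Assign an integer id to each distinct token, in order of first appearance.
    vocab = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))

    n_docs = len(docs)
    # Document frequency: in how many documents each token occurs.
    df = Counter(tok for doc in docs for tok in set(doc))

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        weights = {
            vocab[tok]: count * math.log2(n_docs / df[tok])
            for tok, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            # Drop zero weights and scale the rest to unit length.
            vec = sorted((i, w / norm) for i, w in weights.items() if w > 0)
        else:
            vec = []
        vectors.append(vec)
    return vocab, vectors


docs = [["cat", "sat", "mat"], ["dog", "sat", "log"], ["cat", "dog"]]
vocab, vectors = sketch_tfidf(docs)
```

Each entry of vectors is a list of (term id, weight) tuples, which is the same sparse format a gensim TfidfModel produces when applied to a bag-of-words document.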

I use the output of train_tfidf above as the input to TfidfRecommendations:

class TfidfRecommendations:

    def __init__(self, data, tfidf_dictionary, tfidf_model, tfidf_index):
        self.data               = data
        self.tfidf_dictionary   = tfidf_dictionary
        self.tfidf_model        = tfidf_model
        self.tfidf_index        = tfidf_index

    ...

    def get_sims(self, query):
        # query is a list of strings to be compared to the corpus data
        vec_bow = self.tfidf_dictionary.doc2bow(query)
        sims = self.tfidf_index[self.tfidf_model[vec_bow]]
        return sims

The problem with TfidfRecommendations is that get_sims returns sims as a list of tuples, which is incorrect:

tfidf_model, tfidf_dictionary, tfidf_index = train_tfidf(data)
TFIDF = TfidfRecommendations(data, tfidf_dictionary, tfidf_model, tfidf_index)
sims = TFIDF.get_sims(query_text) # query_text is a list of string tokens
print(sims)
>>>[(4, 0.004360197614450217),
   (19, 0.044387503503385946),
   (46, 0.10344463256852278),
   (82, 0.01845695743910715),
   (125, 0.024611722270581393),
   (133, 0.045794061264144204)]

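It should return a numpy array of length len(data), where each entry is the cosine similarity between query_text and one row of data. If get_sims is a standalone function outside the class TfidfRecommendations, it works as expected:
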
def get_sims(query, tfidf_dictionary, tfidf_index, tfidf_model):
    # query is a list of strings to be compared
    # to the corpus data
    vec_bow = tfidf_dictionary.doc2bow(query)
    sims = tfidf_index[tfidf_model[vec_bow]]
    return sims

get_sims(query, tfidf_dictionary, tfidf_index, tfidf_model)
>>> array([ 0.00123292,  0.0080641 ,  0.00420302, ...,  0.        ,
    0.0101376 ,  0.00987199], dtype=float32)
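
To make the expected output concrete: the result should contain one similarity per document, zeros included. Below is a minimal plain-Python sketch of that computation on sparse (id, weight) vectors. It makes no claim about gensim's internals, and cosine_sims and the toy vectors are hypothetical names used for illustration only.

```python
import math


def cosine_sims(query_vec, doc_vecs):
    """Cosine similarity of one sparse vector against each sparse document vector.

    Hypothetical helper: vectors are lists of (id, weight) tuples.
    """
    def norm(vec):
        return math.sqrt(sum(w * w for _, w in vec))

    query = dict(query_vec)
    query_norm = norm(query_vec)

    sims = []
    for doc in doc_vecs:
        # Dot product over the ids that the two vectors share.
        dot = sum(w * query.get(i, 0.0) for i, w in doc)
        denom = query_norm * norm(doc)
        # Documents with no overlap (or empty vectors) get similarity 0.0.
        sims.append(dot / denom if denom else 0.0)
    return sims


# Toy data: three documents and a query that only contains term id 0.
doc_vecs = [[(0, 0.6), (1, 0.8)], [(2, 1.0)], [(0, 1.0)]]
query_vec = [(0, 1.0)]
sims = cosine_sims(query_vec, doc_vecs)
```

Here sims has len(doc_vecs) entries, which matches the dense shape returned by the standalone get_sims. A list of (id, weight) tuples, by contrast, is the sparse-vector format.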

What is going wrong here? Why can't the gensim model objects be used with self. inside a class? Any help would be greatly appreciated.

0 answers:

No answers yet