I have a class TfidfRecommendations with several methods and inputs. Some of the inputs are trained model objects from Gensim's TF-IDF model (produced by the function train_tfidf below):
import gensim
from gensim import models, corpora, similarities
def train_tfidf(data):
    dictionary = corpora.Dictionary(data)
    corpus = [dictionary.doc2bow(doc) for doc in data]
    tfidf = models.TfidfModel(corpus)
    index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=len(dictionary))
    # Return order: model, dictionary, index
    return tfidf, dictionary, index
where data is a pandas DataFrame of text documents (one document per row).
I use the output of train_tfidf above as the input to TfidfRecommendations:
class TfidfRecommendations:
    def __init__(self, data, tfidf_dictionary, tfidf_model, tfidf_index):
        self.data = data
        self.tfidf_dictionary = tfidf_dictionary
        self.tfidf_model = tfidf_model
        self.tfidf_index = tfidf_index
    ...
    def get_sims(self, query):
        # query is a list of strings to be compared to the corpus data
        vec_bow = self.tfidf_dictionary.doc2bow(query)
        sims = self.tfidf_index[self.tfidf_model[vec_bow]]
        return sims
The problem with the class TfidfRecommendations is that get_sims returns sims as a list of tuples, which is not what I expect:
tfidf_model, tfidf_dictionary, tfidf_index = train_tfidf(data)
TFIDF = TfidfRecommendations(data, tfidf_dictionary, tfidf_model, tfidf_index)
sims = TFIDF.get_sims(query_text) # query_text is a list of string tokens
print(sims)
>>>[(4, 0.004360197614450217),
(19, 0.044387503503385946),
(46, 0.10344463256852278),
(82, 0.01845695743910715),
(125, 0.024611722270581393),
(133, 0.045794061264144204)]
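To make the mismatch concrete: a list of (index, value) pairs like the one above is a sparse representation, whereas the expected result is dense. Below is a minimal sketch, assuming each pair means (row index, similarity) and that rows not listed are 0.0; the printed output alone does not confirm that interpretation. to_dense is a hypothetical helper, not part of gensim, and a plain Python list stands in for the numpy array.

```python
def to_dense(pairs, length):
    """Expand sparse (index, value) pairs into a dense list of floats."""
    dense = [0.0] * length
    for idx, value in pairs:
        if not 0 <= idx < length:
            raise IndexError(f"index {idx} out of range for length {length}")
        dense[idx] = float(value)
    return dense


# Assumption: each pair is (row index, similarity); rows not listed are 0.0.
sparse_sims = [(4, 0.004360197614450217), (19, 0.044387503503385946)]
dense_sims = to_dense(sparse_sims, length=25)
print(len(dense_sims), dense_sims[4], dense_sims[19])
```

This only illustrates the shape difference between the two outputs; it does not explain why the method produces the sparse form.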
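get_sims should return a numpy array of length len(data), where each entry is the cosine similarity between query_text and the corresponding row of data. If get_sims is instead defined as a standalone function outside the class TfidfRecommendations, it returns the expected array: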
def get_sims(query, tfidf_dictionary, tfidf_index, tfidf_model):
    # query is a list of strings to be compared
    # to the corpus data
    vec_bow = tfidf_dictionary.doc2bow(query)
    sims = tfidf_index[tfidf_model[vec_bow]]
    return sims
get_sims(query, tfidf_dictionary, tfidf_index, tfidf_model)
>>> array([ 0.00123292, 0.0080641 , 0.00420302, ..., 0. ,
0.0101376 , 0.00987199], dtype=float32)
What is going wrong here? Why can't the gensim model objects be used with self. inside a class? Any help would be greatly appreciated.
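As a sanity check, here is a minimal sketch with stand-in objects, assuming gensim objects behave like any other Python object when stored on self. StubModel, StubIndex, and Recommender are hypothetical and only mimic the obj[key] access pattern used in get_sims; no gensim is involved.

```python
class StubModel:
    """Stand-in for a model: indexing doubles every weight in a BoW vector."""

    def __getitem__(self, bow):
        return [(term_id, 2.0 * weight) for term_id, weight in bow]


class StubIndex:
    """Stand-in for an index: indexing returns one score per 'document'."""

    def __init__(self, num_docs):
        self.num_docs = num_docs

    def __getitem__(self, vec):
        total = sum(weight for _, weight in vec)
        return [total * (doc_id + 1) for doc_id in range(self.num_docs)]


def get_sims(bow, index, model):
    # Standalone version: same expression as in the post.
    return index[model[bow]]


class Recommender:
    def __init__(self, model, index):
        self.model = model
        self.index = index

    def get_sims(self, bow):
        # Method version: identical expression, objects reached via self.
        return self.index[self.model[bow]]


model, index = StubModel(), StubIndex(num_docs=3)
bow = [(0, 1.0), (2, 0.5)]

as_function = get_sims(bow, index, model)
as_method = Recommender(model, index).get_sims(bow)
print(as_function == as_method)
```

With these stubs the method and the standalone function agree, which suggests that storing objects on self is not by itself the cause. The difference in the real code may instead come from how the objects are constructed or passed in, so comparing type(self.tfidf_index) and type(self.tfidf_model) against the objects used in the standalone call could be a reasonable next step.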