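I have two types of documents: one set is labeled and the other is not. I want to use the labeled documents to compute tf and the unlabeled documents to compute idf.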
from gensim import corpora, models, similarities
dictionary = corpora.Dictionary(line.lower().split(',') for line in open('data/unannotated.csv',encoding='utf-8'))
dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=100000, keep_tokens=None)
Here `unannotated.csv` holds the unlabeled data, with one document per line. From this data I got a dictionary:
{'get': 353867,
'http': 351618,
'u': 324711,
'one': 291526,
'go': 279237,
'know': 265001,
'good': 249368,
'say': 236003,
'like': 225619,
'qatar': 223010,
'time': 191195,
'would': 188766,
'think': 187128,
'people': 182873,
'make': 180120,
'need': 166287,
'take': 157106,
...
}
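For reference, idf needs two things from the unlabeled data: the number of documents that contain each word, and the total number of documents. The post does not say whether the counts printed above are document frequencies or raw token counts, so the following is only a minimal sketch of one way to obtain both in plain Python. The function name `document_frequencies` and the sample lines are hypothetical, and stripping whitespace from tokens is an added assumption that the original `split(',')` call does not make.

```python
from collections import Counter

def document_frequencies(lines):
    """Return (doc_freq, n_docs) for comma-separated documents, one per line."""
    doc_freq = Counter()
    n_docs = 0
    for line in lines:
        # A set, so each word is counted at most once per document.
        tokens = set(t.strip() for t in line.lower().split(",") if t.strip())
        if not tokens:
            continue  # skip blank lines
        doc_freq.update(tokens)
        n_docs += 1
    return dict(doc_freq), n_docs

# Toy lines standing in for data/unannotated.csv (made up for illustration).
sample = ["get,visa,qatar", "Qatar,car,rent", "get,get,car"]
doc_freq, n_docs = document_frequencies(sample)
```

Note that `n_docs` is the document count, which is generally different from the vocabulary size `len(doc_freq)`.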
I use the following code to compute tf-idf, but it returns the same value for every pair: 0.9999999999999946
import math
from collections import Counter
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

def TFCalculator(question, word):
    # Term frequency: occurrences of word divided by the question length.
    if word not in question:
        return 0
    count = dict(Counter(question))
    q_len = len(question)
    return float(count[word]) / float(q_len)

def n_containing(unannotated, word):
    # Count stored for word in the unannotated dictionary (0 if absent).
    return float(unannotated.get(word,0))

def IDFCalculator(unannotated, word):
    # Uses the number of dictionary keys as the numerator.
    return math.log(float(len(unannotated.keys())) / (1.0 + n_containing(unannotated, word)))

def tfidf(stem, unannotated):
    # One slot per dictionary entry.
    tfidfVector_ques = range(len(unannotated))
    for word in stem:
        tf= TFCalculator(stem,word)
        idf = IDFCalculator(unannotated, word)
        tfidf = tf * idf
        try:
            tfidfVector_ques[list(unannotated).index(word)] = tfidf
        except:
            pass
    return sparse.csr_matrix(tfidfVector_ques)

def cosine_similarities(question,comment,unannotated):
    # Cosine similarity between the tf-idf vectors of a question and a comment.
    X_tfidf = tfidf(question,unannotated)
    Y_tfidf = tfidf(comment,unannotated)
    similarities = cosine_similarity(X_tfidf,Y_tfidf)
    return similarities
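
One plausible explanation for the constant result: `range(len(unannotated))` does not create a zero vector. It holds the values 0, 1, 2, ..., so every tf-idf vector starts from the same ramp, and the bare `except` silently skips any assignment that fails. Two vectors that share almost all of their entries have a cosine similarity of about 1. The sketch below reproduces the effect with the standard library only. The names `tfidf_vector`, `cosine`, and the `init` parameter are hypothetical, the document frequencies are made up, and `n_docs` (the number of unlabeled documents) is an assumed input that the post does not show.

```python
import math
from collections import Counter

def tfidf_vector(tokens, doc_freq, n_docs, init="zeros"):
    """Dense tf-idf vector over the vocabulary in doc_freq.

    doc_freq maps word -> number of unlabeled documents containing it;
    n_docs is the total number of unlabeled documents (assumed input).
    """
    index = {word: i for i, word in enumerate(doc_freq)}
    # init="range" mimics starting from 0, 1, 2, ... instead of zeros.
    vec = list(range(len(index))) if init == "range" else [0.0] * len(index)
    for word, count in Counter(tokens).items():
        if word in index:
            tf = count / len(tokens)
            idf = math.log(n_docs / (1.0 + doc_freq[word]))
            vec[index[word]] = tf * idf
    return vec

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy document frequencies, made up for illustration.
doc_freq = {"get": 40, "http": 35, "qatar": 10, "visa": 5, "car": 8, "rent": 4}
# Filler words so the vocabulary is large relative to the documents.
doc_freq.update({"w%d" % i: 1 for i in range(1000)})
n_docs = 100

question = ["qatar", "visa", "get"]
comment = ["car", "rent"]  # shares no word with the question

zero_init = cosine(tfidf_vector(question, doc_freq, n_docs),
                   tfidf_vector(comment, doc_freq, n_docs))
range_init = cosine(tfidf_vector(question, doc_freq, n_docs, init="range"),
                    tfidf_vector(comment, doc_freq, n_docs, init="range"))
print(zero_init, range_init)
```

With zero initialization the two documents, which share no words, score 0. With the ramp initialization they score very close to 1 regardless of content. The sketch also passes the document count to the idf formula rather than the number of dictionary keys, which is a separate change from the original code.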