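I found a Python tutorial online for computing tf-idf and cosine similarity. I have been experimenting with it and modifying it.
The problem is that I get strange results that make almost no sense.
For example, I use 3 documents: [doc1, doc2, doc3].
doc1 and doc2 are similar, and doc3 is completely different.
The results are as follows: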
[[ 0.00000000e+00 2.20351188e-01 9.04357868e-01]
[ 2.20351188e-01 -2.22044605e-16 8.82546765e-01]
[ 9.04357868e-01 8.82546765e-01 -2.22044605e-16]]
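First, I thought the numbers on the main diagonal should be 1, not 0. Second, the similarity score for doc1 and doc2 is about 0.22, while the score for doc1 and doc3 is about 0.90. I expected the opposite. Could you check my code and help me understand why I get these results?
doc1, doc2 and doc3 are tokenized texts.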
import math
import numpy
import nltk

articles = [doc1, doc2, doc3]

# Flat list of every token from all documents
corpus = []
for article in articles:
    for word in article:
        corpus.append(word)

def freq(word, article):
    return article.count(word)

def wordCount(article):
    return len(article)

def numDocsContaining(word, articles):
    count = 0
    for article in articles:
        if word in article:
            count += 1
    return count

def tf(word, article):
    return (freq(word, article) / float(wordCount(article)))

def idf(word, articles):
    return math.log(len(articles) / (1 + numDocsContaining(word, articles)))

def tfidf(word, document, documentList):
    return (tf(word, document) * idf(word, documentList))

# One feature vector per document, with one entry per token in corpus
feature_vectors = []
for article in articles:
    vec = []
    for word in corpus:
        if word in article:
            vec.append(tfidf(word, article, corpus))
        else:
            vec.append(0)
    feature_vectors.append(vec)

# Pairwise comparison of the feature vectors
n = len(articles)
mat = numpy.empty((n, n))
for i in xrange(0, n):
    for j in xrange(0, n):
        mat[i][j] = nltk.cluster.util.cosine_distance(feature_vectors[i], feature_vectors[j])
print mat
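A note on reading the matrix above: `nltk.cluster.util.cosine_distance` returns 1 minus the cosine similarity, so identical vectors score 0 and dissimilar ones approach 1. The sketch below, written for Python 3 with only the standard library, re-implements both measures (the helper names are illustrative, not NLTK's API) and converts the printed values into similarities.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Distance form: 0 for vectors pointing the same way, 1 for orthogonal ones."""
    return 1.0 - cosine_similarity(u, v)

# Distance matrix as printed in the question (diagonal rounded to 0).
distances = [
    [0.0, 0.220351188, 0.904357868],
    [0.220351188, 0.0, 0.882546765],
    [0.904357868, 0.882546765, 0.0],
]

# Converting each entry to a similarity puts 1 on the diagonal.
similarities = [[1.0 - d for d in row] for row in distances]

for row in similarities:
    print(["%.3f" % s for s in row])
```

Read this way, the diagonal becomes 1 and doc1/doc2 come out at roughly 0.78 against roughly 0.10 for doc1/doc3, which is the expected ordering.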
Answer 0 (score: 1)
If you are able to try another package, such as sklearn, give it a try.
This code might help:
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def read_doc(path):
    # Read a document as UTF-8, ignoring bytes that cannot be decoded
    with open(path, encoding="utf-8", errors="ignore") as f:
        return f.read()

doc1 = read_doc("/root/Myfolder/scoringDocuments/doc1")
doc2 = read_doc("/root/Myfolder/scoringDocuments/doc2")
doc3 = read_doc("/root/Myfolder/scoringDocuments/doc3")

train_set = [doc1, doc2, doc3]
test_set = ["age salman khan wife"]  # Query

stopWords = stopwords.words('english')
tfidf_vectorizer = TfidfVectorizer(stop_words=stopWords)

# Learn the vocabulary and idf weights from the documents (tf-idf scores with normalization)
tfidf_matrix_train = tfidf_vectorizer.fit_transform(train_set)
print(tfidf_vectorizer.vocabulary_)

# Project the query onto the vocabulary learned from the documents
tfidf_matrix_test = tfidf_vectorizer.transform(test_set)

print('Fit Vectorizer to train set', tfidf_matrix_train.todense())
print('Transform Vectorizer to test set', tfidf_matrix_test.todense())
print("\n\ncosine similarity between the query and each document ==> ", cosine_similarity(tfidf_matrix_test, tfidf_matrix_train))
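If adding scikit-learn is not an option, the same idea can be sketched in plain Python 3. In the question, `tfidf(word, article, corpus)` passes the flat word list as the document list, so `numDocsContaining` iterates over individual words rather than documents. The sketch below instead computes idf over the list of documents and reports similarity rather than distance. The smoothed idf formula, the de-duplicated vocabulary, the function names, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return the sorted vocabulary and one tf-idf vector per tokenized document."""
    vocabulary = sorted({word for doc in documents for word in doc})
    n_docs = len(documents)
    # Document frequency: number of documents that contain each word.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    # Smoothed idf (an assumption of this sketch); it stays positive even for
    # words that occur in every document.
    idf = {w: math.log((1.0 + n_docs) / (1.0 + doc_freq[w])) + 1.0
           for w in vocabulary}
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append([counts[w] / len(doc) * idf[w] for w in vocabulary])
    return vocabulary, vectors

def cosine_similarity(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical toy documents, already tokenized.
docs = [
    ["cat", "sat", "on", "the", "mat"],
    ["cat", "lay", "on", "the", "mat"],
    ["stock", "prices", "fell", "sharply", "today"],
]

vocabulary, vectors = tfidf_vectors(docs)
similarity = [[cosine_similarity(u, v) for v in vectors] for u in vectors]
for row in similarity:
    print(["%.3f" % s for s in row])
```

With these toy documents the first two share most of their words and score high against each other, while the third shares none and scores 0 against both.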