I am trying to find document similarity across a large set of articles (460 files, each containing 4000 lines), but computing the cosine similarity takes a very long time.
I cannot use the sklearn or scipy Python libraries, so I implemented a raw TF-IDF vectorizer and cosine similarity myself. The vectorizer gives me a list of lists.
The matrix looks like this:
[[0.0, 0.0, ..., 0.35480, 0.0, 0.0], [0.0, ...]]
My code:
import math

def computeTFIDFVector(document):
    # Dense TF-IDF vector over the global vocabulary wordDict
    tfidfVector = [0.0] * len(wordDict)
    for i, word in enumerate(wordDict):
        if word in document:
            tfidfVector[i] = document[word]
    return tfidfVector

def cosine_similarity(vector1, vector2):
    dot_product = sum(p * q for p, q in zip(vector1, vector2))
    magnitude = math.sqrt(sum(val ** 2 for val in vector1)) * math.sqrt(sum(val ** 2 for val in vector2))
    if not magnitude:
        return 0
    return dot_product / magnitude
duplicates = []
count = 0
for i in range(len(tfidfVector)):
    for j in range(i + 1, len(tfidfVector)):
        count = count + 1
        clear_output()
        print(count)
        similarity = cosine_similarity(tfidfVector[i], tfidfVector[j])
        duplicates.append((i, j, similarity))
Now, the results are as expected, but the computation takes forever. Any suggestions on how to make it faster?
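Not part of the original question, but one common pure-Python speedup under these constraints is to store each TF-IDF vector sparsely (as a word-to-weight dict), precompute each document's norm once instead of once per pair, and take the dot product only over the nonzero entries of the smaller vector. A minimal sketch, assuming dict-based vectors; the function names and the tiny example data are illustrative, not from the original code:

```python
import math

def precompute_norms(vectors):
    # Each document's Euclidean norm is computed once, not once per pair
    return [math.sqrt(sum(w * w for w in v.values())) for v in vectors]

def sparse_cosine(vec1, norm1, vec2, norm2):
    # Iterate only over the smaller vector's nonzero terms;
    # zero entries contribute nothing to the dot product
    if len(vec1) > len(vec2):
        vec1, vec2 = vec2, vec1
        norm1, norm2 = norm2, norm1
    dot = sum(w * vec2[term] for term, w in vec1.items() if term in vec2)
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

# Hypothetical TF-IDF dicts standing in for the 460 real documents
docs = [{"cat": 0.5, "dog": 0.5},
        {"cat": 0.5, "fish": 0.8},
        {"bird": 1.0}]
norms = precompute_norms(docs)
sims = [(i, j, sparse_cosine(docs[i], norms[i], docs[j], norms[j]))
        for i in range(len(docs)) for j in range(i + 1, len(docs))]
```

Since TF-IDF vectors over a large vocabulary are mostly zeros, each pairwise comparison now costs time proportional to the documents' distinct-word counts rather than the full vocabulary size. Dropping the `clear_output()`/`print(count)` call from every inner-loop iteration (or printing only every few thousand pairs) should also help noticeably, since per-iteration output is itself slow.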