My goal is to compute document similarity for a set of SEC filings that are updated every year. I started with word2vec and soft cosine similarity (method 1 in the code below). However, since only parts of each document are updated from year to year, I get very high similarity scores across the board.
I would like to work TF-IDF into my approach, to reduce the influence of the recurring boilerplate and put more weight on the parts of the documents that actually change. Methods 2 and 3 in the code both give reasonable-looking numbers.
Which is the correct approach? Or, more specifically, what does the tfidf argument in the similarity_matrix() call actually do?
Thanks in advance :)
import gensim
from gensim.utils import simple_preprocess
from gensim import corpora
from gensim.matutils import softcossim
import gensim.downloader as api
# load Google's pre-trained word2vec vectors (returns a KeyedVectors object)
word2vec = api.load('word2vec-google-news-300')
# prepare the inputs; each article_x contains a couple of paragraphs
documents = [article_1, article_2, article_3]
dictionary = corpora.Dictionary([simple_preprocess(doc) for doc in documents])
# plain bag-of-words vectors, one per document
sentences = [dictionary.doc2bow(simple_preprocess(doc)) for doc in documents]
# fit TF-IDF on the corpus and re-weight the bag-of-words vectors
tfidf_fit = gensim.models.TfidfModel(sentences)
sentences_tfidf = tfidf_fit[sentences]
# build the term similarity matrices from the word2vec vectors:
# one without term weighting and one passed the TF-IDF model via the tfidf argument
sim_mat_none = word2vec.similarity_matrix(dictionary, tfidf=None, threshold=0.0, exponent=2.0, nonzero_limit=100)
sim_mat_tfidf = word2vec.similarity_matrix(dictionary, tfidf=tfidf_fit, threshold=0.0, exponent=2.0, nonzero_limit=100)
# calculate soft cosine similarity between the first two filings
# method 1: plain bag-of-words vectors, unweighted term similarity matrix
softcossim(sentences[0], sentences[1], sim_mat_none)
# method 2: TF-IDF weighted vectors, TF-IDF informed term similarity matrix
softcossim(sentences_tfidf[0], sentences_tfidf[1], sim_mat_tfidf)
# method 3: TF-IDF weighted vectors, unweighted term similarity matrix
softcossim(sentences_tfidf[0], sentences_tfidf[1], sim_mat_none)
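
For reference, once I know which combination is right, I plan to apply it over every pair of filings. Below is a minimal sketch that continues the script above and uses method 2; pairwise_scores is just an illustrative name, and the article_x documents are placeholders for my actual yearly filings.

from itertools import combinations

# sketch: soft cosine score for every pair of filings, using the
# TF-IDF weighted vectors and the TF-IDF informed similarity matrix (method 2)
pairwise_scores = {}
for i, j in combinations(range(len(documents)), 2):
    pairwise_scores[(i, j)] = softcossim(sentences_tfidf[i],
                                         sentences_tfidf[j],
                                         sim_mat_tfidf)
print(pairwise_scores)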