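I am computing TF-IDF for a query string and a set of documents. I want to compute the cosine similarity and display a list of document IDs ordered from most relevant to least relevant to the query.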
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
## load the documents (around 200 txt) from path
cranInp=[]
path="D:\\Desktop\\try\\web"
for file in os.listdir(path):
    textdir = os.path.join(path, file)
    # read each document and close the file handle afterwards
    with open(textdir) as f:
        cranInp.append(f.read())
Vcount = TfidfVectorizer(analyzer='word', ngram_range=(1,1), stop_words = 'english')
countMatrix = Vcount.fit_transform(cranInp)
Query = "in summarizing theoretical and experimental work on the behaviour of a typical aircraft structure in a noise environment is it possible to develop a design procedure ."
queryVects = Vcount.transform([Query])  # transform expects an iterable of documents, not a bare string
k = 50
cosMattf = cosine_similarity(queryVects,countMatrix)
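How can I get a list of the top K (k = 50) documents, e.g. [12.txt, 34.txt, 89.txt, 90.txt .... 45.txt], a list of size 50 ordered from most relevant to least relevant? For example, 12.txt has the lowest cosine distance, so it is the document most relevant to the query.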
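To make the desired output concrete, here is a minimal sketch of the ranking step on its own. It assumes the similarity scores and the file names are available as two sequences in the same order; the helper name `top_k_documents` and the toy data are hypothetical and only stand in for `cosMattf[0]` and the names returned by `os.listdir(path)`. It uses plain Python sorting, so it is an illustration of the idea rather than the only way to do it.

```python
def top_k_documents(scores, names, k):
    """Return the k document names with the highest similarity, best first."""
    if len(scores) != len(names):
        raise ValueError("scores and names must have the same length")
    # Sort document indices by descending similarity score
    # (highest cosine similarity = lowest cosine distance).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [names[i] for i in order[:k]]


# Toy data standing in for cosMattf[0] and the document file names.
example_scores = [0.12, 0.87, 0.00, 0.45]
example_names = ["12.txt", "34.txt", "89.txt", "90.txt"]
print(top_k_documents(example_scores, example_names, k=2))
```

With the variables from the question, the call would presumably look like `top_k_documents(cosMattf[0], file_names, k)`, where `file_names` is a hypothetical list of file names collected in the same loop and the same order as `cranInp`. For large collections, `numpy.argsort` on the negated score row should give the same ordering.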