How do I display document IDs after computing TF-IDF and cosine_similarity? (Python)

Asked: 2016-12-06 06:20:24

Tags: python matrix scikit-learn nltk tf-idf

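I compute TF-IDF for a query string and a set of documents. I want to compute the cosine similarity and display a list of document IDs ordered from most relevant to least relevant for the query.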

import os

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Load the documents (around 200 .txt files) from path
cranInp = []
path = "D:\\Desktop\\try\\web"
for file in os.listdir(path):
    textdir = os.path.join(path, file)
    # Read each file and close it again right away
    with open(textdir) as f:
        cranInp.append(f.read())

# Build the TF-IDF matrix for the documents (one row per document)
Vcount = TfidfVectorizer(analyzer='word', ngram_range=(1,1), stop_words = 'english')
countMatrix = Vcount.fit_transform(cranInp)

Query = "in summarizing theoretical and experimental work on the behaviour of a typical aircraft structure in a noise environment is it possible to develop a design procedure ."
# transform() expects an iterable of strings, so wrap the query in a list
queryVects = Vcount.transform([Query])

k = 50
# Shape (1, number of documents): similarity of the query to each document
cosMattf = cosine_similarity(queryVects, countMatrix)

How can I get a list of the top K (k = 50) documents, for example [12.txt, 34.txt, 89.txt, 90.txt, ..., 45.txt], where the list has 50 entries?

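The list should run from most relevant to least relevant; for example, 12.txt has the lowest cosine distance, so it is the most relevant document for the query.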
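A minimal sketch of one way the ranking described above could be done, not a definitive solution. It assumes the file names are kept in a list in the same order as the texts passed to `fit_transform`, so that row `i` of the TF-IDF matrix corresponds to file name `i`. The `docs` dictionary, its contents, the short query, and `k = 2` are hypothetical stand-ins for the roughly 200 files on disk and `k = 50`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in for the files on disk: file name -> text.
docs = {
    "12.txt": "design procedure for aircraft structure in a noise environment",
    "34.txt": "experimental work on aircraft structure",
    "89.txt": "boundary layer flow over a flat plate",
    "90.txt": "heat transfer in hypersonic flow",
}
doc_ids = list(docs.keys())    # position i here matches row i of the matrix
texts = list(docs.values())

vectorizer = TfidfVectorizer(analyzer="word", ngram_range=(1, 1), stop_words="english")
doc_matrix = vectorizer.fit_transform(texts)

query = "aircraft structure in a noise environment design procedure"
query_vec = vectorizer.transform([query])   # note the list around the query

# cosine_similarity returns shape (1, n_docs); take the single row.
scores = cosine_similarity(query_vec, doc_matrix)[0]

k = 2  # would be 50 for the real collection
# argsort sorts ascending, so reverse it to put the highest similarity first.
top_idx = np.argsort(scores)[::-1][:k]
top_docs = [doc_ids[i] for i in top_idx]
print(top_docs)
```

Higher cosine similarity means a more relevant document; cosine distance is 1 minus the similarity, so the lowest distance and the highest similarity pick out the same document. In the loading loop from the question, the same effect could be had by appending each `file` name to a list alongside `cranInp`.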

0 answers:

No answers