Different results each time the gensim example is run

Time: 2018-06-14 02:53:16

Tags: python document similarity gensim cosine-similarity

Below is an example from gensim. Every time I run it, it prints different results, so I cannot trust that gensim works well.

from gensim import corpora, models, similarities
from collections import defaultdict

documents = ["Human machine interface for lab abc computer applications",          # 0
             "A survey of user opinion of computer system response time",          # 1
             "The EPS user interface management system",                           # 2
             "System and human system engineering testing of EPS",                 # 3
             "Relation of user perceived response time to error measurement",      # 4
             "The generation of random binary unordered trees",                    # 5
             "The intersection graph of paths in trees",                           # 6
             "Graph minors IV Widths of trees and well quasi ordering",            # 7 
             "Graph minors A survey"]                                              # 8


stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
texts = [[token for token in text if frequency[token] > 1]
         for text in texts]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)
index = similarities.MatrixSimilarity(lda[corpus])


doc = "Human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lda = lda[vec_bow]
sims = index[vec_lda]
sims = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims)

print(lda.get_document_topics(vec_bow))

Results (three separate runs):

[(0, 0.9986434), (4, 0.99792993), (2, 0.99722278), (3, 0.99651831), (1, 0.9958639), (5, 0.53059661), (6, 0.4146674), (8, 0.38019019), (7, 0.36143348)] [(0, 0.18366596), (1, 0.81633401)]

[(1, 0.999605), (4, 0.9981864), (0, 0.998689), (5, 0.62957084), (6, 0.48837978), (8, 0.45152202), (3, 0.4541581), (7, 0.41751832), (2, 0.40637407)] [(0, 0.80285221), (1, 0.19714773)]

[(7, 0.99957085), (8, 0.99660784), (0, 0.99202132), (5, 0.78449017), (6, 0.77530348), (2, 0.56972337), (3, 0.47117239), (4, 0.47092015), (1, 0.4172135)] [(0, 0.25292286), (1, 0.7707717)]

Document 7, which ranks first in the third run, does not look similar to "Human computer interaction". Thanks.
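For context on how the scores above arise: MatrixSimilarity ranks documents by cosine similarity between their topic vectors. The sketch below reproduces that calculation in plain Python, without gensim, using the query topic vector printed in the first run. The two document vectors (same_topic, other_topic) are hypothetical values chosen for illustration and are not taken from the post.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Topic vector of the query "Human computer interaction" from the first run.
query = [0.18366596, 0.81633401]

# Hypothetical document topic vectors (assumed for illustration only).
same_topic = [0.10, 0.90]   # dominated by the same topic as the query
other_topic = [0.90, 0.10]  # dominated by the other topic

print(cosine(query, same_topic))   # close to 1
print(cosine(query, other_topic))  # much lower
```

With only two topics, any document dominated by the same topic as the query scores close to 1, so the ranking depends almost entirely on which topic each document is assigned to. LdaModel starts from a random initialization, so on a nine-document corpus those assignments can change from run to run. LdaModel accepts a random_state argument, and fixing it is the usual way to make runs repeatable, although that alone may not make the topics meaningful on so little data.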

0 answers:

No answers