Below is a gensim example, but every time I run it, it shows different results, so I can't trust that gensim works well.
from gensim import corpora, models, similarities
from collections import defaultdict
documents = ["Human machine interface for lab abc computer applications",  # 0
             "A survey of user opinion of computer system response time",  # 1
             "The EPS user interface management system",  # 2
             "System and human system engineering testing of EPS",  # 3
             "Relation of user perceived response time to error measurement",  # 4
             "The generation of random binary unordered trees",  # 5
             "The intersection graph of paths in trees",  # 6
             "Graph minors IV Widths of trees and well quasi ordering",  # 7
             "Graph minors A survey"]  # 8
stoplist = set('for a of the and to in'.split())
# Lowercase, tokenize, and remove stopwords
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]
# Remove words that appear only once in the whole corpus
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
texts = [[token for token in text if frequency[token] > 1]
         for text in texts]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)
index = similarities.MatrixSimilarity(lda[corpus])
doc = "Human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lda = lda[vec_bow]
sims = index[vec_lda]
sims = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims)
print(lda.get_document_topics(vec_bow))
Results (three separate runs):
[(0, 0.9986434), (4, 0.99792993), (2, 0.99722278), (3, 0.99651831), (1, 0.9958639), (5, 0.53059661), (6, 0.4146674), (8, 0.38019019), (7, 0.36143348)] [(0, 0.18366596), (1, 0.81633401)]
[(1, 0.999605), (4, 0.9981864), (0, 0.998689), (5, 0.62957084), (6, 0.48837978), (8, 0.45152202), (3, 0.4541581), (7, 0.41751832), (2, 0.40637407)] [(0, 0.80285221), (1, 0.19714773)]
[(7, 0.99957085), (8, 0.99660784), (0, 0.99202132), (5, 0.78449017), (6, 0.77530348), (2, 0.56972337), (3, 0.47117239), (4, 0.47092015), (1, 0.4172135)] [(0, 0.25292286), (1, 0.7707717)]
In the third run the top match is document 7, which does not look similar to "Human computer interaction" at all. Thanks.
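To make the setup easier to reason about, here is a minimal sketch in plain Python (no gensim) that reproduces only the preprocessing step and shows which tokens survive it. The helper name `preprocess` and the variables `vocabulary` and `known` are made up for this illustration. The last step assumes that `dictionary.doc2bow` ignores words that are not in the dictionary.

```python
from collections import Counter

documents = [
    "Human machine interface for lab abc computer applications",      # 0
    "A survey of user opinion of computer system response time",      # 1
    "The EPS user interface management system",                       # 2
    "System and human system engineering testing of EPS",             # 3
    "Relation of user perceived response time to error measurement",  # 4
    "The generation of random binary unordered trees",                # 5
    "The intersection graph of paths in trees",                       # 6
    "Graph minors IV Widths of trees and well quasi ordering",        # 7
    "Graph minors A survey",                                          # 8
]
stoplist = set("for a of the and to in".split())


def preprocess(docs, stopwords):
    """Lowercase, drop stopwords, then drop tokens that occur only once overall."""
    tokenized = [[w for w in d.lower().split() if w not in stopwords] for d in docs]
    frequency = Counter(tok for text in tokenized for tok in text)
    return [[tok for tok in text if frequency[tok] > 1] for text in tokenized]


texts = preprocess(documents, stoplist)
vocabulary = sorted({tok for text in texts for tok in text})

# Only query words that are in the vocabulary can contribute to the query vector.
query = "Human computer interaction"
known = [w for w in query.lower().split() if w in vocabulary]

print(len(vocabulary), "vocabulary words:", vocabulary)
for i, text in enumerate(texts):
    print(i, text)
print("query tokens in the vocabulary:", known)
```

With so few documents and so small a vocabulary, the topic model has very little data to fit. If the run-to-run differences come from the random initialisation of LDA training, passing a fixed `random_state` to `models.LdaModel` is one thing to try in order to make the runs repeatable.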