My assignment looks like this:
import gensim
from sklearn.feature_extraction.text import CountVectorizer
newsgroup_data = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]
vect = CountVectorizer(stop_words='english',
token_pattern='(?u)\\b\\w\\w\\w+\\b')
X = vect.fit_transform(newsgroup_data)
corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)
id_map = dict((v, k) for k, v in vect.vocabulary_.items())
My task is to estimate the LDA model parameters on the corpus, finding 10 topics and a list of the 10 most important words in each topic. I did it like this:
ldamodel = gensim.models.ldamodel.LdaModel(corpus=corpus,
    id2word=id_map, num_topics=10, minimum_probability=0)
top10 = ldamodel.print_topics(num_topics=10, num_words=10)
This passes the autograder fine. The next task is to find the topic distribution of a new document, which I attempted like this:
new_doc = ["\n\nIt's my understanding that the freezing will start to occur because \
of the\ngrowing distance of Pluto and Charon from the Sun, due to it's\nelliptical orbit. \
It is not due to shadowing effects. \n\n\nPluto can shadow Charon, and vice-versa.\n\nGeorge \
Krumins\n-- "]
newX = vect.transform(new_doc)
newC = gensim.matutils.Sparse2Corpus(newX, documents_columns=False)
print(ldamodel.get_document_topics(newC))
But this just returns
gensim.interfaces.TransformedCorpus
I also saw this statement in the documentation: "You can then infer topic distributions on new, unseen documents, with >>> doc_lda = lda[doc_bow]", but I had no luck with that here either. Any help is appreciated.
Answer 0 (score: 1)
I kept digging, specifically into the gensim.interfaces.TransformedCorpus interface. As I understand it, that object wraps the topic distribution I asked for, but it is lazy: I need to iterate over it to see the values.
topic_dist = ldamodel.get_document_topics(newC)
td = []
for topic in topic_dist:
    td.append(topic)
td = td[0]
did the trick. The same result can also be obtained with
topic_dist = ldamodel[newC]