Gensim topic printing errors/issues

Time: 2013-03-07 00:24:45

Tags: python topic-modeling gensim

All,

This is a repost of something I wrote as a reply in this thread. I am getting some totally funky results when trying to print LSI topics in gensim. Here is my code:

try:
    from gensim import corpora, models
except ImportError as err:
    print(err)

class LSI:
    def topics(self, corpus):
        tfidf = models.TfidfModel(corpus)
        corpus_tfidf = tfidf[corpus]
        dictionary = corpora.Dictionary(corpus)
        lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=5)
        print(lsi.show_topics())

if __name__ == '__main__':
    data = '../data/data.txt'
    corpus = corpora.textcorpus.TextCorpus(data)
    LSI().topics(corpus)

This prints the following to the console.

-0.804*"(5, 1)" + -0.246*"(856, 1)" + -0.227*"(145, 1)" + ......

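I would like to be able to print out the topics the way @2er0 did over here, but I am getting results like these instead. Note that in the output above the second item in each term is a tuple, and I have no idea where it comes from. data.txt is a text file with several paragraphs in it. That is all.

Any thoughts on this would be fantastic! Adam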

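2 Answers:

Answer 0: (score: 4)

To answer why your LSI topics are tuples instead of words, check your input corpus.

Was it created from a list of documents that were converted into a corpus via corpus = [dictionary.doc2bow(text) for text in texts]?

Because if it wasn't, and you just read it in from a serialized corpus without also reading in the dictionary, then you won't get the words in the topic output.

My code below works, and prints out the topics with weighted words: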
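One way to see where labels such as "(5, 1)" can come from is a small plain-Python sketch. It is only a stand-in for the idea, not gensim's implementation: build_vocab and to_bow are hypothetical helpers, and it assumes (as the question's code appears to do) that the dictionary is built from documents that are already (id, count) pairs rather than lists of words.

```python
def build_vocab(docs):
    """Assign an integer id to each distinct token, in first-seen order."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(doc, token2id):
    """Convert one tokenized document into sorted (token_id, count) pairs."""
    counts = {}
    for token in doc:
        token_id = token2id[token]
        counts[token_id] = counts.get(token_id, 0) + 1
    return sorted(counts.items())


# Hypothetical toy data, tokenized into words.
texts = [["human", "computer", "interface"], ["computer", "survey", "computer"]]

# Intended order: vocabulary from the tokenized texts, then bag-of-words.
vocab = build_vocab(texts)
bow_corpus = [to_bow(text, vocab) for text in texts]

# Mistake: building a second vocabulary from the bag-of-words corpus.
# Its "tokens" are now the (id, count) pairs themselves.
wrong_vocab = build_vocab(bow_corpus)

good_labels = sorted(str(token) for token in vocab)
bad_labels = sorted(str(token) for token in wrong_vocab)
print(good_labels)
print(bad_labels)
```

If this is what is happening, the fix is to pass the model a dictionary that was built from the words, the same one that produced the bag-of-words corpus, rather than one built from the corpus itself.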

import gensim as gs

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

texts = [[word for word in document.lower().split()] for document in documents]
dictionary = gs.corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

tfidf = gs.models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]

lsi = gs.models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=5)
lsi.print_topics()

for i in lsi.print_topics():
    print(i)

The above outputs:

-0.331*"system" + -0.329*"a" + -0.329*"survey" + -0.241*"user" + -0.234*"minors" + -0.217*"opinion" + -0.215*"eps" + -0.212*"graph" + -0.205*"response" + -0.205*"time"
-0.330*"minors" + 0.313*"eps" + 0.301*"system" + -0.288*"graph" + -0.274*"a" + -0.274*"survey" + 0.268*"management" + 0.262*"interface" + 0.208*"human" + 0.189*"engineering"
0.282*"trees" + 0.267*"the" + 0.236*"in" + 0.236*"paths" + 0.236*"intersection" + -0.233*"time" + -0.233*"response" + 0.202*"generation" + 0.202*"unordered" + 0.202*"binary"
-0.247*"generation" + -0.247*"unordered" + -0.247*"random" + -0.247*"binary" + 0.219*"minors" + -0.214*"the" + -0.214*"to" + -0.214*"error" + -0.214*"perceived" + -0.214*"relation"
0.333*"machine" + 0.333*"for" + 0.333*"lab" + 0.333*"abc" + 0.333*"applications" + 0.258*"computer" + -0.214*"system" + -0.194*"eps" + -0.191*"and" + -0.188*"testing"

Answer 1: (score: 0)

It looks ugly, but this does the job (just a purely string-based approach):

# x = lsi.show_topics()
x = '-0.804*"(5, 1)" + -0.246*"(856, 1)" + -0.227*"(145, 1)"'

y = []
for term in x.strip().split(" + "):
    # Each term looks like -0.804*"(5, 1)": a weight, then a quoted pair.
    weight, pair = term.split("*")
    first, second = pair.split(",")
    y.append((weight, (first.lstrip('"('), second.strip().rstrip(')"'))))

for i in y:
    print(i)  # print each parsed item, not the whole list

The above outputs:

('-0.804', ('5', '1'))
('-0.246', ('856', '1'))
('-0.227', ('145', '1'))
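The same string-based idea can be sketched a little more compactly with a regular expression. This is only an illustration written for Python 3: TERM_RE and parse_topic are hypothetical names, and it assumes each term has the form weight*"label" as in the output shown in the question.

```python
import re

# Matches one term such as -0.804*"(5, 1)" and captures weight and label.
TERM_RE = re.compile(r'(-?\d+(?:\.\d+)?)\*"([^"]*)"')


def parse_topic(topic):
    """Split a topic string into a list of (weight, label) pairs."""
    return [(float(weight), label) for weight, label in TERM_RE.findall(topic)]


x = '-0.804*"(5, 1)" + -0.246*"(856, 1)" + -0.227*"(145, 1)"'
pairs = parse_topic(x)
for weight, label in pairs:
    print(weight, label)
```

Because the label is captured as whatever sits between the quotes, the same function also handles topics whose labels are real words, such as 0.313*"eps".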

If that doesn't help, you can try lsi.print_topic(i) instead of lsi.show_topics()

for i in range(len(lsi.show_topics())):
  print(lsi.print_topic(i))