Python gensim LDA: adding the topics back to the documents after fitting the model

Time: 2017-10-04 21:48:24

Tags: python gensim lda

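I am using gensim's LDA for topic modeling. I know how to convert the raw text into a corpus and obtain the topics. Once I have the topics, though, can I tag the original documents with the topic results or add the results back to them?

Here is my code: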

import pandas as pd
from gensim import corpora, models
from nltk.corpus import stopwords  # assumed: stopwords is NLTK's stop word corpus

# data_path and review_to_words() are not shown here
movie_reviews = pd.read_csv(data_path + 'movie_review.tsv', header=0, delimiter='\t', quoting=3)

# Tokenize each review and remove English stop words
stops = stopwords.words('english')
reviews = [review_to_words(review, stops=stops) for review in movie_reviews['review']]

# Build the dictionary and the bag-of-words corpus
dictionary = corpora.Dictionary(reviews)
corpus = [dictionary.doc2bow(review) for review in reviews]

# Weight the corpus with TF-IDF, then fit LDA on the weighted corpus
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
lda = models.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=10)
corpus_lda = lda[corpus_tfidf]
lda.print_topics(10)
[(0,
  u'0.001*ben + 0.001*sinatra + 0.001*santa + 0.001*henry + 0.001*band + 0.001*william + 0.001*fool + 0.001*tragic + 0.001*favourite + 0.001*bed'),
 (1,
  u'0.002*dentist + 0.002*homeless + 0.002*connery + 0.002*hawn + 0.002*judas + 0.002*bogus + 0.001*dickens + 0.001*hilarity + 0.001*snuff + 0.001*chong'),
 (2,
  u'0.002*freddy + 0.002*mst + 0.001*summary + 0.001*aliens + 0.001*fred + 0.001*broke + 0.001*express + 0.001*cube + 0.001*perfection + 0.001*struck'),
 (3,
  u'0.004*ned + 0.003*kidman + 0.002*nicole + 0.002*chuck + 0.002*hart + 0.002*sabrina + 0.002*miyazaki + 0.002*roberts + 0.002*amitabh + 0.001*educational'),
 (4,
  u'0.002*seagal + 0.002*buffy + 0.002*caprica + 0.002*stargate + 0.002*clown + 0.002*travolta + 0.001*bsg + 0.001*goat + 0.001*insomnia + 0.001*update'),
 (5,
  u'0.003*cinderella + 0.002*envy + 0.002*homicide + 0.002*sucker + 0.002*quantum + 0.002*stallone + 0.002*elvira + 0.002*walt + 0.002*lundgren + 0.001*boobs'),
 (6,
  u'0.002*pickford + 0.002*guaranteed + 0.002*swearing + 0.002*eleniak + 0.002*biko + 0.002*tremendously + 0.001*characterisation + 0.001*arnie + 0.001*radical + 0.001*generate'),
 (7,
  u'0.003*sandler + 0.002*dont + 0.002*buff + 0.002*ustinov + 0.002*brosnan + 0.001*amazon + 0.001*perry + 0.001*link + 0.001*maker + 0.001*adam'),
 (8,
  u'0.002*gandhi + 0.002*scarecrow + 0.002*frankie + 0.002*boxing + 0.002*creep + 0.002*worms + 0.002*mcqueen + 0.002*sellers + 0.002*duchovny + 0.002*appearances'),
 (9,
  u'0.002*sentinel + 0.002*scrooge + 0.002*che + 0.002*robots + 0.002*betty + 0.002*wtf + 0.002*redneck + 0.002*unexplained + 0.002*stiller + 0.002*groups')]

print(corpus_lda[0])
[(1, 0.032862717742657352), (2, 0.061544456899498043), (3, 0.17498689066920223), (5, 0.034931340026756269), (6, 0.01142214861116901), (7, 0.01368447078032208), (8, 0.014051012107502465), (9, 0.58954345105937356)]

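The last snippet shows the topic distribution for the first document (index 0). My question is: how do I convert each topic's weight into a numeric variable, one per topic?

Desired output, as a DataFrame: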

Document ID  Topic1   Topic2   Topic3.... 
0           0.032    0.062   0.175  

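As you can see, this is a DataFrame with the topics as column names and the weights as values.

Also, can I link these topic variables back to the original documents, i.e. movie_reviews?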

1 Answer:

Answer 0: (score: 1)

I ran into the same problem. A bit of searching turned up the following code:

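import gensim
import pandas as pd

# Pass the bag-of-words corpus the model was trained on, not the lda[...] output
all_topics = lda.get_document_topics(corpus_tfidf, minimum_probability=0.0)
all_topics_csc = gensim.matutils.corpus2csc(all_topics)  # topics x documents
all_topics_numpy = all_topics_csc.T.toarray()            # documents x topics
all_topics_df = pd.DataFrame(all_topics_numpy)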
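
The question also asks how to link the topic weights back to the original reviews. One way this might be done is sketched below, assuming the rows of all_topics_df are in the same order as the rows of movie_reviews, which holds when the corpus is built by iterating over the reviews in order. The two small DataFrames are made-up stand-ins so the sketch runs on its own; the names reviews_with_topics and dominant_topic are illustrative and do not come from gensim.

```python
import pandas as pd

# Stand-in for the original reviews; in the question this would be movie_reviews.
movie_reviews = pd.DataFrame({"review": ["first review text", "second review text"]})

# Stand-in for all_topics_df: one row per document, one column per topic id.
# The weights are invented for illustration only.
all_topics_df = pd.DataFrame([[0.1, 0.7, 0.2],
                              [0.6, 0.3, 0.1]])

# Give the columns readable names: Topic0, Topic1, ...
all_topics_df.columns = ["Topic{}".format(i) for i in all_topics_df.columns]

# Row i of the topic table describes row i of movie_reviews, so a positional
# concat is enough. Resetting the index makes that alignment explicit.
reviews_with_topics = pd.concat(
    [movie_reviews.reset_index(drop=True), all_topics_df], axis=1
)

# Optionally record the most probable topic for each document.
reviews_with_topics["dominant_topic"] = all_topics_df.idxmax(axis=1)
print(reviews_with_topics)
```

If movie_reviews has been filtered or reordered between building the corpus and this step, the positional alignment no longer holds, and an explicit document ID column would be the safer join key.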

References:

  1. Efficient transformation of gensim TransformedCorpus data to array

  2. How to get a complete topic distribution for a document using gensim LDA?

  3. https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/topic_methods.ipynb
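
A note on why minimum_probability=0.0 matters: the printed output for document 0 in the question has no entries for topics 0 and 4, because gensim omits topics whose probability falls below the model's minimum_probability threshold. If only such sparse lists are available, they can be padded by hand. The helper below is a minimal sketch in plain Python; densify is a hypothetical name, not a gensim function, and the weights are the question's values rounded to four decimals.

```python
def densify(sparse_doc, num_topics):
    """Turn [(topic_id, weight), ...] into a list with one weight per topic."""
    row = [0.0] * num_topics
    for topic_id, weight in sparse_doc:
        row[topic_id] = weight
    return row


# Document 0 from the question, rounded; topics 0 and 4 are absent.
doc0 = [(1, 0.0329), (2, 0.0615), (3, 0.1750), (5, 0.0349),
        (6, 0.0114), (7, 0.0137), (8, 0.0141), (9, 0.5895)]

dense = densify(doc0, num_topics=10)
print(dense)
```

Applied to every document, for example pd.DataFrame([densify(doc, 10) for doc in corpus_lda]), this should give one row per document and one column per topic, with 0.0 standing in for the topics gensim filtered out.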