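I'm using gensim's LDA for topic modeling. I know how to convert raw text into a corpus and obtain the topics. But once I have the topics, can I tag the original documents with the results, or add the results back to them?
Here is my code: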
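import pandas as pd
from gensim import corpora, models
from nltk.corpus import stopwords  # assumption: stopwords comes from NLTK

# data_path and review_to_words() are defined elsewhere (not shown here).
movie_reviews = pd.read_csv(data_path + 'movie_review.tsv', header=0, delimiter='\t', quoting=3)

# Clean and tokenize each review; build the stop-word list once, not per review.
stops = stopwords.words('english')
reviews = [review_to_words(review, stops=stops) for review in movie_reviews['review']]

# Map tokens to integer ids and convert each review to a bag-of-words vector.
dictionary = corpora.Dictionary(reviews)
corpus = [dictionary.doc2bow(review) for review in reviews]

# Re-weight the counts with TF-IDF, then fit a 10-topic LDA model on the result.
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
lda = models.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=10)
corpus_lda = lda[corpus_tfidf]
lda.print_topics(10)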
[(0,
u'0.001*ben + 0.001*sinatra + 0.001*santa + 0.001*henry + 0.001*band + 0.001*william + 0.001*fool + 0.001*tragic + 0.001*favourite + 0.001*bed'),
(1,
u'0.002*dentist + 0.002*homeless + 0.002*connery + 0.002*hawn + 0.002*judas + 0.002*bogus + 0.001*dickens + 0.001*hilarity + 0.001*snuff + 0.001*chong'),
(2,
u'0.002*freddy + 0.002*mst + 0.001*summary + 0.001*aliens + 0.001*fred + 0.001*broke + 0.001*express + 0.001*cube + 0.001*perfection + 0.001*struck'),
(3,
u'0.004*ned + 0.003*kidman + 0.002*nicole + 0.002*chuck + 0.002*hart + 0.002*sabrina + 0.002*miyazaki + 0.002*roberts + 0.002*amitabh + 0.001*educational'),
(4,
u'0.002*seagal + 0.002*buffy + 0.002*caprica + 0.002*stargate + 0.002*clown + 0.002*travolta + 0.001*bsg + 0.001*goat + 0.001*insomnia + 0.001*update'),
(5,
u'0.003*cinderella + 0.002*envy + 0.002*homicide + 0.002*sucker + 0.002*quantum + 0.002*stallone + 0.002*elvira + 0.002*walt + 0.002*lundgren + 0.001*boobs'),
(6,
u'0.002*pickford + 0.002*guaranteed + 0.002*swearing + 0.002*eleniak + 0.002*biko + 0.002*tremendously + 0.001*characterisation + 0.001*arnie + 0.001*radical + 0.001*generate'),
(7,
u'0.003*sandler + 0.002*dont + 0.002*buff + 0.002*ustinov + 0.002*brosnan + 0.001*amazon + 0.001*perry + 0.001*link + 0.001*maker + 0.001*adam'),
(8,
u'0.002*gandhi + 0.002*scarecrow + 0.002*frankie + 0.002*boxing + 0.002*creep + 0.002*worms + 0.002*mcqueen + 0.002*sellers + 0.002*duchovny + 0.002*appearances'),
(9,
u'0.002*sentinel + 0.002*scrooge + 0.002*che + 0.002*robots + 0.002*betty + 0.002*wtf + 0.002*redneck + 0.002*unexplained + 0.002*stiller + 0.002*groups')]
print(corpus_lda[0])
[(1, 0.032862717742657352), (2, 0.061544456899498043), (3, 0.17498689066920223), (5, 0.034931340026756269), (6, 0.01142214861116901), (7, 0.01368447078032208), (8, 0.014051012107502465), (9, 0.58954345105937356)]
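The last line shows the topic distribution for the first document. Now, my question is: how do I turn each topic's weight into a numeric variable, one per topic?
Desired output as a DataFrame:
Document ID   Topic1   Topic2   Topic3   ...
0             0.032    0.062    0.175
As you can see, this is a DataFrame with the topics as column names and the weights as values.
Also, can I link these topic variables back to the original documents, i.e. movie_reviews?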
Answer 0 (score: 1)
I ran into the same problem. A little searching turned up the following code:
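import gensim
import pandas as pd
# Use the question's names: lda is the fitted model, corpus_tfidf the corpus it was fitted on.
# minimum_probability=0.0 asks gensim to return low-weight topics as well.
all_topics = lda.get_document_topics(corpus_tfidf, minimum_probability=0.0)
all_topics_csr = gensim.matutils.corpus2csc(all_topics)  # sparse (CSC) matrix, topics x documents
all_topics_numpy = all_topics_csr.T.toarray()  # dense array, documents x topics
all_topics_df = pd.DataFrame(all_topics_numpy)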
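
In the question's printed output for the first document, topics 0 and 4 are missing. A likely reason (an assumption on my part, based on LdaModel's default minimum_probability of 0.01) is that gensim leaves out topics whose weight falls below the threshold. To get one column per topic, the sparse (topic_id, weight) list has to be expanded to a fixed length. Here is a minimal pure-Python sketch of that step; the helper name densify is made up for illustration and is not part of gensim:

```python
def densify(doc_topics, num_topics):
    """Expand a sparse [(topic_id, weight), ...] list into a fixed-length row.

    Topics that are absent from the sparse list get a weight of 0.0.
    """
    row = [0.0] * num_topics
    for topic_id, weight in doc_topics:
        row[topic_id] = weight
    return row


# Weights for the first document, rounded from the question's output.
doc0 = [(1, 0.0329), (2, 0.0615), (3, 0.1750), (5, 0.0349),
        (6, 0.0114), (7, 0.0137), (8, 0.0141), (9, 0.5895)]

row0 = densify(doc0, num_topics=10)
print(row0)
```

Applying the same function to every document gives a list of equal-length rows, which is roughly what the corpus2csc call followed by a transpose produces in one go.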
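
As for linking the topic weights back to the original reviews: the rows of the topic matrix come out in corpus order, which is the same as the row order of the original DataFrame, so the two can be aligned on the index. The sketch below uses a tiny stand-in DataFrame and hard-coded weights (both hypothetical, not taken from the question's data) to show the idea; the column names Topic0, Topic1, ... and dominant_topic are my own choice:

```python
import pandas as pd

# Stand-in for the original movie_reviews frame (hypothetical sample rows).
movie_reviews = pd.DataFrame({"review": ["first review text", "second review text"]})

# Stand-in for the dense document-by-topic matrix, one row per document.
topic_weights = [[0.1, 0.7, 0.2],
                 [0.6, 0.3, 0.1]]

# Name the columns after the topics and reuse the original frame's index.
topic_cols = ["Topic%d" % k for k in range(len(topic_weights[0]))]
topics_df = pd.DataFrame(topic_weights, columns=topic_cols, index=movie_reviews.index)

# Side-by-side concatenation aligns rows on the shared index.
reviews_with_topics = pd.concat([movie_reviews, topics_df], axis=1)

# Optional: record which topic has the largest weight for each review.
reviews_with_topics["dominant_topic"] = topics_df.idxmax(axis=1)
print(reviews_with_topics)
```

This only works if no reviews were dropped or reordered between reading the file and building the corpus. If some were, keep the original index alongside each document and use it when constructing topics_df.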