Calculating LDA topic weights for all documents in a corpus

Time: 2016-05-27 15:40:48

Tags: python lda gensim corpus

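I have trained my LDA model and retrieved my topics, and now I am looking for a way to compute the weight/percentage of each topic over the corpus. Surprisingly, I cannot find a way to do this. So far my code looks like: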

## Libraries to download
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim

## Tokenizing
tokenizer = RegexpTokenizer(r'\w+')

# create English stop words list
en_stop = stopwords.words('english')

# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()

import json
import nltk
import re
import pandas

appended_data = []

#for i in range(20014,2016):
#    df0 = pandas.DataFrame([json.loads(l) for l in open('SDM_%d.json' % i)])
#    appended_data.append(df0)

for i in range(2005,2016):
    if i > 2013:
        df0 = pandas.DataFrame([json.loads(l) for l in open('SDM_%d.json' % i)])
        appended_data.append(df0)
    df1 = pandas.DataFrame([json.loads(l) for l in open('Scot_%d.json' % i)])
    df2 = pandas.DataFrame([json.loads(l) for l in open('APJ_%d.json' % i)])
    df3 = pandas.DataFrame([json.loads(l) for l in open('TH500_%d.json' % i)])
    df4 = pandas.DataFrame([json.loads(l) for l in open('DRSM_%d.json' % i)])
    appended_data.append(df1)
    appended_data.append(df2)
    appended_data.append(df3)
    appended_data.append(df4)

appended_data = pandas.concat(appended_data)
# doc_set = df1.body
doc_set = appended_data.body

# list for tokenized documents in loop
texts = []

# loop through document list
for i in doc_set:
    # clean and tokenize document string
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)

    # remove stop words from tokens
    stopped_tokens = [i for i in tokens if not i in en_stop]

    # add tokens to list
    texts.append(stopped_tokens)

# turn our tokenized documents into a id <-> term dictionary
dictionary = corpora.Dictionary(texts)

# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]

# generate LDA model
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=15, id2word = dictionary, passes=50)
ldamodel.save("model.lda0")

I tried an approach I had seen on other forums, but I received an error in cluster 2. Any idea why?

1 answer:

Answer 0: (score: 4)

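You need to set minimum_probability to zero when constructing the LdaModel. By default gensim omits any topic whose probability for a document falls below minimum_probability (0.01 by default), so some topics can be missing from a document's result: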

ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=15, id2word = dictionary, passes=50, minimum_probability=0)

In addition, you can get the topic distribution for every article as follows:

for i in range(len(doc_set)):
    print(ldamodel[corpus[i]])
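To get from these per-document distributions to one weight per topic for the whole corpus, one option is to average them. The sketch below is an assumption about what "weight/percentage of each topic" means: an unweighted mean in which every document counts equally (weighting by document length would be another reasonable choice). The helper name corpus_topic_weights and the hand-written example data are hypothetical. The input has the same shape as what ldamodel[corpus[i]] returns, a list of (topic_id, probability) pairs.

```python
def corpus_topic_weights(doc_topics, num_topics):
    """Average per-document topic distributions into corpus-level weights.

    doc_topics: iterable of lists of (topic_id, probability) pairs,
    the shape returned by ldamodel[bow] in gensim.
    """
    totals = [0.0] * num_topics
    n_docs = 0
    for topics in doc_topics:
        for topic_id, prob in topics:
            totals[topic_id] += prob
        n_docs += 1
    if n_docs == 0:
        raise ValueError("doc_topics is empty")
    # Unweighted mean: each document contributes equally.
    return [total / n_docs for total in totals]


# Hand-written stand-in (assumption) for [ldamodel[bow] for bow in corpus]
# with 3 topics and 2 documents.
example = [
    [(0, 0.5), (1, 0.25), (2, 0.25)],
    [(0, 0.0), (1, 0.5), (2, 0.5)],
]
weights = corpus_topic_weights(example, num_topics=3)
print(weights)  # [0.25, 0.375, 0.375]
```

With the real model this would be called as corpus_topic_weights((ldamodel[bow] for bow in corpus), num_topics=15). The weights sum to 1 only if no topics were filtered out, which is why minimum_probability=0 matters here.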