Unable to classify topics with a trained LDA model

Date: 2019-09-02 07:33:40

Tags: python nlp lda

I created an LDA model using Gensim. I first iterated num_topics from 3 to 10 and then, based on the pyLDAvis plots, chose n = 3 for the final LDA model.

Now that I have the trained model, I want to know how to use it to assign topics to the documents it was trained on, as well as to new, unseen documents.

I am doing this with the code below, but I get an error:

import glob
import sys

# Project-specific preprocessing module
sys.path.append('/Users/tcssig/Documents/NLP_code_base/Doc_Similarity')
import normalization

from gensim.models.coherencemodel import CoherenceModel

# Read every speech file and normalize it into tokenized form
datalist = []
for filename in glob.iglob('/Users/tcssig/Documents/Speech_text_files/*.*'):
    text = open(filename).readlines()
    text = normalization.normalize_corpus(text, only_text_chars=True, tokenize=True)
    datalist.append(text)

# normalize_corpus returns a list per file; keep the first tokenized entry of each
datalist = [doc[0] for doc in datalist]

import re
import warnings
from collections import Counter, OrderedDict

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pyLDAvis.gensim
from gensim import models, corpora
from tqdm import tqdm

# Build the dictionary and the bag-of-words document-term matrix
dictionary = corpora.Dictionary(datalist)
doc_term_matrix = [dictionary.doc2bow(doc) for doc in datalist]

Lda = models.LdaMulticore

# Train one model per candidate topic count, record both coherence measures,
# and save a pyLDAvis visualization for each model
coherenceList_umass = []
coherenceList_cv = []
num_topics_list = np.arange(3, 10)
for num_topics in tqdm(num_topics_list):
    lda = Lda(doc_term_matrix, num_topics=num_topics, id2word=dictionary,
              passes=20, chunksize=4000, random_state=43)
    cm = CoherenceModel(model=lda, corpus=doc_term_matrix,
                        dictionary=dictionary, coherence='u_mass')
    coherenceList_umass.append(cm.get_coherence())
    cm_cv = CoherenceModel(model=lda, corpus=doc_term_matrix, texts=datalist,
                           dictionary=dictionary, coherence='c_v')
    coherenceList_cv.append(cm_cv.get_coherence())
    vis = pyLDAvis.gensim.prepare(lda, doc_term_matrix, dictionary)
    pyLDAvis.save_html(vis, 'pyLDAvis_%d.html' % num_topics)


# Plot u_mass coherence against the number of topics
plotData = pd.DataFrame({'Number of topics': num_topics_list,
                         'CoherenceScore': coherenceList_umass})
f, ax = plt.subplots(figsize=(10, 6))
sns.set_style("darkgrid")
sns.pointplot(x='Number of topics', y='CoherenceScore', data=plotData)
plt.axhline(y=-3.9)
plt.title('Topic coherence')
plt.savefig('Topic coherence plot.png')

#################################################################
#################################################################

# Train the final model with the chosen number of topics and persist
# the model, the dictionary and the corpus
lda_final = Lda(doc_term_matrix, num_topics=3, id2word=dictionary,
                passes=20, chunksize=4000, random_state=43)

lda_final.save('lda_final')
dictionary.save('dictionary')
corpora.MmCorpus.serialize('doc_term_matrix.mm', doc_term_matrix)


# Topic words from show_topics (indexed by topic id) and
# coherence-ranked topics from top_topics
a = lda_final.show_topics(num_topics=3, formatted=False, num_words=10)
b = lda_final.top_topics(doc_term_matrix, dictionary=dictionary, topn=10)

topic2wordb = {}
topic2csb = {}
topic2worda = {}
topic2csa = {}
num_topics = lda_final.num_topics
cnt = 1

# top_topics returns ([(prob, word), ...], coherence) pairs
for ws in b:
    wset = set(w[1] for w in ws[0])
    topic2wordb[cnt] = wset
    topic2csb[cnt] = ws[1]
    cnt += 1

# show_topics returns (topic_id, [(word, prob), ...]) pairs
for ws in a:
    wset = set(w[0] for w in ws[1])
    topic2worda[ws[0] + 1] = wset

# Match each show_topics topic to its top_topics entry by comparing
# word sets, so every topic id gets its coherence score
for i in range(1, num_topics + 1):
    for j in range(1, num_topics + 1):
        if topic2worda[i].intersection(topic2wordb[j]) == topic2worda[i]:
            topic2csa[i] = topic2csb[j]

print('the final data block')
finalData = pd.DataFrame([], columns=['Topic', 'words'])
finalData['Topic'] = topic2worda.keys()
finalData['Topic'] = finalData['Topic'].apply(lambda x: 'Topic' + str(x))
finalData['words'] = topic2worda.values()
finalData['cs'] = topic2csa.values()
finalData.sort_values(by='cs', ascending=False, inplace=True)
finalData.to_csv('CoherenceScore.csv')
print(finalData)

The error occurs when running the following code:

unseen_document = 'How a Pentagon deal became an identity crisis for Google'

# Normalize the unseen document and convert it to a bag-of-words vector
text = normalization.normalize_corpus(unseen_document, only_text_chars=True, tokenize=True)
bow_vector = dictionary.doc2bow(text)
corpora.MmCorpus.serialize('x.bow_vector', bow_vector)

corpus = [dictionary.doc2bow(text)]
x = lda_final[corpus]

1 Answer:

Answer 0 (score: 0)

On this line:

corpus = [dictionary.doc2bow(text)]

you are creating a list of BOW vectors. You need to look up the individual vectors, not the list itself, e.g.

for v in corpus:
    print(lda_final[v])

will print the topic probability distribution for each document.
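For a single new document, the same lookup can also be done with get_document_topics. Below is a minimal sketch, assuming the model, dictionary and corpus were saved under the names used in the question ('lda_final', 'dictionary', 'doc_term_matrix.mm'); plain whitespace tokenization stands in here for the question's normalization module:

from gensim import corpora, models

# Reload the artifacts saved during training (names from the question)
lda_final = models.LdaMulticore.load('lda_final')
dictionary = corpora.Dictionary.load('dictionary')

unseen_document = 'How a Pentagon deal became an identity crisis for Google'

# Stand-in tokenization; the question uses normalization.normalize_corpus here
tokens = unseen_document.lower().split()
bow_vector = dictionary.doc2bow(tokens)  # one BOW vector, not a list of vectors

# get_document_topics returns (topic_id, probability) pairs for one document
topic_probs = lda_final.get_document_topics(bow_vector)
print(topic_probs)

# The dominant topic is the one with the highest probability
dominant_topic, prob = max(topic_probs, key=lambda tp: tp[1])
print('Dominant topic: %d (p=%.3f)' % (dominant_topic, prob))

The training documents can be scored the same way by iterating over the serialized corpus, e.g. for bow in corpora.MmCorpus('doc_term_matrix.mm'): print(lda_final.get_document_topics(bow)).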

See the gensim docs.