Predicting new text with a Python Latent Dirichlet Allocation (LDA) model

Posted: 2018-01-17 10:17:34

Tags: python predict lda topic-modeling

I am using the lda package to model topics for a large number of text documents. A simplified (!) example of my code (I have removed all the other cleaning steps, lemmatization, bigrams, etc.) is shown below, and I am happy with the results so far. But now I am struggling to write code to predict topics for a new text. I could not find any reference to save/load/predict options in the lda documentation. I could add the new text to my corpus and fit the model again, but that is an expensive way to do it.

I know I could do this with gensim, but somehow the results of the gensim model were less impressive, so I am sticking with my original lda model.

Any suggestions are welcome!

My code:

import lda
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import nltk
from nltk.corpus import stopwords
stops = set(stopwords.words('english'))  # nltk stopwords list

documents = ["Liz Dawn: Coronation Street's Vera Duckworth dies at 77",
            'Game of Thrones stars Kit Harington and Rose Leslie to wed',
            'Tony Booth: Till Death Us Do Part actor dies at 85',
            'The Child in Time: Mixed reaction to Benedict Cumberbatch drama',
            "Alanna Baker: The Cirque du Soleil star who 'ran off with the circus'",
            'How long can The Apprentice keep going?',
            'Strictly Come Dancing beats X Factor for Saturday viewers',
            "Joe Sugg: 8 things to know about one of YouTube's biggest stars",
            'Sir Terry Wogan named greatest BBC radio presenter',
            'DJs celebrate 50 years of Radio 1 and 2']

clean_docs = []
for doc in documents:
    # set all to lower case and tokenize
    tokens = nltk.tokenize.word_tokenize(doc.lower())
    # remove stop words
    texts = [i for i in tokens if i not in stops]
    clean_docs.append(texts)

# join back all tokens to create a list of docs
docs_vect = [' '.join(txt) for txt in clean_docs]

cvectorizer = CountVectorizer(max_features=10000, stop_words=list(stops))  # sklearn expects a list, not a set
cvz = cvectorizer.fit_transform(docs_vect)

n_topics = 3
n_iter = 2000
lda_model = lda.LDA(n_topics=n_topics, n_iter=n_iter)
X_topics = lda_model.fit_transform(cvz)

n_top_words = 3
topic_summaries = []

topic_word = lda_model.topic_word_  # get the topic words
vocab = cvectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    topic_summaries.append(' '.join(topic_words))
    print('Topic {}: {}'.format(i+1, ' '.join(topic_words)))

# How to predict a new document?
new_text = '50 facts about Radio 1 & 2 as they turn 50'

0 Answers:

No answers yet