I'm working on a project in which I want to use Latent Dirichlet Allocation to extract topics from a large number of articles.
My code is:
import gensim
import csv
import json
import glob
from gensim import corpora, models
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from time import gmtime, strftime
tokenizer = RegexpTokenizer(r'\w+')
cachedStopWords = set(stopwords.words("english"))
body = []
processed = []
with open('/…/file.json') as j:
    data = json.load(j)
for i in range(0,len(data)):
    body.append(data[i]['text'].lower())
for entry in body:
    row = tokenizer.tokenize(entry)
    processed.append([word for word in row if word not in cachedStopWords])
dictionary = corpora.Dictionary(processed)
corpus = [dictionary.doc2bow(text) for text in processed]
lda = gensim.models.ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=50, update_every=1, passes=1)
topics = lda.show_topics(num_topics=50, num_words=8)
other_doc = "After being jailed for life in 1964, Nelson Mandela became a worldwide symbol of resistance to apartheid. But his opposition to racism began many years before."
print lda[other_doc]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 714, in __getitem__
    gamma, _ = self.inference([bow])
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 361, in inference
    ids = [id for id, _ in doc]
ValueError: need more than 1 value to unpack
I also tried to use LdaMulticore in three different ways:
lda = gensim.models.LdaMulticore(corpus, id2word=dictionary, num_topics=100, workers=3)
lda = gensim.models.ldamodel.LdaMulticore(corpus, id2word=dictionary, num_topics=100, workers=3)
lda = models.LdaMulticore(corpus, id2word=dictionary, num_topics=100, workers=3)
Each time I get this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'LdaMulticore'
Any ideas?
Thanks in advance.
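Answer 0 (score: 3)
You have to convert the query into the model's vector space first. lda[...] expects a bag-of-words vector (the list of (token_id, count) pairs returned by dictionary.doc2bow), not a raw string. Ideally, tokenize the query the same way as the training documents (the question uses tokenizer.tokenize plus stop-word removal); a plain split() leaves punctuation attached to words, so some tokens will not match the dictionary.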
http://radimrehurek.com/gensim/tut3.html#similarity-interface
vec_bow = dictionary.doc2bow(other_doc.lower().split())
vec_lda = lda[vec_bow] # convert the query to LDA topic space
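To see why the raw string triggers the ValueError, here is a minimal sketch that does not use gensim at all. It is written for Python 3, and toy_doc2bow, ids_of, and the three-word vocabulary are hypothetical stand-ins made up for illustration. ids_of only copies the unpacking expression shown in the traceback.

```python
import re


def toy_doc2bow(tokens, token2id):
    """Hypothetical stand-in for Dictionary.doc2bow: count the known tokens."""
    counts = {}
    for token in tokens:
        if token in token2id:
            token_id = token2id[token]
            counts[token_id] = counts.get(token_id, 0) + 1
    return sorted(counts.items())


def ids_of(doc):
    """Same unpacking as the line in the traceback: [id for id, _ in doc]."""
    return [token_id for token_id, _ in doc]


# Made-up vocabulary; a real one comes from corpora.Dictionary(processed).
token2id = {"mandela": 0, "apartheid": 1, "racism": 2}
query = "Mandela opposed apartheid and racism; Mandela was jailed."

try:
    # Iterating a string yields single characters, which cannot be
    # unpacked into an (id, count) pair.
    ids_of(query)
    raw_string_ok = True
except ValueError:
    raw_string_ok = False

# Tokenize with the same \w+ pattern the question's RegexpTokenizer uses.
tokens = re.findall(r"\w+", query.lower())
bow = toy_doc2bow(tokens, token2id)

print(raw_string_ok)
print(bow)
print(ids_of(bow))
```

With the real model, passing dictionary.doc2bow(tokens) to lda[...] should return a list of (topic_id, probability) pairs instead of raising.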
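Answer 1 (score: 0)
I realize this is old, but I ran into the same problem. You are probably pointing at an older version of Gensim. Make sure you are using version >= 0.10.2.
Update with "easy_install -U gensim", then make sure your IDE sees the updated library.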
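One way to check which Gensim installation the interpreter actually picks up is sketched below, for Python 3. The helpers version_tuple, has_multicore, and describe_gensim are hypothetical names introduced here, and the 0.10.2 threshold is simply the one quoted in this answer. gensim.__version__ and gensim.__file__ are standard module attributes.

```python
import re

# Minimum version taken from the answer above, not independently verified.
MINIMUM = (0, 10, 2)


def version_tuple(version):
    """Turn '0.10.2' or '0.10.2rc1' into (0, 10, 2) for comparison."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def has_multicore(version):
    """True if the version string is at least the quoted minimum."""
    return version_tuple(version) >= MINIMUM


def describe_gensim():
    """Print the version and location of the gensim that gets imported."""
    try:
        import gensim
    except ImportError:
        print("gensim is not importable from this interpreter")
        return
    print("version:", gensim.__version__)
    print("loaded from:", gensim.__file__)
    print("new enough:", has_multicore(gensim.__version__))
    print("LdaMulticore present:", hasattr(gensim.models, "LdaMulticore"))
```

Calling describe_gensim() from both the terminal and the IDE shows whether the two are using the same installation; a different "loaded from" path means the IDE still sees the old library.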