When I run a Gensim LdaMallet model on my full corpus of roughly 16 million documents, I get a CalledProcessError with "non-zero exit status 1". Interestingly, if I run the exact same code on a test corpus of about 160,000 documents, it runs just fine. Since it works on my small corpus, I'm inclined to think the code is fine, but I'm not sure what else would/could be causing this error...
I tried editing the mallet.bat file as suggested here, to no avail. I also double-checked the paths, but given that it works on the smaller corpus, that isn't the problem.
import os
import gensim
from gensim import corpora

id2word = corpora.Dictionary(lists_of_words)
corpus = [id2word.doc2bow(doc) for doc in lists_of_words]
num_topics = 30
os.environ.update({'MALLET_HOME':r'C:/mallet-2.0.8/'})
mallet_path = r'C:/mallet-2.0.8/bin/mallet'
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics, id2word=id2word)
Here is the full traceback and error:
File "<ipython-input-57-f0e794e174a6>", line 8, in <module>
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics, id2word=id2word)
File "C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\wrappers\ldamallet.py", line 132, in __init__
self.train(corpus)
File "C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\wrappers\ldamallet.py", line 273, in train
self.convert_input(corpus, infer=False)
File "C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\wrappers\ldamallet.py", line 262, in convert_input
check_output(args=cmd, shell=True)
File "C:\ProgramData\Anaconda3\lib\site-packages\gensim\utils.py", line 1918, in check_output
raise error
CalledProcessError: Command 'C:/mallet-2.0.8/bin/mallet import-file --preserve-case --keep-sequence --remove-stopwords --token-regex "\S+" --input C:\Users\user\AppData\Local\Temp\2\e1ba4a_corpus.txt --output C:\Users\user\AppData\Local\Temp\2\e1ba4a_corpus.mallet' returned non-zero exit status 1.
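One way to see what Mallet actually printed before it exited (a generic debugging sketch, not something from the traceback itself; substitute the real import-file command from the error message when debugging for real) is to rerun the command with subprocess and inspect stderr, which gensim's check_output call does not surface:

```python
import subprocess
import sys

# stand-in for the failing "mallet import-file ..." command from the traceback;
# replace this list with the real command and its arguments to debug for real
cmd = [sys.executable, "-c",
       "import sys; sys.stderr.write('java error here'); sys.exit(1)"]

result = subprocess.run(cmd, capture_output=True, text=True)
print(result.returncode)  # the same non-zero exit status the wrapper reported
print(result.stderr)      # the underlying error message (often a Java/heap error for Mallet)
```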
Answer (score: 1)
Glad you found my post, and sorry it didn't work for you. There are several reasons I've run into that error, mainly Java not being installed properly and the path not picking up the environment variable.
Since your code runs on the smaller dataset, I would look at your data first. Mallet is picky in that it only accepts the cleanest data, so documents that may contain nulls, punctuation, or floats can break it.
Did you take a sample for your dictionary, or did you pass in the entire dataset?
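As a minimal sketch of the kind of cleanup Mallet expects (assuming your raw rows live in a list, here called raw_docs, a made-up name), dropping non-string rows and stripping punctuation before building the dictionary could look like:

```python
import re

# hypothetical raw input: note the None and the float, which trip Mallet up
raw_docs = ["Payment received for account!", None, 3.14,
            "Assistance requested, account locked."]

cleaned = []
for doc in raw_docs:
    if not isinstance(doc, str):         # drop nulls, floats, and other non-text rows
        continue
    doc = re.sub(r"[^\w\s]", " ", doc)   # replace punctuation with spaces
    cleaned.append(doc.lower().split())  # lowercase and tokenize

print(cleaned)
```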
This is basically what it does: it turns sentences into words, words into numbers, and then counts them by frequency:
[(3,1), (13,1), (37,1)]
Word 3 ("assistance") appears 1 time. Word 13 ("payment") appears 1 time. Word 37 ("account") appears 1 time.
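That mapping can be sketched without gensim at all; in this toy version the ids are assigned in order of first appearance, so they won't match the 3/13/37 above:

```python
docs = [["assistance", "payment", "account", "payment"]]

# assign each distinct word an integer id in order of first appearance
token2id = {}
for doc in docs:
    for word in doc:
        token2id.setdefault(word, len(token2id))

# each document becomes a list of (word_id, count) pairs
bow = []
for doc in docs:
    counts = {}
    for word in doc:
        wid = token2id[word]
        counts[wid] = counts.get(wid, 0) + 1
    bow.append(sorted(counts.items()))

print(bow)  # [[(0, 1), (1, 2), (2, 1)]]
```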
Your LDA then takes a word and scores it against how often every other word in the dictionary appears, and it does this across the entire dictionary, so if you let it look at hundreds of thousands of words it will bog down quickly.
Here is how I implemented Mallet and shrank my dictionary, excluding stemming and other preprocessing steps:
# we create a dictionary of all the words in the csv by iterating through
# the processed docs; it contains the number of times a word appears
# in the training set
dictionary = gensim.corpora.Dictionary(processed_docs)
# preview the first few (id, word) pairs
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break
# we want to throw out words that are so frequent that they tell us little about the
# topic (appearing in more than half the documents), as well as words that appear in
# fewer than 15 documents, then keep only the 100,000 most frequent of what remains
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
# the words become numbers and are then counted for frequency
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

# preview the bag of words for a random row, e.g. row 4310:
# it has 27 words, and the word indexed 2 shows up 4 times
bow_corpus[4310]
os.environ['MALLET_HOME'] = 'C:\\mallet\\mallet-2.0.8'
mallet_path = 'C:\\mallet\\mallet-2.0.8\\bin\\mallet'
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=bow_corpus, num_topics=20, alpha=.1,
                                             id2word=dictionary, iterations=1000, random_seed=569356958)
I also separated your ldamallet call into its own cell, because compile time is slow, especially for a dataset of that size. I hope this helps; let me know if you still get the error :)